* Cache Invalidation Solution for Nested IOMMU
@ 2023-04-03  0:33 Nicolin Chen
  2023-04-03  7:26 ` Liu, Yi L
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-03  0:33 UTC (permalink / raw)
  To: Robin Murphy, jgg, kevin.tian
  Cc: yi.l.liu, eric.auger, baolu.lu, shameerali.kolothum.thodi,
	jean-philippe, iommu

Hi all,

Per discussion in the nested SMMUv3 series[1], we have come up
with a couple of ideas to accelerate the invalidation uAPI.

I have drafted two solutions and would like to collect your
input, so we can decide which approach to take in the next
version.

The first solution simply forwards each command in its entirety.
Compared to the data structure in v1[2], this saves a few CPU
cycles by avoiding the packing and unpacking of the invalidation
fields of each command.

[User Data]
struct iommu_hwpt_invalidate_arm_smmuv3 {
	__u64 cmd[2];
	__u32 error;
	__u32 __reserved;
};

[Driver Handler]
// This draft relies on set_rid/unset_rid from the 2nd draft to
// sanity-check commands and report errors.
static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
					   void *user_data)
{
	struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct arm_smmu_device *smmu = smmu_domain->smmu;
	u64 cmd[CMDQ_ENT_DWORDS];
 
	if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
		return;
 
	cmd[0] = inv_info->cmd[0];
	cmd[1] = inv_info->cmd[1];
	switch (cmd[0] & CMDQ_0_OP) {
	case CMDQ_OP_TLBI_NSNH_ALL:
		cmd[0] &= ~CMDQ_0_OP;
		cmd[0] |= CMDQ_OP_TLBI_NH_ALL;
		fallthrough;
	case CMDQ_OP_TLBI_NH_VA:
	case CMDQ_OP_TLBI_NH_VAA:
	case CMDQ_OP_TLBI_NH_ALL:
	case CMDQ_OP_TLBI_NH_ASID:
		cmd[0] &= ~CMDQ_TLBI_0_VMID;
		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID,
				     smmu_domain->s2->s2_cfg.vmid);
		arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
		break;
	case CMDQ_OP_CFGI_CD:
	case CMDQ_OP_CFGI_CD_ALL:
		arm_smmu_sync_cd(smmu_domain,
				 FIELD_GET(CMDQ_CFGI_0_SSID, cmd[0]), false);
		break;
	default:
		return;
	}
}
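
To illustrate the expected flow, here is a minimal sketch of the
VMM side (the IOMMU_HWPT_INVALIDATE ioctl and the
iommu_hwpt_invalidate_cmd wrapper are made-up names for
illustration only, not the actual uAPI):

[User Sketch]
struct iommu_hwpt_invalidate_cmd {
	__u32 size;	/* sizeof(struct iommu_hwpt_invalidate_cmd) */
	__u32 hwpt_id;	/* nested S1 HWPT object ID */
	struct iommu_hwpt_invalidate_arm_smmuv3 data;
};

/* Called by the vSMMU model for each trapped CMDQ entry */
static int vsmmu_forward_cmd(int iommufd, __u32 hwpt_id, __u64 *ent)
{
	struct iommu_hwpt_invalidate_cmd cmd = {
		.size = sizeof(cmd),
		.hwpt_id = hwpt_id,
		.data = { .cmd = { ent[0], ent[1] } },
	};

	if (ioctl(iommufd, IOMMU_HWPT_INVALIDATE, &cmd))
		return -errno;
	/* cmd.data.error reports a per-command failure synchronously */
	return cmd.data.error ? -EIO : 0;
}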


The second solution goes a step further to support batching, by
sharing a kernel page via mmap().

Firstly, I drafted a pair of new ioctls and iommu_ops callbacks to
link vSID with pSID. This seems to be useful for Intel VT-d also.
https://github.com/nicolinc/iommufd/commit/cbec01245edc70e5de88fbc477d8f0af50b3b0ed

Then I added a new mmap interface to share kernel page(s) from
the driver, allowing QEMU to write all TLBI commands as a single
batch and then initiate the batch invalidation via one
synchronous hypercall.
https://github.com/nicolinc/iommufd/commit/ee717eb6d46b5285db1aae9172ecdfc70b9cd9ca
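
As a rough sketch of the QEMU side (iommu_hwpt_invalidate_batch,
IOMMU_HWPT_INVALIDATE_BATCH and the mmap offset encoding are
made-up placeholders; page_size/guest_cmds/ncmds are assumed from
the trap context):

	struct iommu_hwpt_invalidate_batch arg = { .hwpt_id = hwpt_id };
	__u64 *batch;
	int i;

	/* Map the kernel page that the driver shares for this S1 HWPT */
	batch = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		     iommufd, (off_t)hwpt_id * page_size);

	/* Write each trapped guest TLBI command (two dwords) into the page */
	for (i = 0; i < ncmds; i++) {
		batch[i * 2] = guest_cmds[i * 2];
		batch[i * 2 + 1] = guest_cmds[i * 2 + 1];
	}
	arg.ncmds = ncmds;

	/* One synchronous ioctl then flushes the whole batch */
	ioctl(iommufd, IOMMU_HWPT_INVALIDATE_BATCH, &arg);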

The two temporary changes are available in this repo:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04022023
This solution certainly needs some rework before it can be posted
in the next version for review. Still, I'd like to know whether it
is a good direction for us to adopt in v2.

The new set_rid/unset_rid ioctls and the mmap interface would be
essential for the VCMDQ support that we'd like to achieve at the
end of this journey. So, personally, I'd like to see them used at
this stage by the generic SMMUv3 (and potentially VT-d) too.
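
For reference, the vSID linkage could look roughly like this (the
struct and field names are illustrative only; the actual draft is
in the first commit above):

struct iommu_device_set_rid {
	__u32 size;
	__u32 dev_id;	/* iommufd device object; pSID is derived from it */
	__u64 vsid;	/* guest StreamID seen in trapped commands */
};

The driver would keep a per-SMMU vSID->pSID lookup built from
these calls, and use it to validate and fix the SID field of
trapped ATC_INV and CFGI commands.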

Thanks
Nicolin

[1] https://lore.kernel.org/linux-iommu/ZBe3kxRXf+VbKy+m@Asurada-Nvidia/
[2] https://lore.kernel.org/linux-iommu/364cfbe5b228ab178093db2de13fa3accf7a6120.1678348754.git.nicolinc@nvidia.com/


* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  0:33 Cache Invalidation Solution for Nested IOMMU Nicolin Chen
@ 2023-04-03  7:26 ` Liu, Yi L
  2023-04-03  8:39   ` Tian, Kevin
  2023-04-03 12:23   ` Jason Gunthorpe
  2023-04-03  8:00 ` Tian, Kevin
  2023-04-03 14:08 ` Jason Gunthorpe
  2 siblings, 2 replies; 35+ messages in thread
From: Liu, Yi L @ 2023-04-03  7:26 UTC (permalink / raw)
  To: Nicolin Chen, Robin Murphy, jgg, Tian, Kevin
  Cc: eric.auger, baolu.lu, shameerali.kolothum.thodi, jean-philippe,
	iommu, Jason Gunthorpe, peterx

Hi Nic,

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 8:34 AM
> Hi all,
> 
> Per discussion in the nested SMMUv3 series[1], we have come up
> with a couple of ideas to accelerate the invalidation uAPI.
> 
> I have drafted two solutions and would like to collect your
> input, so we can decide which approach to take in the next
> version.
> 
> The first solution simply forwards each command in its entirety.
> Compared to the data structure in v1[2], this saves a few CPU
> cycles by avoiding the packing and unpacking of the invalidation
> fields of each command.
> 
> [User Data]
> struct iommu_hwpt_invalidate_arm_smmuv3 {
> 	__u64 cmd[2];
> 	__u32 error;
> 	__u32 __reserved;
> };
> 
> [Driver Handler]
> // This draft relies on set_rid/unset_rid from the 2nd draft to
> // sanity-check commands and report errors.
> static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
> 					   void *user_data)
> {
> 	struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
> 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> 	struct arm_smmu_device *smmu = smmu_domain->smmu;
> 	u64 cmd[CMDQ_ENT_DWORDS];
> 
> 	if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
> 		return;
> 
> 	cmd[0] = inv_info->cmd[0];
> 	cmd[1] = inv_info->cmd[1];
> 	switch (cmd[0] & CMDQ_0_OP) {
> 	case CMDQ_OP_TLBI_NSNH_ALL:
> 		cmd[0] &= ~CMDQ_0_OP;
> 		cmd[0] |= CMDQ_OP_TLBI_NH_ALL;
> 		fallthrough;
> 	case CMDQ_OP_TLBI_NH_VA:
> 	case CMDQ_OP_TLBI_NH_VAA:
> 	case CMDQ_OP_TLBI_NH_ALL:
> 	case CMDQ_OP_TLBI_NH_ASID:
> 		cmd[0] &= ~CMDQ_TLBI_0_VMID;
> 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID,
> 				     smmu_domain->s2->s2_cfg.vmid);
> 		arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
> 		break;
> 	case CMDQ_OP_CFGI_CD:
> 	case CMDQ_OP_CFGI_CD_ALL:
> 		arm_smmu_sync_cd(smmu_domain,
> 				 FIELD_GET(CMDQ_CFGI_0_SSID, cmd[0]), false);
> 		break;
> 	default:
> 		return;
> 	}
> }
> 
> 
> The second solution goes a step further to support batching, by
> sharing a kernel page via mmap().
> 
> Firstly, I drafted a pair of new ioctls and iommu_ops callbacks to
> link vSID with pSID. This seems to be useful for Intel VT-d also.
> https://github.com/nicolinc/iommufd/commit/cbec01245edc70e5de88fbc477d8f0af50b3b0ed
> 
> Then I added a new mmap interface to share kernel page(s) from
> the driver, allowing QEMU to write all TLBI commands as a single
> batch and then initiate the batch invalidation via one
> synchronous hypercall.
> https://github.com/nicolinc/iommufd/commit/ee717eb6d46b5285db1aae9172ecdfc70b9cd9ca
> 
> The two temporary changes are available in this repo:
> https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04022023
> This solution certainly needs some rework before it can be posted
> in the next version for review. Still, I'd like to know whether it
> is a good direction for us to adopt in v2.
> 
> The new set_rid/unset_rid ioctls and the mmap interface would be
> essential for the VCMDQ support that we'd like to achieve at the
> end of this journey. So, personally, I'd like to see them used at
> this stage by the generic SMMUv3 (and potentially VT-d) too.

Good try. Just curious: is the vRID->pRID conversion the only
conversion required in the cache invalidation path for ARM SMMU?

For VT-d, except for device-TLB invalidation, there is no RID
information in the invalidation commands submitted by the iommu
driver. Device-TLB invalidation is effectively an operation
included in IOTLB invalidation when the page table is used by
devices that have enabled ATS, so the guest VT-d iommu driver may
not use device-TLB invalidation at all. Hence the vRID->pRID
conversion may not happen on the VT-d side.

VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
conversion. vPASID conversion may be needed later, as we may
disable guest PASID support at this moment. For vDomain_id
conversion, there is a gap: Domain_id is not global today. So if
multiple vIOMMU instances are exposed to the guest, the
vDomain_ids would conflict between invalidation commands
submitted via different vIOMMU instances. Even if we make its
allocation global, I'm also curious when and how we should build
up the vDomain_id->pDomain_id relationship in the kernel. Maybe a
new iommufd uapi just like the vRID set? Or it may be part of the
device driver's attach_hwpt uapi (e.g. VFIO), as this
relationship should only exist while the hwpt and device are
attached.

Regards,
Yi Liu

> 
> Thanks
> Nicolin
> 
> [1] https://lore.kernel.org/linux-iommu/ZBe3kxRXf+VbKy+m@Asurada-Nvidia/
> [2] https://lore.kernel.org/linux-iommu/364cfbe5b228ab178093db2de13fa3accf7a6120.1678348754.git.nicolinc@nvidia.com/


* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  0:33 Cache Invalidation Solution for Nested IOMMU Nicolin Chen
  2023-04-03  7:26 ` Liu, Yi L
@ 2023-04-03  8:00 ` Tian, Kevin
  2023-04-03 14:29   ` Nicolin Chen
  2023-04-03 14:08 ` Jason Gunthorpe
  2 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2023-04-03  8:00 UTC (permalink / raw)
  To: Nicolin Chen, Robin Murphy, jgg
  Cc: Liu, Yi L, eric.auger, baolu.lu, shameerali.kolothum.thodi,
	jean-philippe, iommu

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 8:34 AM
> 
> The new set_rid/unset_rid ioctls and the mmap interface would be
> essential for the VCMDQ support that we'd like to achieve at the
> end of this journey. So, personally, I'd like to see them used at
> this stage by the generic SMMUv3 (and potentially VT-d) too.
> 

We talked earlier that there could be multiple VCMDQs when the
guest is assigned multiple devices behind different SMMUs. How
does the per-iommufd mmap interface work in that scenario?

And this looks different from the requirement of having a
software short path in the kernel to reduce the invalidation
overhead for emulated vIOMMUs. In that case the invalidation
queue is in guest memory, so instead we'd want a registration
cmd here.


* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  7:26 ` Liu, Yi L
@ 2023-04-03  8:39   ` Tian, Kevin
  2023-04-03 15:24     ` Nicolin Chen
  2023-04-03 12:23   ` Jason Gunthorpe
  1 sibling, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2023-04-03  8:39 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen, Robin Murphy, jgg
  Cc: eric.auger, baolu.lu, shameerali.kolothum.thodi, jean-philippe,
	iommu, Jason Gunthorpe, peterx

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, April 3, 2023 3:26 PM
> 
> For VT-d, except for device-TLB invalidation, there is no RID
> information in the invalidation commands submitted by the iommu
> driver. Device-TLB invalidation is effectively an operation
> included in IOTLB invalidation when the page table is used by
> devices that have enabled ATS, so the guest VT-d iommu driver may
> not use device-TLB invalidation at all. Hence the vRID->pRID
> conversion may not happen on the VT-d side.

The guest can still submit a device-tlb invalidation descriptor.
It's just that it could be skipped as a nop here if the host
always invalidates the device-tlb when handling the iotlb
invalidation request.

> 
> VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
> conversion.
> vPASID conversion may be needed later as we may disable guest PASID

vPASID conversion is mandatory when we enable vSVA on SIOV device.

> support at this moment. For vDomain_id conversion, there is a
> gap: Domain_id is not global today. So if multiple vIOMMU
> instances are exposed to the guest, the vDomain_ids would
> conflict between invalidation commands submitted via different
> vIOMMU instances. Even if we make its allocation global, I'm also
> curious when and how we should build up the vDomain_id->pDomain_id
> relationship in the kernel. Maybe a new iommufd uapi just like
> the vRID set? Or it may be part of the device driver's
> attach_hwpt uapi (e.g. VFIO), as this relationship should only
> exist while the hwpt and device are attached.
> 

IMHO we need a vDID->hwpt conversion, i.e. the kernel wants to
know which s1 hwpt is covered by a vDID. From this angle, adding
the vDID at attach_hwpt makes more sense.
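
To make that concrete, a rough sketch (the vdid field is
hypothetical, only to illustrate extending an attach_hwpt uapi):

struct vfio_device_attach_hwpt {
	__u32	argsz;
	__u32	flags;
	__u32	hwpt_id;	/* s1 hwpt */
	__u32	vdid;		/* guest domain ID covering this hwpt */
};

The kernel could then record the vDID->hwpt mapping at attach and
drop it at detach, without a separate iommufd uapi.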

But honestly speaking, I'm hesitant to introduce the native
format and those helper APIs for VT-d at this point. Supporting
an in-kernel short path won't happen in the short term, and what
we define now may not fit the requirement when it comes.

With that, let's continue to define a customized simple format
for VT-d iotlb invalidation, plus allow the user to batch the
requests. The extra packing/unpacking overhead is negligible
compared to the long invalidation path at this moment. Then we
can consider the native format as a 2nd supported format later,
when in-kernel acceleration is being worked on.

Thanks
Kevin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  7:26 ` Liu, Yi L
  2023-04-03  8:39   ` Tian, Kevin
@ 2023-04-03 12:23   ` Jason Gunthorpe
  1 sibling, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-03 12:23 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Nicolin Chen, Robin Murphy, Tian, Kevin, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu, peterx

On Mon, Apr 03, 2023 at 07:26:28AM +0000, Liu, Yi L wrote:

> For VT-d, except for device-TLB invalidation, there is no RID
> information in the invalidation commands submitted by the iommu
> driver. Device-TLB invalidation is effectively an operation
> included in IOTLB invalidation when the page table is used by
> devices that have enabled ATS, so the guest VT-d iommu driver may
> not use device-TLB invalidation at all. Hence the vRID->pRID
> conversion may not happen on the VT-d side.

It is the same as ARM: the RID translation is only needed to
support the ATS case.

> Domain_id is not global today. So if multiple vIOMMU instances
> are exposed to the guest, the vDomain_ids would conflict between
> invalidation commands submitted via different vIOMMU instances.

You'd need to expose multiple physical instances as multiple virtual
instances.

> we make its allocation global, I'm also curious when and how we
> should build up the vDomain_id->pDomain_id relationship in the
> kernel. Maybe a new iommufd uapi just like the vRID set?

I'm not so sure it even works sensibly for VT-d. Don't you have
to convert the type of invalidation as well? E.g. the nested
mode uses the DID quite differently from a guest that is not
nesting.

IIRC you end up converting vDIDs into pPASID/pDIDs and doing a PASID
invalidation.

The point of doing all this stuff is to avoid invalidation
multiplication: one guest invalidation should translate into one
host invalidation. When properly set up, the guest invalidations
should not iterate over host objects.

Jason


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  0:33 Cache Invalidation Solution for Nested IOMMU Nicolin Chen
  2023-04-03  7:26 ` Liu, Yi L
  2023-04-03  8:00 ` Tian, Kevin
@ 2023-04-03 14:08 ` Jason Gunthorpe
  2023-04-03 14:51   ` Nicolin Chen
  2 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-03 14:08 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote:
> The first solution simply forwards each command in its entirety.
> Compared to the data structure in v1[2], this saves a few CPU
> cycles by avoiding the packing and unpacking of the invalidation
> fields of each command.

The kernel must validate the SID for the ATS invalidations; we
can't just blindly pass it through.

And this simple path needs an explanation of how errors are
properly handled, e.g. by making execution synchronous, or by
someone guaranteeing that errors are impossible.

> Then I added a new mmap interface to share kernel page(s) from
> the driver, allowing QEMU to write all TLBI commands as a single
> batch and then initiate the batch invalidation via one
> synchronous hypercall.

I don't think an mmap is really needed for simple batching;
just passing a larger buffer to the ioctl is probably good
enough.

If a SW side is built it should mirror the HW vCMDQ path, not be
different.

Jason


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  8:00 ` Tian, Kevin
@ 2023-04-03 14:29   ` Nicolin Chen
  2023-04-04  2:15     ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolin Chen @ 2023-04-03 14:29 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Robin Murphy, jgg, Liu, Yi L, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Mon, Apr 03, 2023 at 08:00:12AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Monday, April 3, 2023 8:34 AM
> >
> > The new set_rid/unset_rid ioctls and the mmap interface would be
> > essential for the VCMDQ support that we'd like to achieve at the
> > end of this journey. So, personally, I'd like to see them used at
> > this stage by the generic SMMUv3 (and potentially VT-d) too.
> >
> 
> We talked earlier that there could be multiple VCMDQs when the
> guest is assigned multiple devices behind different SMMUs. How
> does the per-iommufd mmap interface work in that scenario?

I'm trying to document that each IOMMUFD object can possibly
have a shared page; the mmap interface takes the index of an
IOMMUFD object ID. So either a pt_id (S1) or a dev_id should be
able to identify which physical SMMU, I think.
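
A rough sketch of what I mean (the offset encoding here is only a
draft assumption, not a settled ABI):

	/* Use the object ID as the page index into the iommufd char dev */
	page = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    iommufd, (off_t)object_id * page_size);

Since a HWPT or a device belongs to exactly one physical SMMU,
the driver can route the mapping to that SMMU's shared page.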

> And this looks different from the requirement of having a
> software short path in the kernel to reduce the invalidation
> overhead for emulated vIOMMUs. In that case the invalidation
> queue is in guest memory, so instead we'd want a registration
> cmd here.

Yes for the first part. There are certain difficulties in doing
a short path, such as the host kernel replacing the host queue
that the actual HW runs on with a guest TLBI queue. So my draft
is more about batching.

For the last part, what would that "registration cmd" do? In my
draft, the hypervisor dispatches all invalidation commands to
the guest TLBI queue (or call it a user queue), which is
transparent to the guest OS.

Thanks
Nicolin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03 14:08 ` Jason Gunthorpe
@ 2023-04-03 14:51   ` Nicolin Chen
  2023-04-03 19:15     ` Robin Murphy
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolin Chen @ 2023-04-03 14:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Mon, Apr 03, 2023 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote:
> > The first solution simply forwards each command in its entirety.
> > Compared to the data structure in v1[2], this saves a few CPU
> > cycles by avoiding the packing and unpacking of the invalidation
> > fields of each command.
> 
> The kernel must validate the SID for the ATS invalidations; we
> can't just blindly pass it through.

Yes. I didn't go further with the first version, but left a line
of comments in the handler: we'd need set/unset_rid_user to
validate the SID field of INV_ATC commands, as we discussed.

> And this simple path needs an explanation of how errors are
> properly handled, e.g. by making execution synchronous, or by
> someone guaranteeing that errors are impossible.

Yes. Both versions here execute in a synchronous fashion. The
error code will be returned in the cache_invalidate_user data
structure.

> > Then I added a new mmap interface to share kernel page(s) from
> > the driver, allowing QEMU to write all TLBI commands as a single
> > batch and then initiate the batch invalidation via one
> > synchronous hypercall.
> 
> I don't think an mmap is really needed for simple batching;
> just passing a larger buffer to the ioctl is probably good
> enough.

It wouldn't be a must, but it can omit a copy_from_user() at
each hypercall? And it also eases VCMDQ a bit.

> If a SW side is built it should mirror the HW vCMDQ path, not be
> different.

The host kernel has the host queue, while the hypervisor fills
in a guest TLBI queue. Switching between two queues at one SMMU
CMDQ (HW) requires a very complicated locking mechanism, vs.
inserting the batch into the existing host queue. And doing that
probably doesn't bring a big perf improvement?

If the SMMU has ECMDQ, it'd allocate a free CMDQ upon
availability, calling arm_smmu_init_one_queue() and mmapping
q->base; then it can execute the guest TLBI queue directly,
passing that q ptr.

Thanks
Nicolin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  8:39   ` Tian, Kevin
@ 2023-04-03 15:24     ` Nicolin Chen
  2023-04-04  2:42       ` Tian, Kevin
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolin Chen @ 2023-04-03 15:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Robin Murphy, jgg, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu, peterx

On Mon, Apr 03, 2023 at 08:39:02AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, April 3, 2023 3:26 PM
> >
> > For VT-d, except for device-TLB invalidation, there is no RID
> > information in the invalidation commands submitted by the iommu
> > driver. Device-TLB invalidation is effectively an operation
> > included in IOTLB invalidation when the page table is used by
> > devices that have enabled ATS, so the guest VT-d iommu driver may
> > not use device-TLB invalidation at all. Hence the vRID->pRID
> > conversion may not happen on the VT-d side.
> 
> The guest can still submit a device-tlb invalidation descriptor.
> It's just that it could be skipped as a nop here if the host
> always invalidates the device-tlb when handling the iotlb
> invalidation request.

In either case, it sounds like the RID isn't needed in the VT-d
case?

SMMU needs the vSID<->pSID info to verify an ATS invalidation.

> > VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
> > conversion.
> > vPASID conversion may be needed later as we may disable guest PASID
> 
> vPASID conversion is mandatory when we enable vSVA on SIOV device.

vPASID is allocated at runtime, so the hypercall timing is a bit
different from SMMU's vSID. But I think it could go with this
uAPI too? We'd just need to turn the uAPI into a shareable one.

> > support at this moment. For vDomain_id conversion, there is a
> > gap: Domain_id is not global today. So if multiple vIOMMU
> > instances are exposed to the guest, the vDomain_ids would
> > conflict between invalidation commands submitted via different
> > vIOMMU instances. Even if we make its allocation global, I'm also
> > curious when and how we should build up the vDomain_id->pDomain_id
> > relationship in the kernel. Maybe a new iommufd uapi just like
> > the vRID set? Or it may be part of the device driver's
> > attach_hwpt uapi (e.g. VFIO), as this relationship should only
> > exist while the hwpt and device are attached.
> >
> 
> IMHO we need a vDID->hwpt conversion, i.e. the kernel wants to
> know which s1 hwpt is covered by a vDID. From this angle, adding
> the vDID at attach_hwpt makes more sense.

I share the same view. A "vDID" is HWPT-oriented, while the
set/unset_rid_user uAPI is for passthrough devices. So it'd
probably be better to go with a different uAPI, at least.

> But honestly speaking, I'm hesitant to introduce the native
> format and those helper APIs for VT-d at this point. Supporting
> an in-kernel short path won't happen in the short term, and what
> we define now may not fit the requirement when it comes.
>
> With that, let's continue to define a customized simple format
> for VT-d iotlb invalidation, plus allow the user to batch the
> requests. The extra packing/unpacking overhead is negligible
> compared to the long invalidation path at this moment. Then we
> can consider the native format as a 2nd supported format later,
> when in-kernel acceleration is being worked on.

It'd be okay to do it later for VT-d, so long as the uAPI we
add for SMMUv3 would potentially fit VT-d too :)

Thanks
Nicolin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03 14:51   ` Nicolin Chen
@ 2023-04-03 19:15     ` Robin Murphy
  2023-04-04  0:02       ` Nicolin Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Robin Murphy @ 2023-04-03 19:15 UTC (permalink / raw)
  To: Nicolin Chen, Jason Gunthorpe
  Cc: kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On 2023-04-03 15:51, Nicolin Chen wrote:
> On Mon, Apr 03, 2023 at 11:08:23AM -0300, Jason Gunthorpe wrote:
>> On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote:
>>> The first solution simply forwards each command in its entirety.
>>> Compared to the data structure in v1[2], this saves a few CPU
>>> cycles by avoiding the packing and unpacking of the invalidation
>>> fields of each command.
>>
>> The kernel must validate the SID for the ATS invalidations; we
>> can't just blindly pass it through.
> 
> Yes. I didn't go further with the first version, but left a line
> of comments in the handler: we'd need set/unset_rid_user to
> validate the SID field of INV_ATC commands, as we discussed.
> 
>> And this simple path needs an explanation of how errors are
>> properly handled, e.g. by making execution synchronous, or by
>> someone guaranteeing that errors are impossible.
> 
> Yes. Both versions here execute in a synchronous fashion. The
> error code will be returned in the cache_invalidate_user data
> structure.
> 
>>> Then I added a new mmap interface to share kernel page(s) from
>>> the driver, allowing QEMU to write all TLBI commands as a single
>>> batch and then initiate the batch invalidation via one
>>> synchronous hypercall.
>>
>> I don't think an mmap is really needed for simple batching;
>> just passing a larger buffer to the ioctl is probably good
>> enough.
> 
> It wouldn't be a must, but it can omit a copy_from_user() at
> each hypercall? And it also eases VCMDQ a bit.
> 
>> If a SW side is built it should mirror the HW vCMDQ path, not be
>> different.
> 
> The host kernel has the host queue, while the hypervisor fills
> in a guest TLBI queue. Switching between two queues at one SMMU
> CMDQ (HW) requires a very complicated locking mechanism, vs.
> inserting the batch into the existing host queue. And doing that
> probably doesn't bring a big perf improvement?
>
> If the SMMU has ECMDQ, it'd allocate a free CMDQ upon
> availability, calling arm_smmu_init_one_queue() and mmapping
> q->base; then it can execute the guest TLBI queue directly,
> passing that q ptr.

FWIW I don't think that should be visible in a userspace interface. When 
the VMM is just requesting some invalidations in order to emulate some 
commands, it's up to the SMMU driver, or at best between the driver 
and IOMMUFD, to decide exactly how those requests get executed as 
physical commands - that should not make any difference to the requester 
other than how quickly the requests are processed.

AFAICS this interface can't look like the proper hardware vCMDQ path, 
because the whole point of that will be to configure it in advance, map 
the queue controls directly into the guest, and avoid trapping 
invalidations to the VMM at all. This invalidation request interface is 
a large part of precisely what that path is intended to bypass. I don't 
see much benefit in supporting an additional slightly-accelerated slow 
path where the host avoids a tiny bit of housekeeping by maintaining a 
real vCMDQ *on behalf of* the guest and forwarding trapped commands into 
it instead of just processing them normally as host commands. Or, 
conversely, emulating a vCMDQ *in* the host kernel, in a way which still 
requires traps to bounce through the VMM and back - that just seems 
objectively worse than keeping all the emulation together in one place 
(however, I would concur that a "vhost-style" emulation, using all the 
same interfaces for configuration and error/irq/etc. reporting as the 
real hardware would, might be viable if performance really demands it).

Thanks,
Robin.


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03 19:15     ` Robin Murphy
@ 2023-04-04  0:02       ` Nicolin Chen
  2023-04-04 16:20         ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolin Chen @ 2023-04-04  0:02 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jason Gunthorpe, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

Hi Robin,

On Mon, Apr 03, 2023 at 08:15:03PM +0100, Robin Murphy wrote:
> On 2023-04-03 15:51, Nicolin Chen wrote:
> > On Mon, Apr 03, 2023 at 11:08:23AM -0300, Jason Gunthorpe wrote:
> > > On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote:
> > > > The first solution simply forwards each command in its entirety.
> > > > Compared to the data structure in v1[2], this saves a few CPU
> > > > cycles by avoiding the packing and unpacking of the invalidation
> > > > fields of each command.
> > > 
> > > The kernel must validate the SID for the ATS invalidations; we
> > > can't just blindly pass it through.
> > 
> > Yes. I didn't go further with the first version, but left a line
> > of comments in the handler: we'd need set/unset_rid_user to
> > validate the SID field of INV_ATC commands, as we discussed.
> > 
> > > And this simple path needs an explanation of how errors are
> > > properly handled, e.g. by making execution synchronous, or by
> > > someone guaranteeing that errors are impossible.
> > 
> > Yes. Both versions here execute in a synchronous fashion. The
> > error code will be returned in the cache_invalidate_user data
> > structure.
> > 
> > > > Then I added a new mmap interface to share kernel page(s) from
> > > > the driver, allowing QEMU to write all TLBI commands as a single
> > > > batch and then initiate the batch invalidation via one
> > > > synchronous hypercall.
> > > 
> > > I don't think an mmap is really needed for simple batching;
> > > just passing a larger buffer to the ioctl is probably good
> > > enough.
> > 
> > It wouldn't be a must, but it can omit a copy_from_user() at
> > each hypercall? And it also eases VCMDQ a bit.
> > 
> > > If a SW side is built it should mirror the HW vCMDQ path, not be
> > > different.
> > 
> > The host kernel has the host queue, while the hypervisor fills
> > in a guest TLBI queue. Switching between two queues at one SMMU
> > CMDQ (HW) requires a very complicated locking mechanism, vs.
> > inserting the batch into the existing host queue. And doing that
> > probably doesn't bring a big perf improvement?
> >
> > If the SMMU has ECMDQ, it'd allocate a free CMDQ upon
> > availability, calling arm_smmu_init_one_queue() and mmapping
> > q->base; then it can execute the guest TLBI queue directly,
> > passing that q ptr.
> 
> FWIW I don't think that should be visible in a userspace interface. When
> the VMM is just requesting some invalidations in order to emulate some
> commands, it's up to the SMMU driver, or at best between the driver
> and IOMMUFD, to decide exactly how those requests get executed as
> physical commands - that should not make any difference to the requester
> other than how quickly the requests are processed.
> 
> AFAICS this interface can't look like the proper hardware vCMDQ path,
> because the whole point of that will be to configure it in advance, map
> the queue controls directly into the guest, and avoid trapping
> invalidations to the VMM at all. This invalidation request interface is
> a large part of precisely what that path is intended to bypass. I don't
> see much benefit in supporting an additional slightly-accelerated slow
> path where the host avoids a tiny bit of housekeeping by maintaining a
> real vCMDQ *on behalf of* the guest and forwarding trapped commands into
> it instead of just processing them normally as host commands. Or,

I tend to agree with most of your points. The implementation of
a SW-emulated VCMDQ might be overcomplicated, having to
cooperate with both the kernel driver and the QEMU vSMMU code.
If the SMMU HW (in most cases) only has one CMDQ, it is hard to
switch between host and guest commands and still gain
performance. VCMDQ can simply do that because it has multiple
CMDQs by nature.

What is the normal processing approach that you'd suggest? Do
you agree that having a batch invalidation would be nicer? We
could go for the mmap'd page approach in my draft, or go for
the ioctl that Jason pointed out.

My preference is to have a mmap'd page, so the interface can
be reused later by VCMDQ too. Performance-wise, it should be
good enough, since it does batching, IMHO.

> conversely, emulating a vCMDQ *in* the host kernel, in a way which still
> requires traps to bounce through the VMM and back - that just seems
> objectively worse than keeping all the emulation together in one place
> (however, I would concur that a "vhost-style" emulation, using all the
> same interfaces for configuration and error/irq/etc. reporting as the
> real hardware would, might be viable if performance really demands it).

I am not very familiar with the vhost style. From a glance at
its doc, it seems to be one interface for all hypercalls? That
would be very different from what we are designing here...

Thanks
Nicolin


* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-03 14:29   ` Nicolin Chen
@ 2023-04-04  2:15     ` Tian, Kevin
  2023-04-04  2:47       ` Nicolin Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2023-04-04  2:15 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, jgg, Liu, Yi L, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 10:30 PM
> 
> On Mon, Apr 03, 2023 at 08:00:12AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Monday, April 3, 2023 8:34 AM
> > >
> > > The new set_rid/unset_rid ioctls and the mmap interface would be
> > > essential for the VCMDQ support that we'd like to achieve at the
> > > end of this journey. So, personally, I'd like to see them used at
> > > this stage by the generic SMMUv3 (and potentially VT-d) too.
> > >
> >
> > We talked earlier that there could be multiple VCMDQs when the
> > guest is assigned multiple devices behind different SMMUs. How
> > does the per-iommufd mmap interface work in that scenario?
> 
> I'm trying to document that each IOMMUFD object can possibly
> have a shared page; the mmap interface takes the index of an
> IOMMUFD object ID. So either a pt_id (S1) or a dev_id should be
> able to identify which physical SMMU, I think.

Are all the cmds allowed in VCMDQ per-hwpt? If not, then
building the mmap interface per hwpt object is not correct. We
may want an explicit VCMDQ object in that case.

And devices behind different SMMUs may be attached to the same
hwpt. In that case the number of VCMDQs associated with a hwpt
is also dynamic.

But if we are just talking about batching for an emulated smmu,
then having the user pass a big buffer makes more sense.
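
i.e. something along these lines, with the array carried directly
by the ioctl (a sketch only; the names and layout are
illustrative):

struct iommu_hwpt_invalidate {
	__u32 size;
	__u32 hwpt_id;
	__u64 data_uptr;	/* array of driver-specific entries */
	__u32 entry_len;	/* size of one entry */
	__u32 entry_num;	/* in: entries provided, out: handled */
};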

> 
> > > And this looks different from the requirement of having a
> > > software short path in the kernel to reduce the invalidation
> > > overhead for emulated vIOMMUs. In that case the invalidation
> > > queue is in guest memory, so instead we'd want a registration
> > > cmd here.
> 
> Yes for the first part. There are certain difficulties in doing
> a short path, such as the host kernel replacing the host queue
> that the actual HW runs on with a guest TLBI queue. So my draft
> is more about batching.
>
> For the last part, what would that "registration cmd" do? In my
> draft, the hypervisor dispatches all invalidation commands to
> the guest TLBI queue (or call it a user queue), which is
> transparent to the guest OS.
> 

Registration means the user passes the buffer to the kernel.

If we want to support the in-kernel short path, then we want the
host smmu driver to directly read cmds out of the guest TLBI
queue.


* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-03 15:24     ` Nicolin Chen
@ 2023-04-04  2:42       ` Tian, Kevin
  2023-04-04  3:12         ` Nicolin Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Tian, Kevin @ 2023-04-04  2:42 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, Robin Murphy, jgg, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu, peterx

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 11:24 PM
> 
> > > VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
> > > conversion.
> > > vPASID conversion may be needed later as we may disable guest PASID
> >
> > vPASID conversion is mandatory when we enable vSVA on SIOV device.
> 
> vPASID is allocated at runtime, so the hypercall timing is a bit
> different from SMMU's vSID. But I think it could go with this
> uAPI too? We'd just need to turn the uAPI into a shareable one.

Not necessarily. It's clearer as a separate cmd and format.

> > But honestly speaking, I'm hesitant to introduce the native
> > format and those helper APIs for VT-d at this point. Supporting
> > an in-kernel short path won't happen in the short term, and what
> > we define now may not fit the requirement when it comes.
> >
> > With that, let's continue to define a customized simple format
> > for VT-d iotlb invalidation, plus allow the user to batch the
> > requests. The extra packing/unpacking overhead is negligible
> > compared to the long invalidation path at this moment. Then we
> > can consider the native format as a 2nd supported format later,
> > when in-kernel acceleration is being worked on.
> 
> It'd be okay to do it later for VT-d, so long as the uAPI we
> add for SMMUv3 would potentially fit VT-d too :)
> 

Yes. Btw, you need to decide which usage is covered by this design:

1) vSMMU reads cmd from guest TLBI queue when the tail register is written
   and then submits the cmd in a user-provided buffer to the kernel.

   This is the basic path.

2) vSMMU reads base addr of guest TLBI queue when the start register is
   written and registers the guest queue to the kernel. In the meantime
   establish the protocol between kvm and smmu driver so when kvm
   traps guest write to the tail register it directly notifies the smmu driver
   and skips the userspace. smmu driver then directly reads cmd from guest
   queue to handle.

   This is the in-kernel short path.

3) with VCMDQ then vSMMU needs to mmap start/head/tail/... registers
   of VCMDQ and allows the guest to directly access. No host intervention
   when guest submits cmd to VCMDQ.

   This is the hw acceleration path.

I'm a bit confused, in some of the discussions, about whether
what you implement for 1) must be forward compatible with 2) and
3). This is difficult to know before we actually start working
on them. Given that an iommu driver will support multiple
formats (e.g. when vhost-iommu comes), we should probably focus
more on the minimum necessary for what 1) actually requires now?

Thanks
Kevin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04  2:15     ` Tian, Kevin
@ 2023-04-04  2:47       ` Nicolin Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-04  2:47 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Robin Murphy, jgg, Liu, Yi L, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Tue, Apr 04, 2023 at 02:15:38AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Monday, April 3, 2023 10:30 PM
> >
> > On Mon, Apr 03, 2023 at 08:00:12AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Monday, April 3, 2023 8:34 AM
> > > >
> > > > The new set_rid/unset_rid ioctls and the mmap interface would be
> > > > essential for the VCMDQ support that we'd like to achieve at the
> > > > end of this journey. So, personally, I'd like to see them used at
> > > > this stage by the generic SMMUv3 (and potentially VT-d) too.
> > > >
> > >
> > > We talked earlier that there could be multiple VCMDQs when the
> > > guest is assigned multiple devices behind different SMMUs. How
> > > does the per-iommufd mmap interface work in that scenario?
> >
> > I'm trying to document that each IOMMUFD object can possibly
> > have a shared page; the mmap interface takes the index of an
> > IOMMUFD object ID. So either a pt_id (S1) or a dev_id should be
> > able to identify which physical SMMU, I think.
> 
> Are all the cmds allowed in VCMDQ per-hwpt? If not, then
> building the mmap interface per hwpt object is not correct. We
> may want an explicit VCMDQ object in that case.

One VCMDQ HW per SMMU instance. So all HWPTs that are created
by devices behind the same SMMU instance share the same VCMDQ
HW. Each VCMDQ HW can also allocate multiple queues that don't
necessarily tie to any HWPT either.

> And devices behind different SMMUs may be attached to the same
> hwpt. In that case the number of VCMDQs associated with a hwpt
> is also dynamic.

Unless two HWPTs share the same S1 Context Table, how can two
devices behind different SMMUs attach to the same HWPT? And,
it doesn't sound very plausible to share the same S1 Context
Table between two devices either?

> But if we are just talking about batching for an emulated smmu,
> then having the user pass a big buffer makes more sense.

OK. That's in line with Jason's suggestion of passing a queue
buffer via the ioctl.

> > > And this looks different from the requirement of having a
> > > software short path in the kernel to reduce the invalidation
> > > overhead for emulated vIOMMUs. In that case the invalidation
> > > queue is in guest memory, so instead we'd want a registration
> > > cmd here.
> >
> > Yes for the first part. There are certain difficulties in doing
> > a short path, such as the host kernel replacing the host queue
> > that the actual HW runs on with a guest TLBI queue. So my draft
> > is more about batching.
> >
> > For the last part, what would that "registration cmd" do? In my
> > draft, the hypervisor dispatches all invalidation commands to
> > the guest TLBI queue (or call it a user queue), which is
> > transparent to the guest OS.
> >
> 
> Registration means the user passes the buffer to the kernel.
>
> If we want to support the in-kernel short path, then we want the
> host smmu driver to directly read cmds out of the guest TLBI
> queue.

I expect the mmap approach can go a bit further for an SMMU
that has ECMDQ capability (multi CMDQs): it could load the
guest TLBI queue without copying that buffer and inserting
into the host queue, by allocating a separate cmdq object
in the SMMU driver and mmap'ing its cmdq->q.base to QEMU.

Thanks
Nicolin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04  2:42       ` Tian, Kevin
@ 2023-04-04  3:12         ` Nicolin Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-04  3:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Robin Murphy, jgg, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu, peterx

On Tue, Apr 04, 2023 at 02:42:49AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Monday, April 3, 2023 11:24 PM
> >
> > > > VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
> > > > conversion.
> > > > vPASID conversion may be needed later as we may disable guest PASID
> > >
> > > vPASID conversion is mandatory when we enable vSVA on SIOV device.
> >
> > vPASID is allocated at runtime, so the hypercall timing is a bit
> > different from SMMU's vSID. But I think it could go with this
> > uAPI too? We'd just need to turn the uAPI into a shareable one.
> 
> Not necessarily. It's clearer as a separate cmd and format.

OK. Then set/unset_rid_user can be standalone, though it likely
needs a better name.

> > > But honestly speaking I'm hesitating to introduce native format and those
> > > assistant APIs for VT-d at this point. Supporting in-kernel short path
> > > won't happen in short term. What we defined now may not fit the
> > > requirement when it comes.
> > >
> > > With that let's continue to define a customized simple format for VT-d iotlb
> > > invalidation, plus allowing the user to batch the request. Having extra
> > > packing/unpacking overhead is negligible compared to the long invalidation
> > > path at this moment. Then we can consider native format as a 2nd
> > > supported format later when in-kernel acceleration is being worked on.
> >
> > It'd be okay to do it later for VT-d, so long as the uAPI we
> > add for SMMUv3 would potentially fit VT-d too :)
> >
> 
> Yes. Btw, you need to decide which usage is covered by this design:
> 
> 1) vSMMU reads cmd from guest TLBI queue when the tail register is written
>    and then submits the cmd in a user-provided buffer to the kernel.
> 
>    This is the basic path.
> 
> 2) vSMMU reads base addr of guest TLBI queue when the start register is
>    written and registers the guest queue to the kernel. In the meantime
>    establish the protocol between kvm and smmu driver so when kvm
>    traps guest write to the tail register it directly notifies the smmu driver
>    and skips the userspace. smmu driver then directly reads cmd from guest
>    queue to handle.
> 
>    This is the in-kernel short path.
> 
> 3) with VCMDQ then vSMMU needs to mmap start/head/tail/... registers
>    of VCMDQ and allows the guest to directly access. No host intervention
>    when guest submits cmd to VCMDQ.
> 
>    This is the hw acceleration path.
> 
> I'm a bit confused, in some of the discussions, about whether
> what you implement for 1) must be forward compatible with 2) and
> 3). This is difficult to know before we actually start working
> on them. Given that an iommu driver will support multiple
> formats (e.g. when vhost-iommu comes), we should probably focus
> more on the minimum necessary for what 1) actually requires now?

My draft is more of an enhanced (1) with batching, while using
an mmap interface with (3) in mind. Compared to (2), it
simplifies the host kernel, as QEMU can load every TLBI command
into the mmap'd buffer whenever it traps one.

Maybe (2) would be a cleaner implementation. I could try it too.

Thanks
Nicolin


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04  0:02       ` Nicolin Chen
@ 2023-04-04 16:20         ` Jason Gunthorpe
  2023-04-04 16:50           ` Shameerali Kolothum Thodi
  2023-04-05  5:45           ` Nicolin Chen
  0 siblings, 2 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-04 16:20 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:

> My preference is to have a mmap'd page, so the interface can
> be reused later by VCMDQ too. Performance-wise, it should be
> good enough, since it does batching, IMHO.

You can't reuse mmapping the queue page with vcmdq, so it
doesn't seem meaningful to me.

There should be no mmap on the SW path. If you need a half step
between an ioctl as a batch and a full vhost-like queue scheme,
then using io_uring with pre-registered memory would be
appropriate.

I feel like this is a topic Shameerali should share some insight
on, since the Huawei implementation will rely on this SW path.

Jason


* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-04 16:20         ` Jason Gunthorpe
@ 2023-04-04 16:50           ` Shameerali Kolothum Thodi
  2023-04-05 11:57             ` Jason Gunthorpe
  2023-04-06  6:23             ` Zhangfei Gao
  2023-04-05  5:45           ` Nicolin Chen
  1 sibling, 2 replies; 35+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-04-04 16:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	jean-philippe, iommu, Zhangfei Gao



> -----Original Message-----
> From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> Sent: 04 April 2023 17:20
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com;
> yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com;
> Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> jean-philippe@linaro.org; iommu@lists.linux.dev
> Subject: Re: Cache Invalidation Solution for Nested IOMMU
> 
> On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> 
> > My preference is to have a mmap'd page, so the interface can
> > be reused later by VCMDQ too. Performance-wise, it should be
> > good enough, since it does batching, IMHO.
> 
> You can't reuse mmapping the queue page with vcmdq, so it
> doesn't seem meaningful to me.
>
> There should be no mmap on the SW path. If you need a half step
> between an ioctl as a batch and a full vhost-like queue scheme,
> then using io_uring with pre-registered memory would be
> appropriate.
> 
> I feel like this is a topic Shameerali should share some insight
> on, since the Huawei implementation will rely on this SW path.

So far the tests we have done mainly use the IOCTL method
without batching, and I don't have any numbers to compare
against the batched method yet.

[+Zhangfei]

Zhangfei, do you think we can run some UADK/vSVA tests with the
different invalidation solutions discussed here and compare?

Thanks,
Shameer 


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04 16:20         ` Jason Gunthorpe
  2023-04-04 16:50           ` Shameerali Kolothum Thodi
@ 2023-04-05  5:45           ` Nicolin Chen
  2023-04-05 11:37             ` Jason Gunthorpe
  1 sibling, 1 reply; 35+ messages in thread
From: Nicolin Chen @ 2023-04-05  5:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Tue, Apr 04, 2023 at 01:20:01PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> 
> > My preference is to have a mmap'd page, so the interface can
> > be reused later by VCMDQ too. Performance-wise, it should be
> > good enough, since it does batching, IMHO.
> 
> You can't reuse mmapping the queue page with vcmdq, so it
> doesn't seem meaningful to me.
>
> There should be no mmap on the SW path. If you need a half step
> between an ioctl as a batch and a full vhost-like queue scheme,
> then using io_uring with pre-registered memory would be
> appropriate.

I've changed to a non-mmap approach where the host kernel reads
the guest queue directly and inserts all invalidation commands
into the host queue.

The qsz could be as large as 128 x 64K pages. So, there has to
be a big array of pages getting pinned in the handler.

(The handler still needs a pathway to report errors. I will add
 that tomorrow.)

Does the implementation below look fine in general?

Thanks
Nicolin

[User Data]
/**
 * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
 * @cmdq_base: User space base virtual address of user command queue
 * @cmdq_entry_size: Entry size of user command queue
 * @cmdq_log2size: User command queue size as log 2 (entries)
 *                 Refer to LOG2SIZE field of SMMU_CMDQ_BASE register
 * @cmdq_prod: Producer index of user command queue
 * @cmdq_cons: Consumer index of user command queue
 */
struct iommu_hwpt_invalidate_arm_smmuv3 {
	__u64 cmdq_base;
	__u32 cmdq_entry_size;
	__u32 cmdq_log2size;
	__u32 cmdq_prod;
	__u32 cmdq_cons;
};

[Host Handler]
static int arm_smmu_fix_user_cmd(struct arm_smmu_domain *smmu_domain, u64 *cmd)
{
	struct arm_smmu_stream *stream;

	switch (*cmd & CMDQ_0_OP) {
	case CMDQ_OP_TLBI_NSNH_ALL:
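		/* Convert the guest's NSNH_ALL to an NH_ALL scoped by the S2 VMID */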
		*cmd &= ~CMDQ_0_OP;
		*cmd |= CMDQ_OP_TLBI_NH_ALL;
		fallthrough;
	case CMDQ_OP_TLBI_NH_VA:
	case CMDQ_OP_TLBI_NH_VAA:
	case CMDQ_OP_TLBI_NH_ALL:
	case CMDQ_OP_TLBI_NH_ASID:
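		/* Replace the guest-provided VMID with the S2 domain's VMID */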
		*cmd &= ~CMDQ_TLBI_0_VMID;
		*cmd |= FIELD_PREP(CMDQ_TLBI_0_VMID,
				   smmu_domain->s2->s2_cfg.vmid);
		break;
	case CMDQ_OP_ATC_INV:
	case CMDQ_OP_CFGI_CD:
	case CMDQ_OP_CFGI_CD_ALL:
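		/* Validate the guest vSID and translate it to the physical SID */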
		xa_lock(&smmu_domain->smmu->user_streams);
		stream = xa_load(&smmu_domain->smmu->user_streams,
				 FIELD_GET(CMDQ_CFGI_0_SID, *cmd));
		xa_unlock(&smmu_domain->smmu->user_streams);
		if (!stream)
			return -ENODEV;
		*cmd &= ~CMDQ_CFGI_0_SID;
		*cmd |= FIELD_PREP(CMDQ_CFGI_0_SID, stream->id);
		break;
	default:
		return -EOPNOTSUPP;
	}
	pr_debug("Fixed user CMD: %016llx : %016llx\n", cmd[1], cmd[0]);

	return 0;
}

static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
					   void *user_data)
{
	const u32 cons_err = FIELD_PREP(CMDQ_CONS_ERR, CMDQ_ERR_CERROR_ILL_IDX);
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct iommu_hwpt_invalidate_arm_smmuv3 *inv = user_data;
	struct arm_smmu_device *smmu = smmu_domain->smmu;
	struct arm_smmu_queue q = {
		.llq = {
			.prod = inv->cmdq_prod,
			.cons = inv->cmdq_cons,
			.max_n_shift = inv->cmdq_log2size,
		},
		.ent_dwords = inv->cmdq_entry_size / sizeof(u64),
	};
	int ncmds = inv->cmdq_prod - inv->cmdq_cons;
	unsigned int nents = 1 << q.llq.max_n_shift;
	size_t qsz = nents * inv->cmdq_entry_size;
	unsigned long npages = qsz >> PAGE_SHIFT;
	struct page **pages;
	long pinned;
	u64 *cmds;
	int i = 0;
	int ret;

	if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
		return;
	if (WARN_ON(q.ent_dwords != CMDQ_ENT_DWORDS))
		return;
	if (WARN_ON(queue_empty(&q.llq)))
		return;
	WARN_ON(q.llq.max_n_shift > smmu->cmdq.q.llq.max_n_shift);

	pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return;

	if (ncmds <= 0)
		ncmds += nents;
	cmds = kcalloc(ncmds, inv->cmdq_entry_size, GFP_KERNEL);
	if (!cmds)
		goto out_free_pages;

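	/* Pin the guest queue pages so the commands can be read in the kernel */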
	pinned = get_user_pages(inv->cmdq_base, npages, FOLL_GET, pages, NULL);
	if (pinned != npages)
		goto out_put_page;
	q.base = page_to_virt(pages[0]) + (inv->cmdq_base & (PAGE_SIZE - 1));

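	/* Walk the guest queue, fixing up each command before issuing the list */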
	do {
		u64 *cmd = &cmds[i * CMDQ_ENT_DWORDS];

		queue_read(cmd, Q_ENT(&q, q.llq.cons), q.ent_dwords);
		ret = arm_smmu_fix_user_cmd(smmu_domain, cmd);
		if (ret && ret != -EOPNOTSUPP) {
			q.llq.cons |= cons_err;
			goto out_put_page;
		}
		if (!ret)
			i++;
		queue_inc_cons(&q.llq);
	} while (!queue_empty(&q.llq));

	ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, true);
	/* FIXME return CMD_SYNC timeout */
out_put_page:
	for (i = 0; i < pinned; i++)
		put_page(pages[i]);
	kfree(cmds);
out_free_pages:
	kfree(pages);
	inv->cmdq_cons = q.llq.cons;
}


* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-05  5:45           ` Nicolin Chen
@ 2023-04-05 11:37             ` Jason Gunthorpe
  2023-04-05 15:34               ` Nicolin Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 11:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Tue, Apr 04, 2023 at 10:45:56PM -0700, Nicolin Chen wrote:
> On Tue, Apr 04, 2023 at 01:20:01PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> > 
> > > My preference is to have a mmap'd page, so the interface can
> > > be reused later by VCMDQ too. Performance-wise, it should be
> > > good enough, since it does batching, IMHO.
> > 
> > You can't reuse mmapping the queue page with vcmdq, so it
> > doesn't seem meaningful to me.
> >
> > There should be no mmap on the SW path. If you need a half step
> > between an ioctl as a batch and a full vhost-like queue scheme,
> > then using io_uring with pre-registered memory would be
> > appropriate.
> 
> I've changed to a non-mmap approach that the host kernel reads
> the guest queue directly and inserts all invalidation commands
> into the host queue.
> 
> The qsz could be as large as 128 x 64K pages. So, there has to
> be a big array of pages getting pinned in the handler.
> 
> (The handler still needs a pathway to report errors. I will add
>  tomorrow.)
> 
> Does the implementation below look fine in general?

In general get_user_pages() is really slow compared to copy_from_user
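
e.g. a rough sketch against the draft handler above (hypothetical and
untested -- the out_free_cmds label stands in for whatever error path
the final version grows):

	u64 __user *ucmds = u64_to_user_ptr(inv->cmdq_base);

	do {
		u64 *cmd = &cmds[i * CMDQ_ENT_DWORDS];
		unsigned int idx = Q_IDX(&q.llq, q.llq.cons);

		/* Copy one guest queue entry; nothing pinned or mapped */
		if (copy_from_user(cmd, ucmds + idx * CMDQ_ENT_DWORDS,
				   CMDQ_ENT_DWORDS * sizeof(u64))) {
			q.llq.cons |= cons_err;
			goto out_free_cmds;
		}
		ret = arm_smmu_fix_user_cmd(smmu_domain, cmd);
		if (ret && ret != -EOPNOTSUPP) {
			q.llq.cons |= cons_err;
			goto out_free_cmds;
		}
		if (!ret)
			i++;
		queue_inc_cons(&q.llq);
	} while (!queue_empty(&q.llq));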

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04 16:50           ` Shameerali Kolothum Thodi
@ 2023-04-05 11:57             ` Jason Gunthorpe
  2023-04-06  6:23             ` Zhangfei Gao
  1 sibling, 0 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 11:57 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Nicolin Chen, Robin Murphy, kevin.tian, yi.l.liu, eric.auger,
	baolu.lu, jean-philippe, iommu, Zhangfei Gao

On Tue, Apr 04, 2023 at 04:50:14PM +0000, Shameerali Kolothum Thodi wrote:
> 
> 
> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > Sent: 04 April 2023 17:20
> > To: Nicolin Chen <nicolinc@nvidia.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com;
> > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com;
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > jean-philippe@linaro.org; iommu@lists.linux.dev
> > Subject: Re: Cache Invalidation Solution for Nested IOMMU
> > 
> > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> > 
> > > My preference is to have a mmap'd page, so the interface can
> > > be reused later by VCMDQ too. Performance-wise, it should be
> > > good enough, since it does batching, IMHO.
> > 
> > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> > meaningful to me.
> > 
> > There should be no mmap on the SW path. If you need a half step
> > between an ioctl as a batch and a full vhost-like queue scheme then
> > using iouring with pre-registered memory would be appropriate.
> > 
> > I feel like this is a topic Shameerali should share some insight with
> > since the Huawei implementation will rely on this SW path.
> 
> So far the tests we have done are mainly using IOCTL method without
> batching and I don't have any numbers to compare against the batched
> method yet.

The question is more whether you are happy with the performance of a
single ioctl or whether it needs optimization. It sits in a hot path
of the mm, so it broadly impacts certain workloads once SVA is enabled.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-05 11:37             ` Jason Gunthorpe
@ 2023-04-05 15:34               ` Nicolin Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-05 15:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
	shameerali.kolothum.thodi, jean-philippe, iommu

On Wed, Apr 05, 2023 at 08:37:56AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2023 at 10:45:56PM -0700, Nicolin Chen wrote:
> > On Tue, Apr 04, 2023 at 01:20:01PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> > > 
> > > > My preference is to have a mmap'd page, so the interface can
> > > > be reused later by VCMDQ too. Performance-wise, it should be
> > > > good enough, since it does batching, IMHO.
> > > 
> > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> > > meaningful to me.
> > > 
> > > There should be no mmap on the SW path. If you need a half step
> > > between an ioctl as a batch and a full vhost-like queue scheme then
> > > using iouring with pre-registered memory would be appropriate.
> > 
> > I've changed to a non-mmap approach that the host kernel reads
> > the guest queue directly and inserts all invalidation commands
> > into the host queue.
> > 
> > The qsz could be as large as 128 x 64K pages. So, there has to
> > be a big array of pages getting pinned in the handler.
> > 
> > (The handler still needs a pathway to report errors. I will add
> >  tomorrow.)
> > 
> > Does the implementation below look fine in general?
> 
> In general get_user_pages() is really slow compared to copy_from_user

Will change to using copy_from_user().

Thanks!
Nic

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04 16:50           ` Shameerali Kolothum Thodi
  2023-04-05 11:57             ` Jason Gunthorpe
@ 2023-04-06  6:23             ` Zhangfei Gao
  2023-04-06  6:39               ` Nicolin Chen
  2023-04-06 11:40               ` Jason Gunthorpe
  1 sibling, 2 replies; 35+ messages in thread
From: Zhangfei Gao @ 2023-04-06  6:23 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, Nicolin Chen, Robin Murphy, kevin.tian,
	yi.l.liu, eric.auger, baolu.lu, jean-philippe, iommu, qianweili

On Wed, 5 Apr 2023 at 00:50, Shameerali Kolothum Thodi
<shameerali.kolothum.thodi@huawei.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > Sent: 04 April 2023 17:20
> > To: Nicolin Chen <nicolinc@nvidia.com>
> > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com;
> > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com;
> > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > jean-philippe@linaro.org; iommu@lists.linux.dev
> > Subject: Re: Cache Invalidation Solution for Nested IOMMU
> >
> > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> >
> > > My preference is to have a mmap'd page, so the interface can
> > > be reused later by VCMDQ too. Performance-wise, it should be
> > > good enough, since it does batching, IMHO.
> >
> > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> > meaningful to me.
> >
> > There should be no mmap on the SW path. If you need a half step
> > between an ioctl as a batch and a full vhost-like queue scheme then
> > using iouring with pre-registered memory would be appropriate.
> >
> > I feel like this is a topic Shameerali should share some insight with
> > since the Huawei implementation will rely on this SW path.
>
> So far the tests we have done are mainly using IOCTL method without
> batching and I don't have any numbers to compare against the batched
> method yet.
>
> [+Zhangfei]
>
> Zhangfei, do you think we can run some UADK/vSVA tests with different
> Invalidation solutions discussed here and compare?

Hi, Nicolin

Do you have mmap branches for the kernel and QEMU?

We are using the ioctl method now.
From our testing, TLB misses impact performance a lot, so we use the
huge page method. With huge pages, the guest can achieve performance
comparable to the host.
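
(For reference, a typical way to back guest RAM with huge pages in
QEMU -- generic QEMU usage, not specific to these branches:)

  qemu-system-aarch64 ... \
      -object memory-backend-file,id=mem0,size=8G,mem-path=/dev/hugepages,share=on \
      -numa node,memdev=mem0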

More details:
https://docs.qq.com/doc/DRXlpQmpTSlBZTGZZ

Thanks

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-06  6:23             ` Zhangfei Gao
@ 2023-04-06  6:39               ` Nicolin Chen
  2023-04-06 11:40               ` Jason Gunthorpe
  1 sibling, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-06  6:39 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: Shameerali Kolothum Thodi, Jason Gunthorpe, Robin Murphy,
	kevin.tian, yi.l.liu, eric.auger, baolu.lu, jean-philippe, iommu,
	qianweili

On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> On Wed, 5 Apr 2023 at 00:50, Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Jason Gunthorpe [mailto:jgg@nvidia.com]
> > > Sent: 04 April 2023 17:20
> > > To: Nicolin Chen <nicolinc@nvidia.com>
> > > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com;
> > > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com;
> > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>;
> > > jean-philippe@linaro.org; iommu@lists.linux.dev
> > > Subject: Re: Cache Invalidation Solution for Nested IOMMU
> > >
> > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> > >
> > > > My preference is to have a mmap'd page, so the interface can
> > > > be reused later by VCMDQ too. Performance-wise, it should be
> > > > good enough, since it does batching, IMHO.
> > >
> > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> > > meaningful to me.
> > >
> > > There should be no mmap on the SW path. If you need a half step
> > > between an ioctl as a batch and a full vhost-like queue scheme then
> > > using iouring with pre-registered memory would be appropriate.
> > >
> > > I feel like this is a topic Shameerali should share some insight with
> > > since the Huawei implementation will rely on this SW path.
> >
> > So far the tests we have done are mainly using IOCTL method without
> > batching and I don't have any numbers to compare against the batched
> > method yet.
> >
> > [+Zhangfei]
> >
> > Zhangfei, do you think we can run some UADK/vSVA tests with different
> > Invalidation solutions discussed here and compare?
> 
> Hi, Nicolin
> 
> Do you have mmap branch for kernel and Qemu?
> 
> We are using ioctl method now.
> From the testing, the TLB miss impacts performance a lot, so we use
> huge page method.
> After using huge page method, guest can achieve comparable performance
> with host.

Looks like the test has been running on the previous VFIO solution?

Have you tried a sanity test using the latest IOMMUFD pair? I recall
that Shameer has a set of branches for Linux and QEMU.

Meanwhile, I'll prepare a pair of branches for you to test the
mmap solution. Let me get back to you by the end of the week.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-06  6:23             ` Zhangfei Gao
  2023-04-06  6:39               ` Nicolin Chen
@ 2023-04-06 11:40               ` Jason Gunthorpe
  2023-04-10  1:08                 ` Nicolin Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-06 11:40 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: Shameerali Kolothum Thodi, Nicolin Chen, Robin Murphy,
	kevin.tian, yi.l.liu, eric.auger, baolu.lu, jean-philippe, iommu,
	qianweili

On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:

> We are using ioctl method now.
> From the testing, the TLB miss impacts performance a lot, so we use
> huge page method.
> After using huge page method, guest can achieve comparable performance
> with host.

Looks like these tests are not stressing the MM, just measuring pure
BW of the DMA, so they don't get into the invalidation regime.

You need to measure a more realistic application that is actually using
the MM (eg alloc/free memory, fork, etc) while it operates, with SVA
turned on.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-06 11:40               ` Jason Gunthorpe
@ 2023-04-10  1:08                 ` Nicolin Chen
  2023-04-11  9:07                   ` Jean-Philippe Brucker
  2023-04-12  2:47                   ` Zhangfei Gao
  0 siblings, 2 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-10  1:08 UTC (permalink / raw)
  To: Zhangfei Gao, Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, Robin Murphy, kevin.tian, yi.l.liu,
	eric.auger, baolu.lu, jean-philippe, iommu, qianweili

On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> 
> > We are using ioctl method now.
> > From the testing, the TLB miss impacts performance a lot, so we use
> > huge page method.
> > After using huge page method, guest can achieve comparable performance
> > with host.
> 
> Looks like these tests are not stressing the MM, just measuring pure
> BW of the DMA, so they don't get into the invalidation regime..
> 
> You need to measure a more real application that is actually using the
> MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.

Would an iommu map/unmap benchmark test be useful here?

I added a test program to measure map/unmap times with a
set of different sized buffers:
https://github.com/nicolinc/iommufd/commit/3eb417f2cae0234cc801c6ad74de2afb0ddbdf84
(Also thinking about sending this with an RFC series)

@Zhangfei,
In case that this could be useful, you can pull these two
branches for perf measurement with and without mmap:
# Kernel
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04082023
# QEMU
https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-mmap-04072023

To test without mmap, simply revert the following commits:
# Kernel
git revert ecf602a3c8480ba7ce2c7e77c2d15ca873dbf2e4
# QEMU
git revert c726c014de70998f14b8741a6a96e18a2a7bcd0f

And, you can get the test script in the branch:
  tools/testing/selftests/iommu/iommu_benchmark.sh


In my emulation environment (which is very slow), I see improvements
with mmap. I'll also try setting up a test suite on proper HW this
week.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-10  1:08                 ` Nicolin Chen
@ 2023-04-11  9:07                   ` Jean-Philippe Brucker
  2023-04-11 11:57                     ` Jason Gunthorpe
  2023-04-11 18:43                     ` Nicolin Chen
  2023-04-12  2:47                   ` Zhangfei Gao
  1 sibling, 2 replies; 35+ messages in thread
From: Jean-Philippe Brucker @ 2023-04-11  9:07 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Zhangfei Gao, Jason Gunthorpe, Shameerali Kolothum Thodi,
	Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, iommu,
	qianweili

On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > 
> > > We are using ioctl method now.
> > > From the testing, the TLB miss impacts performance a lot, so we use
> > > huge page method.
> > > After using huge page method, guest can achieve comparable performance
> > > with host.
> > 
> > Looks like these tests are not stressing the MM, just measuring pure
> > BW of the DMA, so they don't get into the invalidation regime..
> > 
> > You need to measure a more real application that is actually using the
> > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> 
> Would an iommu map/unmap benchmark test be useful here?

You can reuse and improve dma_map_benchmark, already upstream. It supports
multiple threads and reports the stddev. For some comparisons the report
resolution (0.1 us) is insufficient and needs to be increased, but for
comparing guest performance it might be alright.

https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/
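
(For instance -- the exact flags depend on the tree, so check the
selftest's usage output -- something like:)

  ./dma_map_benchmark -t 8 -s 30    # 8 threads for 30 seconds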

Thanks,
Jean


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-11  9:07                   ` Jean-Philippe Brucker
@ 2023-04-11 11:57                     ` Jason Gunthorpe
  2023-04-11 18:39                       ` Nicolin Chen
  2023-04-11 18:43                     ` Nicolin Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-11 11:57 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Nicolin Chen, Zhangfei Gao, Shameerali Kolothum Thodi,
	Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, iommu,
	qianweili

On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > > 
> > > > We are using ioctl method now.
> > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > huge page method.
> > > > After using huge page method, guest can achieve comparable performance
> > > > with host.
> > > 
> > > Looks like these tests are not stressing the MM, just measuring pure
> > > BW of the DMA, so they don't get into the invalidation regime..
> > > 
> > > You need to measure a more real application that is actually using the
> > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> > 
> > Would an iommu map/unmap benchmark test be useful here?
> 
> You can reuse and improve dma_map_benchmark, already upstream. It supports
> multi-threads and reports stddev. For some comparisons the report
> resolution (.1 us) is insufficient and needs to be increased, but for
> comparing guest performance it might be alright.
> 
> https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/

I'm not talking about kernel dma_map/unmap, I'm talking about MM
activities like mmap and munmap that become slower once SVA is enabled

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-11 11:57                     ` Jason Gunthorpe
@ 2023-04-11 18:39                       ` Nicolin Chen
  2023-04-11 18:41                         ` Jason Gunthorpe
  0 siblings, 1 reply; 35+ messages in thread
From: Nicolin Chen @ 2023-04-11 18:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Zhangfei Gao, Shameerali Kolothum Thodi,
	Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, iommu,
	qianweili

On Tue, Apr 11, 2023 at 08:57:35AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> > On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > > > 
> > > > > We are using ioctl method now.
> > > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > > huge page method.
> > > > > After using huge page method, guest can achieve comparable performance
> > > > > with host.
> > > > 
> > > > Looks like these tests are not stressing the MM, just measuring pure
> > > > BW of the DMA, so they don't get into the invalidation regime..
> > > > 
> > > > You need to measure a more real application that is actually using the
> > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> > > 
> > > Would an iommu map/unmap benchmark test be useful here?
> > 
> > You can reuse and improve dma_map_benchmark, already upstream. It supports
> > multi-threads and reports stddev. For some comparisons the report
> > resolution (.1 us) is insufficient and needs to be increased, but for
> > comparing guest performance it might be alright.
> > 
> >  https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/
> 
> I'm not talking about kernel dma_map/unmap, I'm talking about MM
> activities like mmap and munmap that become slower once SVA is enabled

Is it about mmap/munmap() calls or accessing mmap'd memory?

If it's about accessing the memory, the test can cover that via the
invalidation pathway. Otherwise, mmap/munmap() are called in
->domain_alloc/free() respectively -- mind elaborating why that
should be a concern?

Thanks
Nic

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-11 18:39                       ` Nicolin Chen
@ 2023-04-11 18:41                         ` Jason Gunthorpe
  2023-04-11 19:02                           ` Nicolin Chen
  0 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-11 18:41 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jean-Philippe Brucker, Zhangfei Gao, Shameerali Kolothum Thodi,
	Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, iommu,
	qianweili

On Tue, Apr 11, 2023 at 11:39:26AM -0700, Nicolin Chen wrote:
> On Tue, Apr 11, 2023 at 08:57:35AM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> > > On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > > > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > > > > 
> > > > > > We are using ioctl method now.
> > > > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > > > huge page method.
> > > > > > After using huge page method, guest can achieve comparable performance
> > > > > > with host.
> > > > > 
> > > > > Looks like these tests are not stressing the MM, just measuring pure
> > > > > BW of the DMA, so they don't get into the invalidation regime..
> > > > > 
> > > > > You need to measure a more real application that is actually using the
> > > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> > > > 
> > > > Would an iommu map/unmap benchmark test be useful here?
> > > 
> > > You can reuse and improve dma_map_benchmark, already upstream. It supports
> > > multi-threads and reports stddev. For some comparisons the report
> > > resolution (.1 us) is insufficient and needs to be increased, but for
> > > comparing guest performance it might be alright.
> > > 
> > >  https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/
> > 
> > I'm not talking about kernel dma_map/unmap, I'm talking about MM
> > activities like mmap and munmap that become slower once SVA is enabled
> 
> Is it about mmap/munmap() calls or accessing mmap'd memory?

I mean literally userspace mmap() calls to change the mm_struct.
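
A hypothetical micro-benchmark for that kind of MM stress (illustration
only, not from any of the branches here): time raw mmap()/munmap()
cycles, run once with the device bound for SVA and once without.

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <time.h>

	int main(void)
	{
		const size_t len = 2UL << 20;	/* 2 MiB per iteration */
		const int iters = 10000;
		struct timespec t0, t1;
		double ns;
		int i;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < iters; i++) {
			void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				return 1;
			memset(p, 0, len);	/* fault the pages in */
			munmap(p, len);		/* triggers the mm invalidations */
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
		printf("%.3f us per mmap/touch/munmap cycle\n", ns / iters / 1e3);
		return 0;
	}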

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-11  9:07                   ` Jean-Philippe Brucker
  2023-04-11 11:57                     ` Jason Gunthorpe
@ 2023-04-11 18:43                     ` Nicolin Chen
  1 sibling, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-11 18:43 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Zhangfei Gao, Jason Gunthorpe, Shameerali Kolothum Thodi,
	Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, iommu,
	qianweili

On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > >
> > > > We are using ioctl method now.
> > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > huge page method.
> > > > After using huge page method, guest can achieve comparable performance
> > > > with host.
> > >
> > > Looks like these tests are not stressing the MM, just measuring pure
> > > BW of the DMA, so they don't get into the invalidation regime..
> > >
> > > You need to measure a more real application that is actually using the
> > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> >
> > Would an iommu map/unmap benchmark test be useful here?
> 
> You can reuse and improve dma_map_benchmark, already upstream. It supports
> multi-threads and reports stddev. For some comparisons the report
> resolution (.1 us) is insufficient and needs to be increased, but for
> comparing guest performance it might be alright.

Oh, I didn't know about that. Thanks for pointing out the test :)

Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-11 18:41                         ` Jason Gunthorpe
@ 2023-04-11 19:02                           ` Nicolin Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-11 19:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Zhangfei Gao, Shameerali Kolothum Thodi,
	Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, iommu,
	qianweili

On Tue, Apr 11, 2023 at 03:41:07PM -0300, Jason Gunthorpe wrote:

> > > > > > > We are using ioctl method now.
> > > > > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > > > > huge page method.
> > > > > > > After using huge page method, guest can achieve comparable performance
> > > > > > > with host.
> > > > > > 
> > > > > > Looks like these tests are not stressing the MM, just measuring pure
> > > > > > BW of the DMA, so they don't get into the invalidation regime..
> > > > > > 
> > > > > > You need to measure a more real application that is actually using the
> > > > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.

> > > I'm not talking about kernel dma_map/unmap, I'm talking about MM
> > > activities like mmap and munmap that become slower once SVA is enabled
> > 
> > Is it about mmap/munmap() calls or accessing mmap'd memory?
> 
> I mean literally userspace mmap() calls to change the mm_struct.

Hmm, it sounds like a different perf factor? Otherwise, I still
don't understand how it impacts our decision between the mmap'd
queue page and copy_from_user().

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-10  1:08                 ` Nicolin Chen
  2023-04-11  9:07                   ` Jean-Philippe Brucker
@ 2023-04-12  2:47                   ` Zhangfei Gao
  2023-04-12  5:47                     ` Nicolin Chen
  2023-05-03 15:14                     ` Shameerali Kolothum Thodi
  1 sibling, 2 replies; 35+ messages in thread
From: Zhangfei Gao @ 2023-04-12  2:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Robin Murphy,
	kevin.tian, yi.l.liu, eric.auger, baolu.lu, jean-philippe, iommu,
	qianweili

Hi, Nicolin

On Mon, 10 Apr 2023 at 09:08, Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> >
> > > We are using ioctl method now.
> > > From the testing, the TLB miss impacts performance a lot, so we use
> > > huge page method.
> > > After using huge page method, guest can achieve comparable performance
> > > with host.
> >
> > Looks like these tests are not stressing the MM, just measuring pure
> > BW of the DMA, so they don't get into the invalidation regime..
> >
> > You need to measure a more real application that is actually using the
> > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
>
> Would an iommu map/unmap benchmark test be useful here?
>
> I added a test program to measure map/unmap times with a
> set of different sized buffers:
> https://github.com/nicolinc/iommufd/commit/3eb417f2cae0234cc801c6ad74de2afb0ddbdf84
> (Also thinking about sending this with a RFC series)
>
> @Zhangfei,
> In case that this could be useful, you can pull these two
> branches for perf measurement with and without mmap:
> # Kernel
> https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04082023
> # QEMU
> https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-mmap-04072023
>
> To test without mmap, simply revert the following commits:
> # Kernel
> git revert ecf602a3c8480ba7ce2c7e77c2d15ca873dbf2e4
> # QEMU
> git revert c726c014de70998f14b8741a6a96e18a2a7bcd0f
>
> And, you can get the test script in the branch:
>   tools/testing/selftests/iommu/iommu_benchmark.sh
>
>
> On my emulation environment (very slow), with mmap, I see
> improvements. I'll also try setting up a test suite on a
> proper HW this week.
>

Still debugging the mmap method here.

Status as of 4.12:

1.  The ioctl method works, based on iommufd.

Performance is the same as before, i.e. on 6.2.

Using the huge page method for the guest, a guest app can get
performance comparable to the host.

(The branches are provided by Shameer)
https://github.com/gaozhangfei/linux-kernel-uadk/tree/private-iommufd_nesting-03272023-6.3-rc4-arm
https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm


2.  The mmap method is still being debugged.
(The patch porting may have issues; I need to port the patches to the
working version; checking now.)

kernel:
https://github.com/gaozhangfei/linux-kernel-uadk/tree/iommufd_nesting-mmap-04082023

qemu:
https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm-mmap

log:

guest:

hisi sec init Kunpeng920!

qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device

qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
invalidate TLB: prod=15fa vs cons=1000709

qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device

qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
invalidate TLB: prod=9bb vs cons=1000709

qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: Connection timed out

qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
invalidate TLB: prod=4c0 vs cons=4c0

qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device

qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
invalidate TLB: prod=f7a vs cons=1000709

qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device

qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
invalidate TLB: prod=81d vs cons=1000709

host:

[218.986319] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
reported (0x00000001), this could be serious

[218.996464] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
0x01004805): Illegal command

[219.004606] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:

[219.011622] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0060000202110687

[219.017515] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000

[219.023422] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
reported (0x00000001), this could be serious

[219.033550] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
0x01004806): Illegal command

[219.041692] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:

[219.048707] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000

[219.054600] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-12  2:47                   ` Zhangfei Gao
@ 2023-04-12  5:47                     ` Nicolin Chen
  2023-05-03 15:14                     ` Shameerali Kolothum Thodi
  1 sibling, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-12  5:47 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Robin Murphy,
	kevin.tian, yi.l.liu, eric.auger, baolu.lu, jean-philippe, iommu,
	qianweili

Hi Zhangfei,

On Wed, Apr 12, 2023 at 10:47:56AM +0800, Zhangfei Gao wrote:

> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device

Only ATC_INV checks the vSID field and returns -ENODEV if the vSID
is not linked to any pSID (i.e. set_rid_user has not been called
correctly for that device).

Maybe you can try some prints in set_rid_user on both the kernel
and QEMU sides.
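
Something as simple as the below in the kernel-side callback would do
(hypothetical -- the exact variable names depend on the WIP branch):

	pr_info("%s: vSID 0x%x -> pSID 0x%x\n", __func__, vsid, stream->id);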
 
> host:
> [218.986319] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> [218.996464] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004805): Illegal command
> [219.004606] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:
> [219.011622] arm-smmu-v3 arm-smmu-v3.3.auto:0x0060000202110687
> [219.017515] arm-smmu-v3 arm-smmu-v3.3.auto:0x0000000000000000
> [219.023422] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> [219.033550] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004806): Illegal command
> [219.041692] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:
> [219.048707] arm-smmu-v3 arm-smmu-v3.3.auto:0x0000000000000000
> [219.054600]arm-smmu-v3 arm-smmu-v3.3.auto:0x0000000000000000

These two commands look very wrong... wondering why the host could
encounter them.

There's a pr_debug in arm_smmu_fix_user_cmd() that can be changed
to pr_alert for guest-queue debugging.

Sorry, the mmap solution might be buggy, as it's a proof of
concept.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: Cache Invalidation Solution for Nested IOMMU
  2023-04-12  2:47                   ` Zhangfei Gao
  2023-04-12  5:47                     ` Nicolin Chen
@ 2023-05-03 15:14                     ` Shameerali Kolothum Thodi
  2023-05-03 23:44                       ` Nicolin Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Shameerali Kolothum Thodi @ 2023-05-03 15:14 UTC (permalink / raw)
  To: Zhangfei Gao, Nicolin Chen
  Cc: Jason Gunthorpe, Robin Murphy, kevin.tian, yi.l.liu, eric.auger,
	baolu.lu, jean-philippe, iommu, qianweili



> -----Original Message-----
> From: Zhangfei Gao [mailto:zhangfei.gao@linaro.org]
> Sent: 12 April 2023 03:48
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; Robin Murphy
> <robin.murphy@arm.com>; kevin.tian@intel.com; yi.l.liu@intel.com;
> eric.auger@redhat.com; baolu.lu@linux.intel.com; jean-philippe@linaro.org;
> iommu@lists.linux.dev; qianweili <qianweili@huawei.com>
> Subject: Re: Cache Invalidation Solution for Nested IOMMU
> 
> Hi, Nicolin
> 
> On Mon, 10 Apr 2023 at 09:08, Nicolin Chen <nicolinc@nvidia.com> wrote:
> >
> > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > >
> > > > We are using ioctl method now.
> > > > From the testing, the TLB miss impacts performance a lot, so we
> > > > use huge page method.
> > > > After using huge page method, guest can achieve comparable
> > > > performance with host.
> > >
> > > Looks like these tests are not stressing the MM, just measuring pure
> > > BW of the DMA, so they don't get into the invalidation regime..
> > >
> > > You need to measure a more real application that is actually using
> > > the MM (eg alloc/free memory, fork, etc) while it operates and turn on
> SVA.
> >
> > Would an iommu map/unmap benchmark test be useful here?
> >
> > I added a test program to measure map/unmap times with a set of
> > different sized buffers:
> > https://github.com/nicolinc/iommufd/commit/3eb417f2cae0234cc801c6ad74de2afb0ddbdf84
> > (Also thinking about sending this with an RFC series)
> >
> > @Zhangfei,
> > In case that this could be useful, you can pull these two branches for
> > perf measurement with and without mmap:
> > # Kernel
> > https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04082023
> > # QEMU
> > https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-mmap-04072023
> >
> > To test without mmap, simply revert the following commits:
> > # Kernel
> > git revert ecf602a3c8480ba7ce2c7e77c2d15ca873dbf2e4
> > # QEMU
> > git revert c726c014de70998f14b8741a6a96e18a2a7bcd0f
> >
> > And, you can get the test script in the branch:
> >   tools/testing/selftests/iommu/iommu_benchmark.sh
> >
> >
> > On my emulation environment (very slow), with mmap, I see
> > improvements. I'll also try setting up a test suite on a proper HW
> > this week.
> >
> 
> Still debugging the mmap method here,
> 
> Status 4.12
> 
> 1.  ioctl method works, based on iommufd
> 
> performance same as before, ie.6.2
> 
> Using huge page method for guest, guest app can get comparable
> performance as host
> 
> (The branches are provided by Shameer)
> https://github.com/gaozhangfei/linux-kernel-uadk/tree/private-iommufd_nesting-03272023-6.3-rc4-arm
> https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm
> 
> 
> 2.  mmap method still in debug,
> (patch porting may have issues, need to port patches to the working version,
> in check)
> 
> kernel:
> https://github.com/gaozhangfei/linux-kernel-uadk/tree/iommufd_nesting-mmap-04082023
> qemu:
> https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm-mmap

Hi Zhangfei,

I had a go with your branches above. The issue below seems to happen only
when you assign multiple devices to the Guest. It looks like when multiple
devices are attached to the Guest, the host kernel receives an invalid Guest
cmd (0x0) and ends up issuing that to the HW, triggering the errors below.

For now, could you please try with just one device? At least in my setup it
didn't trigger the below with just one dev.

Also, as Jason pointed out, I think we might need to run an application that
actually hits the mm invalidation path in the Guest kernel
(arm_smmu_mm_invalidate_range()). With test_sva_perf I couldn't see much of
that happening during the test run.

Another point is that once we have BTM enabled, we might not have these mm
invalidations reaching the host kernel.

Thanks,
Shameer 
> 
> hisi sec init Kunpeng920!
> 
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> 
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=15fa vs cons=1000709
> 
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> 
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=9bb vs cons=1000709
> 
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: Connection timed
> out
> 
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=4c0 vs cons=4c0
> 
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> 
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=f7a vs cons=1000709
> 
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> 
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=81d vs cons=1000709
> 
> host:
> 
> [218.986319] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> 
> [218.996464] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004805): Illegal command
> 
> [219.004606] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in
> error state:
> 
> [219.011622] arm-smmu-v3 arm-smmu-v3.3.auto:0x0060000202110687
> 
> [219.017515] arm-smmu-v3 arm-smmu-v3.3.auto:0x0000000000000000
> 
> [219.023422] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> 
> [219.033550] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004806): Illegal command
> 
> [219.041692] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in
> error state:
> 
> [219.048707] arm-smmu-v3 arm-smmu-v3.3.auto:0x0000000000000000
> 
> [219.054600]arm-smmu-v3 arm-smmu-v3.3.auto:0x0000000000000000

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Cache Invalidation Solution for Nested IOMMU
  2023-05-03 15:14                     ` Shameerali Kolothum Thodi
@ 2023-05-03 23:44                       ` Nicolin Chen
  0 siblings, 0 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-05-03 23:44 UTC (permalink / raw)
  To: Zhangfei Gao, Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, Robin Murphy, kevin.tian, yi.l.liu, eric.auger,
	baolu.lu, jean-philippe, iommu, qianweili

On Wed, May 03, 2023 at 03:14:28PM +0000, Shameerali Kolothum Thodi wrote:

> > > On my emulation environment (very slow), with mmap, I see
> > > improvements. I'll also try setting up a test suite on a proper HW
> > > this week.
> > >
> >
> > Still debugging the mmap method here,
> >
> > Status 4.12
> >
> > 1.  ioctl method works, based on iommufd
 
> I had a go with your above branches. The issue below seems to happen only
> when you assign multiple devices to Guest. It looks like when you have multiple devices
> attached to Guest, the host kernel receives invalid Guest cmd(0x0) and ends up issuing
> that to HW triggering the below errors.

I actually found a bug in my previous wip branch (non-mmap),
resulting in a cons/prod index overflow. But the mmap branch
seems to be okay, since my QEMU code copies all TLBI commands
starting from index=0x0.

You may compare the git-diff, just in case that it helps:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-04262023-Nic
https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-04222023
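
(For illustration -- a hypothetical example of the wrap-bit arithmetic,
not the exact bug: with max_n_shift = 16, i.e. nents = 0x10000, the
state prod = 0x00002 / cons = 0x1fffe describes 4 pending entries, but
a plain subtraction overflows:)

	int nents = 1 << 16;
	int bad  = 0x00002 - 0x1fffe;				/* -131068 */
	int good = (0x00002 - 0x1fffe) & ((nents << 1) - 1);	/* 4 */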

Btw, I ran a comparison on real hardware, but it does not show a
significant difference between the ioctl solution and the mmap
solution, so I kept the ioctl one in v2.

Thanks
Nic

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-05-03 23:45 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-03  0:33 Cache Invalidation Solution for Nested IOMMU Nicolin Chen
2023-04-03  7:26 ` Liu, Yi L
2023-04-03  8:39   ` Tian, Kevin
2023-04-03 15:24     ` Nicolin Chen
2023-04-04  2:42       ` Tian, Kevin
2023-04-04  3:12         ` Nicolin Chen
2023-04-03 12:23   ` Jason Gunthorpe
2023-04-03  8:00 ` Tian, Kevin
2023-04-03 14:29   ` Nicolin Chen
2023-04-04  2:15     ` Tian, Kevin
2023-04-04  2:47       ` Nicolin Chen
2023-04-03 14:08 ` Jason Gunthorpe
2023-04-03 14:51   ` Nicolin Chen
2023-04-03 19:15     ` Robin Murphy
2023-04-04  0:02       ` Nicolin Chen
2023-04-04 16:20         ` Jason Gunthorpe
2023-04-04 16:50           ` Shameerali Kolothum Thodi
2023-04-05 11:57             ` Jason Gunthorpe
2023-04-06  6:23             ` Zhangfei Gao
2023-04-06  6:39               ` Nicolin Chen
2023-04-06 11:40               ` Jason Gunthorpe
2023-04-10  1:08                 ` Nicolin Chen
2023-04-11  9:07                   ` Jean-Philippe Brucker
2023-04-11 11:57                     ` Jason Gunthorpe
2023-04-11 18:39                       ` Nicolin Chen
2023-04-11 18:41                         ` Jason Gunthorpe
2023-04-11 19:02                           ` Nicolin Chen
2023-04-11 18:43                     ` Nicolin Chen
2023-04-12  2:47                   ` Zhangfei Gao
2023-04-12  5:47                     ` Nicolin Chen
2023-05-03 15:14                     ` Shameerali Kolothum Thodi
2023-05-03 23:44                       ` Nicolin Chen
2023-04-05  5:45           ` Nicolin Chen
2023-04-05 11:37             ` Jason Gunthorpe
2023-04-05 15:34               ` Nicolin Chen
