kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
@ 2020-09-10 10:45 Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 01/16] iommu: Report domain nesting info Liu Yi L
                   ` (16 more replies)
  0 siblings, 17 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
Intel platforms allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.

This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
guest application address space with passthru devices. This is called
vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
changes. For IOMMU and QEMU changes, they are in separate series (listed
in the "Related series").

The high-level architecture for SVA virtualization is as below, the key
design of vSVA support is to utilize the dual-stage IOMMU translation (
also known as IOMMU nesting translation) capability in host IOMMU.


    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

Patch Overview:
 1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003, 0015 , 0016)
 2. vfio support for PASID allocation and free for VMs (patch 0004, 0005, 0007)
 3. a fix to a revisit in intel iommu driver (patch 0006)
 4. vfio support for binding guest page table to host (patch 0008, 0009, 0010)
 5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
 6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
 7. expose PASID capability to VM (patch 0013)
 8. add doc for VFIO dual stage control (patch 0014)

The complete vSVA kernel upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. IOMMU-backed Mediated Device assignment
    3. Page Request Services (PRS) support

This patchset is aiming for the phase 1 and phase 2, and based on Jacob's
below series.
*) [PATCH v8 0/7] IOMMU user API enhancement - wip
   https://lore.kernel.org/linux-iommu/1598898300-65475-1-git-send-email-jacob.jun.pan@linux.intel.com/

*) [PATCH v2 0/9] IOASID extensions for guest SVA - wip
   https://lore.kernel.org/linux-iommu/1598070918-21321-1-git-send-email-jacob.jun.pan@linux.intel.com/

Complete set for current vSVA can be found in below branch.
https://github.com/luxis1999/linux-vsva.git vsva-linux-5.9-rc2-v7

The corresponding QEMU patch series is included in below branch:
https://github.com/luxis1999/qemu.git vsva_5.9_rc2_qemu_rfcv10


Regards,
Yi Liu

Changelog:
	- Patch v6 -> Patch v7:
	  a) drop [PATCH v6 01/15] of v6 as it's merged by Alex.
	  b) rebase on Jacob's v8 IOMMU uapi enhancement and v2 IOASID extension patchset.
	  c) Address comments against v6 from Alex and Eric.
	  Patch v6: https://lore.kernel.org/kvm/1595917664-33276-1-git-send-email-yi.l.liu@intel.com/

	- Patch v5 -> Patch v6:
	  a) Address comments against v5 from Eric.
	  b) rebase on Jacob's v6 IOMMU uapi enhancement
	  Patch v5: https://lore.kernel.org/kvm/1594552870-55687-1-git-send-email-yi.l.liu@intel.com/

	- Patch v4 -> Patch v5:
	  a) Address comments against v4
	  Patch v4: https://lore.kernel.org/kvm/1593861989-35920-1-git-send-email-yi.l.liu@intel.com/

	- Patch v3 -> Patch v4:
	  a) Address comments against v3
	  b) Add rb from Stefan on patch 14/15
	  Patch v3: https://lore.kernel.org/kvm/1592988927-48009-1-git-send-email-yi.l.liu@intel.com/

	- Patch v2 -> Patch v3:
	  a) Rebase on top of Jacob's v3 iommu uapi patchset
	  b) Address comments from Kevin and Stefan Hajnoczi
	  c) Reuse DOMAIN_ATTR_NESTING to get iommu nesting info
	  d) Drop [PATCH v2 07/15] iommu/uapi: Add iommu_gpasid_unbind_data
	  Patch v2: https://lore.kernel.org/kvm/1591877734-66527-1-git-send-email-yi.l.liu@intel.com/

	- Patch v1 -> Patch v2:
	  a) Refactor vfio_iommu_type1_ioctl() per suggestion from Christoph
	     Hellwig.
	  b) Re-sequence the patch series for better bisect support.
	  c) Report IOMMU nesting cap info in detail instead of a format in
	     v1.
	  d) Enforce one group per nesting type container for vfio iommu type1
	     driver.
	  e) Build the vfio_mm related code from vfio.c to be a separate
	     vfio_pasid.ko.
	  f) Add PASID ownership check in IOMMU driver.
	  g) Adopted to latest IOMMU UAPI design. Removed IOMMU UAPI version
	     check. Added iommu_gpasid_unbind_data for unbind requests from
	     userspace.
	  h) Define a single ioctl:VFIO_IOMMU_NESTING_OP for bind/unbind_gtbl
	     and cahce_invld.
	  i) Document dual stage control in vfio.rst.
	  Patch v1: https://lore.kernel.org/kvm/1584880325-10561-1-git-send-email-yi.l.liu@intel.com/

	- RFC v3 -> Patch v1:
	  a) Address comments to the PASID request(alloc/free) path
	  b) Report PASID alloc/free availabitiy to user-space
	  c) Add a vfio_iommu_type1 parameter to support pasid quota tuning
	  d) Adjusted to latest ioasid code implementation. e.g. remove the
	     code for tracking the allocated PASIDs as latest ioasid code
	     will track it, VFIO could use ioasid_free_set() to free all
	     PASIDs.
	  RFC v3: https://lore.kernel.org/kvm/1580299912-86084-1-git-send-email-yi.l.liu@intel.com/

	- RFC v2 -> v3:
	  a) Refine the whole patchset to fit the roughly parts in this series
	  b) Adds complete vfio PASID management framework. e.g. pasid alloc,
	  free, reclaim in VM crash/down and per-VM PASID quota to prevent
	  PASID abuse.
	  c) Adds IOMMU uAPI version check and page table format check to ensure
	  version compatibility and hardware compatibility.
	  d) Adds vSVA vfio support for IOMMU-backed mdevs.
	  RFC v2: https://lore.kernel.org/kvm/1571919983-3231-1-git-send-email-yi.l.liu@intel.com/

	- RFC v1 -> v2:
	  Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
	  RFC v1: https://lore.kernel.org/kvm/1562324772-3084-1-git-send-email-yi.l.liu@intel.com/

---
Eric Auger (1):
  vfio: Document dual stage control

Liu Yi L (14):
  iommu: Report domain nesting info
  iommu/smmu: Report empty domain nesting info
  vfio/type1: Report iommu nesting info to userspace
  vfio: Add PASID allocation/free support
  iommu/vt-d: Support setting ioasid set to domain
  iommu/vt-d: Remove get_task_mm() in bind_gpasid()
  vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)
  iommu/vt-d: Check ownership for PASIDs from user-space
  vfio/type1: Support binding guest page tables to PASID
  vfio/type1: Allow invalidating first-level/stage IOMMU cache
  vfio/type1: Add vSVA support for IOMMU-backed mdevs
  vfio/pci: Expose PCIe PASID capability to guest
  iommu/vt-d: Only support nesting when nesting caps are consistent
    across iommu units
  iommu/vt-d: Support reporting nesting capability info

Yi Sun (1):
  iommu: Pass domain to sva_unbind_gpasid()

 Documentation/driver-api/vfio.rst           |  76 ++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  29 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  29 +-
 drivers/iommu/intel/iommu.c                 | 137 +++++++++-
 drivers/iommu/intel/svm.c                   |  43 +--
 drivers/iommu/iommu.c                       |   2 +-
 drivers/vfio/Kconfig                        |   6 +
 drivers/vfio/Makefile                       |   1 +
 drivers/vfio/pci/vfio_pci_config.c          |   2 +-
 drivers/vfio/vfio_iommu_type1.c             | 395 +++++++++++++++++++++++++++-
 drivers/vfio/vfio_pasid.c                   | 283 ++++++++++++++++++++
 include/linux/intel-iommu.h                 |  25 +-
 include/linux/iommu.h                       |   4 +-
 include/linux/vfio.h                        |  54 ++++
 include/uapi/linux/iommu.h                  |  76 ++++++
 include/uapi/linux/vfio.h                   | 101 +++++++
 16 files changed, 1220 insertions(+), 43 deletions(-)
 create mode 100644 drivers/vfio/vfio_pasid.c

-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 01/16] iommu: Report domain nesting info
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-11 19:38   ` Alex Williamson
  2020-09-10 10:45 ` [PATCH v7 02/16] iommu/smmu: Report empty " Liu Yi L
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

IOMMUs that support nesting translation needs report the capability info
to userspace. It gives information about requirements the userspace needs
to implement plus other features characterizing the physical implementation.

This patch introduces a new IOMMU UAPI struct that gives information about
the nesting capabilities and features. This struct is supposed to be returned
by iommu_domain_get_attr() with DOMAIN_ATTR_NESTING attribute parameter, with
one domain whose type has been set to DOMAIN_ATTR_NESTING.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) rephrase the commit message, replace the @data[] field in struct
   iommu_nesting_info with union per comments from Eric Auger.

v5 -> v6:
*) rephrase the feature notes per comments from Eric Auger.
*) rename @size of struct iommu_nesting_info to @argsz.

v4 -> v5:
*) address comments from Eric Auger.

v3 -> v4:
*) split the SMMU driver changes to be a separate patch
*) move the @addr_width and @pasid_bits from vendor specific
   part to generic part.
*) tweak the description for the @features field of struct
   iommu_nesting_info.
*) add description on the @data[] field of struct iommu_nesting_info

v2 -> v3:
*) remvoe cap/ecap_mask in iommu_nesting_info.
*) reuse DOMAIN_ATTR_NESTING to get nesting info.
*) return an empty iommu_nesting_info for SMMU drivers per Jean'
   suggestion.
---
 include/uapi/linux/iommu.h | 76 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
index 1ebc23d..ff987e4 100644
--- a/include/uapi/linux/iommu.h
+++ b/include/uapi/linux/iommu.h
@@ -341,4 +341,80 @@ struct iommu_gpasid_bind_data {
 	} vendor;
 };
 
+/*
+ * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info.
+ *
+ * @flags:	VT-d specific flags. Currently reserved for future
+ *		extension. must be set to 0.
+ * @cap_reg:	Describe basic capabilities as defined in VT-d capability
+ *		register.
+ * @ecap_reg:	Describe the extended capabilities as defined in VT-d
+ *		extended capability register.
+ */
+struct iommu_nesting_info_vtd {
+	__u32	flags;
+	__u64	cap_reg;
+	__u64	ecap_reg;
+};
+
+/*
+ * struct iommu_nesting_info - Information for nesting-capable IOMMU.
+ *			       userspace should check it before using
+ *			       nesting capability.
+ *
+ * @argsz:	size of the whole structure.
+ * @flags:	currently reserved for future extension. must set to 0.
+ * @format:	PASID table entry format, the same definition as struct
+ *		iommu_gpasid_bind_data @format.
+ * @features:	supported nesting features.
+ * @addr_width:	The output addr width of first level/stage translation
+ * @pasid_bits:	Maximum supported PASID bits, 0 represents no PASID
+ *		support.
+ * @vendor:	vendor specific data, structure type can be deduced from
+ *		@format field.
+ *
+ * +===============+======================================================+
+ * | feature       |  Notes                                               |
+ * +===============+======================================================+
+ * | SYSWIDE_PASID |  IOMMU vendor driver sets it to mandate userspace    |
+ * |               |  to allocate PASID from kernel. All PASID allocation |
+ * |               |  free must be mediated through the IOMMU UAPI.       |
+ * +---------------+------------------------------------------------------+
+ * | BIND_PGTBL    |  IOMMU vendor driver sets it to mandate userspace to |
+ * |               |  bind the first level/stage page table to associated |
+ * |               |  PASID (either the one specified in bind request or  |
+ * |               |  the default PASID of iommu domain), through IOMMU   |
+ * |               |  UAPI.                                               |
+ * +---------------+------------------------------------------------------+
+ * | CACHE_INVLD   |  IOMMU vendor driver sets it to mandate userspace to |
+ * |               |  explicitly invalidate the IOMMU cache through IOMMU |
+ * |               |  UAPI according to vendor-specific requirement when  |
+ * |               |  changing the 1st level/stage page table.            |
+ * +---------------+------------------------------------------------------+
+ *
+ * data struct types defined for @format:
+ * +================================+=====================================+
+ * | @format                        | data struct                         |
+ * +================================+=====================================+
+ * | IOMMU_PASID_FORMAT_INTEL_VTD   | struct iommu_nesting_info_vtd       |
+ * +--------------------------------+-------------------------------------+
+ *
+ */
+struct iommu_nesting_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	format;
+#define IOMMU_NESTING_FEAT_SYSWIDE_PASID	(1 << 0)
+#define IOMMU_NESTING_FEAT_BIND_PGTBL		(1 << 1)
+#define IOMMU_NESTING_FEAT_CACHE_INVLD		(1 << 2)
+	__u32	features;
+	__u16	addr_width;
+	__u16	pasid_bits;
+	__u8	padding[12];
+	/* Vendor specific data */
+	union {
+		struct iommu_nesting_info_vtd vtd;
+	} vendor;
+};
+
 #endif /* _UAPI_IOMMU_H */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 01/16] iommu: Report domain nesting info Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2021-01-12  6:50   ` Vivek Gautam
  2020-09-10 10:45 ` [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace Liu Yi L
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm, Will Deacon, Robin Murphy

This patch is added as instead of returning a boolean for DOMAIN_ATTR_NESTING,
iommu_domain_get_attr() should return an iommu_nesting_info handle. For
now, return an empty nesting info struct for now as true nesting is not
yet supported by the SMMUs.

Cc: Will Deacon <will@kernel.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
v5 -> v6:
*) add review-by from Eric Auger.

v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 +++++++++++++++++++++++++++--
 drivers/iommu/arm/arm-smmu/arm-smmu.c       | 29 +++++++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7196207..016e2e5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+					void *data)
+{
+	struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+	unsigned int size;
+
+	if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+		return -ENODEV;
+
+	size = sizeof(struct iommu_nesting_info);
+
+	/*
+	 * if provided buffer size is smaller than expected, should
+	 * return 0 and also the expected buffer size to caller.
+	 */
+	if (info->argsz < size) {
+		info->argsz = size;
+		return 0;
+	}
+
+	/* report an empty iommu_nesting_info for now */
+	memset(info, 0x0, size);
+	info->argsz = size;
+	return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 				    enum iommu_attr attr, void *data)
 {
@@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 	case IOMMU_DOMAIN_UNMANAGED:
 		switch (attr) {
 		case DOMAIN_ATTR_NESTING:
-			*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
-			return 0;
+			return arm_smmu_domain_nesting_info(smmu_domain, data);
 		default:
 			return -ENODEV;
 		}
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 09c42af9..368486f 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1510,6 +1510,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
+static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
+					void *data)
+{
+	struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
+	unsigned int size;
+
+	if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
+		return -ENODEV;
+
+	size = sizeof(struct iommu_nesting_info);
+
+	/*
+	 * if provided buffer size is smaller than expected, should
+	 * return 0 and also the expected buffer size to caller.
+	 */
+	if (info->argsz < size) {
+		info->argsz = size;
+		return 0;
+	}
+
+	/* report an empty iommu_nesting_info for now */
+	memset(info, 0x0, size);
+	info->argsz = size;
+	return 0;
+}
+
 static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 				    enum iommu_attr attr, void *data)
 {
@@ -1519,8 +1545,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 	case IOMMU_DOMAIN_UNMANAGED:
 		switch (attr) {
 		case DOMAIN_ATTR_NESTING:
-			*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
-			return 0;
+			return arm_smmu_domain_nesting_info(smmu_domain, data);
 		default:
 			return -ENODEV;
 		}
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 01/16] iommu: Report domain nesting info Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 02/16] iommu/smmu: Report empty " Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-11 20:16   ` Alex Williamson
  2020-09-10 10:45 ` [PATCH v7 04/16] vfio: Add PASID allocation/free support Liu Yi L
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch exports iommu nesting capability info to user space through
VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
PASID alloc/free, bind page table, and cache invalidation) and the vendor
specific format information for first level/stage page table that will be
bound to.

The nesting info is available only after container set to be NESTED type.
Current implementation imposes one limitation - one nesting container
should include at most one iommu group. The philosophy of vfio container
is having all groups/devices within the container share the same IOMMU
context. When vSVA is enabled, one IOMMU context could include one 2nd-
level address space and multiple 1st-level address spaces. While the
2nd-level address space is reasonably sharable by multiple groups, blindly
sharing 1st-level address spaces across all groups within the container
might instead break the guest expectation. In the future sub/super container
concept might be introduced to allow partial address space sharing within
an IOMMU context. But for now let's go with this restriction by requiring
singleton container for using nesting iommu features. Below link has the
related discussion about this decision.

https://lore.kernel.org/kvm/20200515115924.37e6996d@w520.home/

This patch also changes the NESTING type container behaviour. Something
that would have succeeded before will now fail: Before this series, if
user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even
if the SMMU didn't support stage-2, as the driver would have silently
fallen back on stage-1 mappings (which work exactly the same as stage-2
only since there was no nesting supported). After the series, we do check
for DOMAIN_ATTR_NESTING so if user asks for VFIO_IOMMU_TYPE1_NESTING and
the SMMU doesn't support stage-2, the ioctl fails. But it should be a good
fix and completely harmless. Detail can be found in below link as well.

https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
v6 -> v7:
*) using vfio_info_add_capability() for adding nesting cap per suggestion
   from Eric.

v5 -> v6:
*) address comments against v5 from Eric Auger.
*) don't report nesting cap to userspace if the nesting_info->format is
   invalid.

v4 -> v5:
*) address comments from Eric Auger.
*) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
   cap is much "cheap", if needs extension in future, just define another cap.
   https://lore.kernel.org/kvm/20200708132947.5b7ee954@x1.home/

v3 -> v4:
*) address comments against v3.

v1 -> v2:
*) added in v2
---
 drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++------
 include/uapi/linux/vfio.h       | 19 +++++++++
 2 files changed, 99 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c992973..3c0048b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
 		 "Maximum number of user DMA mappings per container (65535).");
 
 struct vfio_iommu {
-	struct list_head	domain_list;
-	struct list_head	iova_list;
-	struct vfio_domain	*external_domain; /* domain for external user */
-	struct mutex		lock;
-	struct rb_root		dma_list;
-	struct blocking_notifier_head notifier;
-	unsigned int		dma_avail;
-	uint64_t		pgsize_bitmap;
-	bool			v2;
-	bool			nesting;
-	bool			dirty_page_tracking;
-	bool			pinned_page_dirty_scope;
+	struct list_head		domain_list;
+	struct list_head		iova_list;
+	/* domain for external user */
+	struct vfio_domain		*external_domain;
+	struct mutex			lock;
+	struct rb_root			dma_list;
+	struct blocking_notifier_head	notifier;
+	unsigned int			dma_avail;
+	uint64_t			pgsize_bitmap;
+	bool				v2;
+	bool				nesting;
+	bool				dirty_page_tracking;
+	bool				pinned_page_dirty_scope;
+	struct iommu_nesting_info	*nesting_info;
 };
 
 struct vfio_domain {
@@ -130,6 +132,9 @@ struct vfio_regions {
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
 					(!list_empty(&iommu->domain_list))
 
+#define CONTAINER_HAS_DOMAIN(iommu)	(((iommu)->external_domain) || \
+					 (!list_empty(&(iommu)->domain_list)))
+
 #define DIRTY_BITMAP_BYTES(n)	(ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
 
 /*
@@ -1992,6 +1997,13 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 
 	list_splice_tail(iova_copy, iova);
 }
+
+static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
+{
+	kfree(iommu->nesting_info);
+	iommu->nesting_info = NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
@@ -2022,6 +2034,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		}
 	}
 
+	/* Nesting type container can include only one group */
+	if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
+		mutex_unlock(&iommu->lock);
+		return -EINVAL;
+	}
+
 	group = kzalloc(sizeof(*group), GFP_KERNEL);
 	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
 	if (!group || !domain) {
@@ -2092,6 +2110,25 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_domain;
 
+	/* Nesting cap info is available only after attaching */
+	if (iommu->nesting) {
+		int size = sizeof(struct iommu_nesting_info);
+
+		iommu->nesting_info = kzalloc(size, GFP_KERNEL);
+		if (!iommu->nesting_info) {
+			ret = -ENOMEM;
+			goto out_detach;
+		}
+
+		/* Now get the nesting info */
+		iommu->nesting_info->argsz = size;
+		ret = iommu_domain_get_attr(domain->domain,
+					    DOMAIN_ATTR_NESTING,
+					    iommu->nesting_info);
+		if (ret)
+			goto out_detach;
+	}
+
 	/* Get aperture info */
 	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
 
@@ -2201,6 +2238,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	return 0;
 
 out_detach:
+	vfio_iommu_release_nesting_info(iommu);
 	vfio_iommu_detach_group(domain, group);
 out_domain:
 	iommu_domain_free(domain->domain);
@@ -2401,6 +2439,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 					vfio_iommu_unmap_unpin_all(iommu);
 				else
 					vfio_iommu_unmap_unpin_reaccount(iommu);
+
+				vfio_iommu_release_nesting_info(iommu);
 			}
 			iommu_domain_free(domain->domain);
 			list_del(&domain->next);
@@ -2609,6 +2649,32 @@ static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
 	return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
 }
 
+static int vfio_iommu_add_nesting_cap(struct vfio_iommu *iommu,
+				      struct vfio_info_cap *caps)
+{
+	struct vfio_iommu_type1_info_cap_nesting nesting_cap;
+	size_t size;
+
+	/* when nesting_info is null, no need to go further */
+	if (!iommu->nesting_info)
+		return 0;
+
+	/* when @format of nesting_info is 0, fail the call */
+	if (iommu->nesting_info->format == 0)
+		return -ENOENT;
+
+	size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
+	       iommu->nesting_info->argsz;
+
+	nesting_cap.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_NESTING;
+	nesting_cap.header.version = 1;
+
+	memcpy(&nesting_cap.info, iommu->nesting_info,
+	       iommu->nesting_info->argsz);
+
+	return vfio_info_add_capability(caps, &nesting_cap.header, size);
+}
+
 static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
 				     unsigned long arg)
 {
@@ -2644,6 +2710,8 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
 	if (!ret)
 		ret = vfio_iommu_iova_build_caps(iommu, &caps);
 
+	ret = vfio_iommu_add_nesting_cap(iommu, &caps);
+
 	mutex_unlock(&iommu->lock);
 
 	if (ret)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9204705..ff40f9e 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
 	__u64	max_dirty_bitmap_size;		/* in bytes */
 };
 
+/*
+ * The nesting capability allows to report the related capability
+ * and info for nesting iommu type.
+ *
+ * The structures below define version 1 of this capability.
+ *
+ * Nested capabilities should be checked by the userspace after
+ * setting VFIO_TYPE1_NESTING_IOMMU.
+ *
+ * @info: the nesting info provided by IOMMU driver.
+ */
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  3
+
+struct vfio_iommu_type1_info_cap_nesting {
+	struct	vfio_info_cap_header header;
+	struct	iommu_nesting_info info;
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 04/16] vfio: Add PASID allocation/free support
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (2 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-11 20:54   ` Alex Williamson
  2020-09-10 10:45 ` [PATCH v7 05/16] iommu/vt-d: Support setting ioasid set to domain Liu Yi L
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
multiple process virtual address spaces with the device for simplified
programming model. PASID is used to tag an virtual address space in DMA
requests and to identify the related translation structure in IOMMU. When
a PASID-capable device is assigned to a VM, we want the same capability
of using PASID to tag guest process virtual address spaces to achieve
virtual SVA (vSVA).

PASID management for guest is vendor specific. Some vendors (e.g. Intel
VT-d) requires system-wide managed PASIDs across all devices, regardless
of whether a device is used by host or assigned to guest. Other vendors
(e.g. ARM SMMU) may allow PASIDs managed per-device thus could be fully
delegated to the guest for assigned devices.

For system-wide managed PASIDs, this patch introduces a vfio module to
handle explicit PASID alloc/free requests from guest. Allocated PASIDs
are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
object is introduced to track mm_struct. Multiple VFIO containers within
a process share the same vfio_mm object.

A quota mechanism is provided to prevent malicious user from exhausting
available PASIDs. Currently the quota is a global parameter applied to
all VFIO devices. In the future per-device quota might be supported too.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
v6 -> v7:
*) remove "#include <linux/eventfd.h>" and add r-b from Eric Auger.

v5 -> v6:
*) address comments from Eric. Add vfio_unlink_pasid() to be consistent
   with vfio_unlink_dma(). Add a comment in vfio_pasid_exit().

v4 -> v5:
*) address comments from Eric Auger.
*) address the comments from Alex on the pasid free range support. Added
   per vfio_mm pasid r-b tree.
   https://lore.kernel.org/kvm/20200709082751.320742ab@x1.home/

v3 -> v4:
*) fix lock leam in vfio_mm_get_from_task()
*) drop pasid_quota field in struct vfio_mm
*) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when !CONFIG_VFIO_PASID

v1 -> v2:
*) added in v2, split from the pasid alloc/free support of v1
---
 drivers/vfio/Kconfig      |   5 +
 drivers/vfio/Makefile     |   1 +
 drivers/vfio/vfio_pasid.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h      |  28 ++++++
 4 files changed, 281 insertions(+)
 create mode 100644 drivers/vfio/vfio_pasid.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index fd17db9..3d8a108 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -19,6 +19,11 @@ config VFIO_VIRQFD
 	depends on VFIO && EVENTFD
 	default n
 
+config VFIO_PASID
+	tristate
+	depends on IOASID && VFIO
+	default n
+
 menuconfig VFIO
 	tristate "VFIO Non-Privileged userspace driver framework"
 	depends on IOMMU_API
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index de67c47..bb836a3 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
 
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
+obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
 obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
 obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
new file mode 100644
index 0000000..44ecdd5
--- /dev/null
+++ b/drivers/vfio/vfio_pasid.c
@@ -0,0 +1,247 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Intel Corporation.
+ *     Author: Liu Yi L <yi.l.liu@intel.com>
+ *
+ */
+
+#include <linux/vfio.h>
+#include <linux/file.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/sched/mm.h>
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
+#define DRIVER_DESC     "PASID management for VFIO bus drivers"
+
+#define VFIO_DEFAULT_PASID_QUOTA	1000
+static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
+module_param_named(pasid_quota, pasid_quota, uint, 0444);
+MODULE_PARM_DESC(pasid_quota,
+		 "Set the quota for max number of PASIDs that an application is allowed to request (default 1000)");
+
+struct vfio_mm_token {
+	unsigned long long val;
+};
+
+struct vfio_mm {
+	struct kref		kref;
+	struct ioasid_set	*ioasid_set;
+	struct mutex		pasid_lock;
+	struct rb_root		pasid_list;
+	struct list_head	next;
+	struct vfio_mm_token	token;
+};
+
+static struct mutex		vfio_mm_lock;
+static struct list_head		vfio_mm_list;
+
+struct vfio_pasid {
+	struct rb_node		node;
+	ioasid_t		pasid;
+};
+
+static void vfio_remove_all_pasids(struct vfio_mm *vmm);
+
+/* called with vfio.vfio_mm_lock held */
+static void vfio_mm_release(struct kref *kref)
+{
+	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
+
+	list_del(&vmm->next);
+	mutex_unlock(&vfio_mm_lock);
+	vfio_remove_all_pasids(vmm);
+	ioasid_set_put(vmm->ioasid_set);//FIXME: should vfio_pasid get ioasid_set after allocation?
+	kfree(vmm);
+}
+
+void vfio_mm_put(struct vfio_mm *vmm)
+{
+	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock);
+}
+
+static void vfio_mm_get(struct vfio_mm *vmm)
+{
+	kref_get(&vmm->kref);
+}
+
+struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
+{
+	struct mm_struct *mm = get_task_mm(task);
+	struct vfio_mm *vmm;
+	unsigned long long val = (unsigned long long)mm;
+	int ret;
+
+	mutex_lock(&vfio_mm_lock);
+	/* Search existing vfio_mm with current mm pointer */
+	list_for_each_entry(vmm, &vfio_mm_list, next) {
+		if (vmm->token.val == val) {
+			vfio_mm_get(vmm);
+			goto out;
+		}
+	}
+
+	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
+	if (!vmm) {
+		vmm = ERR_PTR(-ENOMEM);
+		goto out;
+	}
+
+	/*
+	 * IOASID core provides a 'IOASID set' concept to track all
+	 * PASIDs associated with a token. Here we use mm_struct as
+	 * the token and create a IOASID set per mm_struct. All the
+	 * containers of the process share the same IOASID set.
+	 */
+	vmm->ioasid_set = ioasid_alloc_set(mm, pasid_quota, IOASID_SET_TYPE_MM);
+	if (IS_ERR(vmm->ioasid_set)) {
+		ret = PTR_ERR(vmm->ioasid_set);
+		kfree(vmm);
+		vmm = ERR_PTR(ret);
+		goto out;
+	}
+
+	kref_init(&vmm->kref);
+	vmm->token.val = val;
+	mutex_init(&vmm->pasid_lock);
+	vmm->pasid_list = RB_ROOT;
+
+	list_add(&vmm->next, &vfio_mm_list);
+out:
+	mutex_unlock(&vfio_mm_lock);
+	mmput(mm);
+	return vmm;
+}
+
+/*
+ * Find PASID within @min and @max
+ */
+static struct vfio_pasid *vfio_find_pasid(struct vfio_mm *vmm,
+					  ioasid_t min, ioasid_t max)
+{
+	struct rb_node *node = vmm->pasid_list.rb_node;
+
+	while (node) {
+		struct vfio_pasid *vid = rb_entry(node,
+						struct vfio_pasid, node);
+
+		if (max < vid->pasid)
+			node = node->rb_left;
+		else if (min > vid->pasid)
+			node = node->rb_right;
+		else
+			return vid;
+	}
+
+	return NULL;
+}
+
+static void vfio_link_pasid(struct vfio_mm *vmm, struct vfio_pasid *new)
+{
+	struct rb_node **link = &vmm->pasid_list.rb_node, *parent = NULL;
+	struct vfio_pasid *vid;
+
+	while (*link) {
+		parent = *link;
+		vid = rb_entry(parent, struct vfio_pasid, node);
+
+		if (new->pasid <= vid->pasid)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &vmm->pasid_list);
+}
+
+static void vfio_unlink_pasid(struct vfio_mm *vmm, struct vfio_pasid *old)
+{
+	rb_erase(&old->node, &vmm->pasid_list);
+}
+
+static void vfio_remove_pasid(struct vfio_mm *vmm, struct vfio_pasid *vid)
+{
+	vfio_unlink_pasid(vmm, vid);
+	ioasid_free(vmm->ioasid_set, vid->pasid);
+	kfree(vid);
+}
+
+static void vfio_remove_all_pasids(struct vfio_mm *vmm)
+{
+	struct rb_node *node;
+
+	mutex_lock(&vmm->pasid_lock);
+	while ((node = rb_first(&vmm->pasid_list)))
+		vfio_remove_pasid(vmm, rb_entry(node, struct vfio_pasid, node));
+	mutex_unlock(&vmm->pasid_lock);
+}
+
+int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
+{
+	ioasid_t pasid;
+	struct vfio_pasid *vid;
+
+	pasid = ioasid_alloc(vmm->ioasid_set, min, max, NULL);
+	if (pasid == INVALID_IOASID)
+		return -ENOSPC;
+
+	vid = kzalloc(sizeof(*vid), GFP_KERNEL);
+	if (!vid) {
+		ioasid_free(vmm->ioasid_set, pasid);
+		return -ENOMEM;
+	}
+
+	vid->pasid = pasid;
+
+	mutex_lock(&vmm->pasid_lock);
+	vfio_link_pasid(vmm, vid);
+	mutex_unlock(&vmm->pasid_lock);
+
+	return pasid;
+}
+
+void vfio_pasid_free_range(struct vfio_mm *vmm,
+			   ioasid_t min, ioasid_t max)
+{
+	struct vfio_pasid *vid = NULL;
+
+	/*
+	 * IOASID core will notify PASID users (e.g. IOMMU driver) to
+	 * teardown necessary structures depending on the to-be-freed
+	 * PASID.
+	 */
+	mutex_lock(&vmm->pasid_lock);
+	while ((vid = vfio_find_pasid(vmm, min, max)) != NULL)
+		vfio_remove_pasid(vmm, vid);
+	mutex_unlock(&vmm->pasid_lock);
+}
+
+static int __init vfio_pasid_init(void)
+{
+	mutex_init(&vfio_mm_lock);
+	INIT_LIST_HEAD(&vfio_mm_list);
+	return 0;
+}
+
+static void __exit vfio_pasid_exit(void)
+{
+	/*
+	 * VFIO_PASID is supposed to be referenced by VFIO_IOMMU_TYPE1
+	 * and may be other module. once vfio_pasid_exit() is triggered,
+	 * that means its user (e.g. VFIO_IOMMU_TYPE1) has been removed.
+	 * All the vfio_mm instances should have been released. If not,
+	 * means there is vfio_mm leak, should be a bug of user module.
+	 * So just warn here.
+	 */
+	WARN_ON(!list_empty(&vfio_mm_list));
+}
+
+module_init(vfio_pasid_init);
+module_exit(vfio_pasid_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 38d3c6a..31472a9 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -97,6 +97,34 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
 extern void vfio_unregister_iommu_driver(
 				const struct vfio_iommu_driver_ops *ops);
 
+struct vfio_mm;
+#if IS_ENABLED(CONFIG_VFIO_PASID)
+extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
+extern void vfio_mm_put(struct vfio_mm *vmm);
+extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
+extern void vfio_pasid_free_range(struct vfio_mm *vmm,
+				  ioasid_t min, ioasid_t max);
+#else
+static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
+{
+	return ERR_PTR(-ENOTTY);
+}
+
+static inline void vfio_mm_put(struct vfio_mm *vmm)
+{
+}
+
+static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
+{
+	return -ENOTTY;
+}
+
+static inline void vfio_pasid_free_range(struct vfio_mm *vmm,
+					  ioasid_t min, ioasid_t max)
+{
+}
+#endif /* CONFIG_VFIO_PASID */
+
 /*
  * External user API
  */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 05/16] iommu/vt-d: Support setting ioasid set to domain
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (3 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 04/16] vfio: Add PASID allocation/free support Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 06/16] iommu/vt-d: Remove get_task_mm() in bind_gpasid() Liu Yi L
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

From IOMMU p.o.v., PASIDs allocated and managed by external components
(e.g. VFIO) will be passed in for gpasid_bind/unbind operation. IOMMU
needs some knowledge to check the PASID ownership, hence add an interface
for those components to tell the PASID owner.

In latest kernel design, PASID ownership is managed by IOASID set where
the PASID is allocated from. This patch adds support for setting ioasid
set ID to the domains used for nesting/vSVA. Subsequent SVA operations
will check the PASID against its IOASID set for proper ownership.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) add a helper function __domain_config_ioasid_set() per Eric's comment.
*) rename @ioasid_sid field of struct dmar_domain to be @pasid_set.
*) Eric gave r-b against v6, but since there is change, so will seek for his
   r-b again on this version.

v5 -> v6:
*) address comments against v5 from Eric Auger.

v4 -> v5:
*) address comments from Eric Auger.
---
 drivers/iommu/intel/iommu.c | 26 ++++++++++++++++++++++++++
 include/linux/intel-iommu.h |  4 ++++
 include/linux/iommu.h       |  1 +
 3 files changed, 31 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5813eea..d1c77fc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1806,6 +1806,7 @@ static struct dmar_domain *alloc_domain(int flags)
 	if (first_level_by_default())
 		domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
 	domain->has_iotlb_device = false;
+	domain->pasid_set = host_pasid_set;
 	INIT_LIST_HEAD(&domain->devices);
 
 	return domain;
@@ -6007,6 +6008,22 @@ static bool intel_iommu_is_attach_deferred(struct iommu_domain *domain,
 	return attach_deferred(dev);
 }
 
+static int __domain_config_ioasid_set(struct dmar_domain *domain,
+				      struct ioasid_set *set)
+{
+	if (!(domain->flags & DOMAIN_FLAG_NESTING_MODE))
+		return -ENODEV;
+
+	if (domain->pasid_set != host_pasid_set &&
+	    domain->pasid_set != set) {
+		pr_warn_ratelimited("multi ioasid_set setting to domain");
+		return -EBUSY;
+	}
+
+	domain->pasid_set = set;
+	return 0;
+}
+
 static int
 intel_iommu_domain_set_attr(struct iommu_domain *domain,
 			    enum iommu_attr attr, void *data)
@@ -6030,6 +6047,15 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
 		}
 		spin_unlock_irqrestore(&device_domain_lock, flags);
 		break;
+	case DOMAIN_ATTR_IOASID_SET:
+	{
+		struct ioasid_set *set = (struct ioasid_set *)data;
+
+		spin_lock_irqsave(&device_domain_lock, flags);
+		ret = __domain_config_ioasid_set(dmar_domain, set);
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		break;
+	}
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index d36038e..3345391 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -549,6 +549,10 @@ struct dmar_domain {
 					   2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
 	u64		max_addr;	/* maximum mapped address */
 
+	struct ioasid_set *pasid_set;	/*
+					 * the ioasid set which tracks all
+					 * PASIDs used by the domain.
+					 */
 	int		default_pasid;	/*
 					 * The default pasid used for non-SVM
 					 * traffic on mediated devices.
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index c364b1c..5b9f630 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -118,6 +118,7 @@ enum iommu_attr {
 	DOMAIN_ATTR_FSL_PAMUV1,
 	DOMAIN_ATTR_NESTING,	/* two stages of translation */
 	DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
+	DOMAIN_ATTR_IOASID_SET,
 	DOMAIN_ATTR_MAX,
 };
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 06/16] iommu/vt-d: Remove get_task_mm() in bind_gpasid()
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (4 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 05/16] iommu/vt-d: Support setting ioasid set to domain Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 07/16] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free) Liu Yi L
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch is to address a REVISIT. As ioasid_set is added to domain,
upper layer/VFIO can set ioasid_set to iommu driver, and track the
PASID ownership, so no need to get_task_mm() in intel_svm_bind_gpasid().

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/iommu/intel/svm.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 519eabb..d3cf52b 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -400,12 +400,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 			ret = -ENOMEM;
 			goto out;
 		}
-		/* REVISIT: upper layer/VFIO can track host process that bind
-		 * the PASID. ioasid_set = mm might be sufficient for vfio to
-		 * check pasid VMM ownership. We can drop the following line
-		 * once VFIO and IOASID set check is in place.
-		 */
-		svm->mm = get_task_mm(current);
 		svm->pasid = data->hpasid;
 		if (data->flags & IOMMU_SVA_GPASID_VAL) {
 			svm->gpasid = data->gpasid;
@@ -420,7 +414,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 		INIT_WORK(&svm->work, intel_svm_free_async_fn);
 		ioasid_attach_data(data->hpasid, svm);
 		INIT_LIST_HEAD_RCU(&svm->devs);
-		mmput(svm->mm);
 	}
 	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
 	if (!sdev) {
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 07/16] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (5 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 06/16] iommu/vt-d: Remove get_task_mm() in bind_gpasid() Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-11 21:38   ` Alex Williamson
  2020-09-10 10:45 ` [PATCH v7 08/16] iommu: Pass domain to sva_unbind_gpasid() Liu Yi L
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch allows userspace to request PASID allocation/free, e.g. when
serving the request from the guest.

PASIDs that are not freed by userspace are automatically freed when the
IOASID set is destroyed when process exits.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) current VFIO returns allocated pasid via signed int, thus VFIO UAPI
   can only support 31 bits pasid. If user space gives min,max which is
   wider than 31 bits, should fail the allocation or free request.

v5 -> v6:
*) address comments from Eric against v5. remove the alloc/free helper.

v4 -> v5:
*) address comments from Eric Auger.
*) the comments for the PASID_FREE request is addressed in patch 5/15 of
   this series.

v3 -> v4:
*) address comments from v3, except the below comment against the range
   of PASID_FREE request. needs more help on it.
    "> +if (req.range.min > req.range.max)

     Is it exploitable that a user can spin the kernel for a long time in
     the case of a free by calling this with [0, MAX_UINT] regardless of
     their actual allocations?"
    https://lore.kernel.org/linux-iommu/20200702151832.048b44d1@x1.home/

v1 -> v2:
*) move the vfio_mm related code to be a seprate module
*) use a single structure for alloc/free, could support a range of PASIDs
*) fetch vfio_mm at group_attach time instead of at iommu driver open time
---
 drivers/vfio/Kconfig            |  1 +
 drivers/vfio/vfio_iommu_type1.c | 76 +++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_pasid.c       | 10 ++++++
 include/linux/vfio.h            |  6 ++++
 include/uapi/linux/vfio.h       | 43 +++++++++++++++++++++++
 5 files changed, 136 insertions(+)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 3d8a108..95d90c6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -2,6 +2,7 @@
 config VFIO_IOMMU_TYPE1
 	tristate
 	depends on VFIO
+	select VFIO_PASID if (X86)
 	default n
 
 config VFIO_IOMMU_SPAPR_TCE
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3c0048b..bd4b668 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -76,6 +76,7 @@ struct vfio_iommu {
 	bool				dirty_page_tracking;
 	bool				pinned_page_dirty_scope;
 	struct iommu_nesting_info	*nesting_info;
+	struct vfio_mm			*vmm;
 };
 
 struct vfio_domain {
@@ -2000,6 +2001,11 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
 
 static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
 {
+	if (iommu->vmm) {
+		vfio_mm_put(iommu->vmm);
+		iommu->vmm = NULL;
+	}
+
 	kfree(iommu->nesting_info);
 	iommu->nesting_info = NULL;
 }
@@ -2127,6 +2133,26 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 					    iommu->nesting_info);
 		if (ret)
 			goto out_detach;
+
+		if (iommu->nesting_info->features &
+					IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
+			struct vfio_mm *vmm;
+			struct ioasid_set *set;
+
+			vmm = vfio_mm_get_from_task(current);
+			if (IS_ERR(vmm)) {
+				ret = PTR_ERR(vmm);
+				goto out_detach;
+			}
+			iommu->vmm = vmm;
+
+			set = vfio_mm_ioasid_set(vmm);
+			ret = iommu_domain_set_attr(domain->domain,
+						    DOMAIN_ATTR_IOASID_SET,
+						    set);
+			if (ret)
+				goto out_detach;
+		}
 	}
 
 	/* Get aperture info */
@@ -2908,6 +2934,54 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
 	return -EINVAL;
 }
 
+static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
+					  unsigned long arg)
+{
+	struct vfio_iommu_type1_pasid_request req;
+	unsigned long minsz;
+	int ret;
+
+	minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
+
+	if (copy_from_user(&req, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
+		return -EINVAL;
+
+	/*
+	 * Current VFIO_IOMMU_PASID_REQUEST only supports at most
+	 * 31 bits PASID. The min,max value from userspace should
+	 * not exceed 31 bits.
+	 */
+	if (req.range.min > req.range.max ||
+	    req.range.min > (1 << VFIO_IOMMU_PASID_BITS) ||
+	    req.range.max > (1 << VFIO_IOMMU_PASID_BITS))
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+	if (!iommu->vmm) {
+		mutex_unlock(&iommu->lock);
+		return -EOPNOTSUPP;
+	}
+
+	switch (req.flags & VFIO_PASID_REQUEST_MASK) {
+	case VFIO_IOMMU_FLAG_ALLOC_PASID:
+		ret = vfio_pasid_alloc(iommu->vmm, req.range.min,
+				       req.range.max);
+		break;
+	case VFIO_IOMMU_FLAG_FREE_PASID:
+		vfio_pasid_free_range(iommu->vmm, req.range.min,
+				      req.range.max);
+		ret = 0;
+		break;
+	default:
+		ret = -EINVAL;
+	}
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2924,6 +2998,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		return vfio_iommu_type1_unmap_dma(iommu, arg);
 	case VFIO_IOMMU_DIRTY_PAGES:
 		return vfio_iommu_type1_dirty_pages(iommu, arg);
+	case VFIO_IOMMU_PASID_REQUEST:
+		return vfio_iommu_type1_pasid_request(iommu, arg);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
index 44ecdd5..0ec4660 100644
--- a/drivers/vfio/vfio_pasid.c
+++ b/drivers/vfio/vfio_pasid.c
@@ -60,6 +60,7 @@ void vfio_mm_put(struct vfio_mm *vmm)
 {
 	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock);
 }
+EXPORT_SYMBOL_GPL(vfio_mm_put);
 
 static void vfio_mm_get(struct vfio_mm *vmm)
 {
@@ -113,6 +114,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
 	mmput(mm);
 	return vmm;
 }
+EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
+
+struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm)
+{
+	return vmm->ioasid_set;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_ioasid_set);
 
 /*
  * Find PASID within @min and @max
@@ -201,6 +209,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
 
 	return pasid;
 }
+EXPORT_SYMBOL_GPL(vfio_pasid_alloc);
 
 void vfio_pasid_free_range(struct vfio_mm *vmm,
 			   ioasid_t min, ioasid_t max)
@@ -217,6 +226,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
 		vfio_remove_pasid(vmm, vid);
 	mutex_unlock(&vmm->pasid_lock);
 }
+EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
 
 static int __init vfio_pasid_init(void)
 {
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 31472a9..5c3d7a8 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -101,6 +101,7 @@ struct vfio_mm;
 #if IS_ENABLED(CONFIG_VFIO_PASID)
 extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
 extern void vfio_mm_put(struct vfio_mm *vmm);
+extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm);
 extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
 extern void vfio_pasid_free_range(struct vfio_mm *vmm,
 				  ioasid_t min, ioasid_t max);
@@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm)
 {
 }
 
+static inline struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm)
+{
+	return -ENOTTY;
+}
+
 static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
 {
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ff40f9e..a4bc42e 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1172,6 +1172,49 @@ struct vfio_iommu_type1_dirty_bitmap_get {
 
 #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
 
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ * PASID (Processor Address Space ID) is a PCIe concept for tagging
+ * address spaces in DMA requests. When system-wide PASID allocation
+ * is required by the underlying iommu driver (e.g. Intel VT-d), this
+ * provides an interface for userspace to request pasid alloc/free
+ * for its assigned devices. Userspace should check the availability
+ * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through
+ * VFIO_IOMMU_GET_INFO.
+ *
+ * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within @range.
+ * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range.
+ * @range is [min, max], which means both @min and @max are inclusive.
+ * ALLOC_PASID and FREE_PASID are mutually exclusive.
+ *
+ * Current interface supports at most 31 bits PASID bits as returning
+ * PASID allocation result via signed int. PCIe spec defines 20 bits
+ * for PASID width, so 31 bits is enough. As a result user space should
+ * provide min, max no more than 31 bits.
+ * returns: allocated PASID value on success, -errno on failure for
+ *	    ALLOC_PASID;
+ *	    0 for FREE_PASID operation;
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_FLAG_ALLOC_PASID	(1 << 0)
+#define VFIO_IOMMU_FLAG_FREE_PASID	(1 << 1)
+	__u32	flags;
+	struct {
+		__u32	min;
+		__u32	max;
+	} range;
+};
+
+#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_FLAG_ALLOC_PASID | \
+					 VFIO_IOMMU_FLAG_FREE_PASID)
+
+#define VFIO_IOMMU_PASID_BITS		31
+
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 08/16] iommu: Pass domain to sva_unbind_gpasid()
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (6 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 07/16] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free) Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 09/16] iommu/vt-d: Check ownership for PASIDs from user-space Liu Yi L
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

From: Yi Sun <yi.y.sun@intel.com>

Current interface is good enough for SVA virtualization on an assigned
physical PCI device, but when it comes to mediated devices, a physical
device may be attached with multiple aux-domains. Also, for guest unbind,
the PASID to be unbind should be allocated to the VM. This check requires
to know the ioasid_set which is associated with the domain.

So this interface needs to pass in domain info. Then the iommu driver is
able to know which domain will be used for the 2nd stage translation of
the nesting mode and also be able to do PASID ownership check. This patch
passes @domain per the above reason. Also, the prototype of &pasid is
changed from an "int" to "u32" as the below link.

[PATCH v6 01/12] iommu: Change type of pasid to u32
https://lore.kernel.org/linux-iommu/1594684087-61184-2-git-send-email-fenghua.yu@intel.com/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Yi Sun <yi.y.sun@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
v6 -> v7:
*) correct the link for the details of modifying pasid prototype to bve "u32".
*) hold off r-b from Eric Auger as there is modification in this patch, will
   seek r-b in this version.

v5 -> v6:
*) use "u32" prototype for @pasid.
*) add review-by from Eric Auger.

v2 -> v3:
*) pass in domain info only
*) use u32 for pasid instead of int type

v1 -> v2:
*) added in v2.
---
 drivers/iommu/intel/svm.c   | 3 ++-
 drivers/iommu/iommu.c       | 2 +-
 include/linux/intel-iommu.h | 3 ++-
 include/linux/iommu.h       | 3 ++-
 4 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index d3cf52b..d39fafb 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -476,7 +476,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	return ret;
 }
 
-int intel_svm_unbind_gpasid(struct device *dev, int pasid)
+int intel_svm_unbind_gpasid(struct iommu_domain *domain,
+			    struct device *dev, u32 pasid)
 {
 	struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
 	struct intel_svm_dev *sdev;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3bc263a..52aabb64 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2155,7 +2155,7 @@ int iommu_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev,
 	if (unlikely(!domain->ops->sva_unbind_gpasid))
 		return -ENODEV;
 
-	return domain->ops->sva_unbind_gpasid(dev, data->hpasid);
+	return domain->ops->sva_unbind_gpasid(domain, dev, data->hpasid);
 }
 EXPORT_SYMBOL_GPL(iommu_sva_unbind_gpasid);
 
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 3345391..ce0b33b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -741,7 +741,8 @@ extern int intel_svm_enable_prq(struct intel_iommu *iommu);
 extern int intel_svm_finish_prq(struct intel_iommu *iommu);
 int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 			  struct iommu_gpasid_bind_data *data);
-int intel_svm_unbind_gpasid(struct device *dev, int pasid);
+int intel_svm_unbind_gpasid(struct iommu_domain *domain,
+			    struct device *dev, u32 pasid);
 struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm,
 				 void *drvdata);
 void intel_svm_unbind(struct iommu_sva *handle);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5b9f630..d561448 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -297,7 +297,8 @@ struct iommu_ops {
 	int (*sva_bind_gpasid)(struct iommu_domain *domain,
 			struct device *dev, struct iommu_gpasid_bind_data *data);
 
-	int (*sva_unbind_gpasid)(struct device *dev, int pasid);
+	int (*sva_unbind_gpasid)(struct iommu_domain *domain,
+				 struct device *dev, u32 pasid);
 
 	int (*def_domain_type)(struct device *dev);
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 09/16] iommu/vt-d: Check ownership for PASIDs from user-space
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (7 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 08/16] iommu: Pass domain to sva_unbind_gpasid() Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID Liu Yi L
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

When an IOMMU domain with nesting attribute is used for guest SVA, a
system-wide PASID is allocated for binding with the device and the domain.
For security reason, we need to check the PASID passed from user-space.
e.g. page table bind/unbind and PASID related cache invalidation.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) acquire device_domain_lock in bind/unbind_gpasid() to ensure dmar_domain
   is not modified during bind/unbind_gpasid().
*) the change to svm.c varies from previous version as Jacob refactored the
   svm.c code.
---
 drivers/iommu/intel/iommu.c | 29 +++++++++++++++++++++++++----
 drivers/iommu/intel/svm.c   | 33 ++++++++++++++++++++++++---------
 include/linux/intel-iommu.h |  2 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index d1c77fc..95740b9 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5451,6 +5451,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
 		int granu = 0;
 		u64 pasid = 0;
 		u64 addr = 0;
+		void *pdata;
 
 		granu = to_vtd_granularity(cache_type, inv_info->granularity);
 		if (granu == -EINVAL) {
@@ -5470,6 +5471,15 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
 			 (inv_info->granu.addr_info.flags & IOMMU_INV_ADDR_FLAGS_PASID))
 			pasid = inv_info->granu.addr_info.pasid;
 
+		pdata = ioasid_find(dmar_domain->pasid_set, pasid, NULL);
+		if (!pdata) {
+			ret = -EINVAL;
+			goto out_unlock;
+		} else if (IS_ERR(pdata)) {
+			ret = PTR_ERR(pdata);
+			goto out_unlock;
+		}
+
 		switch (BIT(cache_type)) {
 		case IOMMU_CACHE_INV_TYPE_IOTLB:
 			/* HW will ignore LSB bits based on address mask */
@@ -5787,12 +5797,14 @@ static void intel_iommu_get_resv_regions(struct device *device,
 	list_add_tail(&reg->list, head);
 }
 
-int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
+/*
+ * Caller should have held device_domain_lock
+ */
+int intel_iommu_enable_pasid_locked(struct intel_iommu *iommu, struct device *dev)
 {
 	struct device_domain_info *info;
 	struct context_entry *context;
 	struct dmar_domain *domain;
-	unsigned long flags;
 	u64 ctx_lo;
 	int ret;
 
@@ -5800,7 +5812,6 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 	if (!domain)
 		return -EINVAL;
 
-	spin_lock_irqsave(&device_domain_lock, flags);
 	spin_lock(&iommu->lock);
 
 	ret = -EINVAL;
@@ -5833,11 +5844,21 @@ int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
 
  out:
 	spin_unlock(&iommu->lock);
-	spin_unlock_irqrestore(&device_domain_lock, flags);
 
 	return ret;
 }
 
+int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev)
+{
+	unsigned long flags;
+	int ret;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	ret = intel_iommu_enable_pasid_locked(iommu, dev);
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+	return ret;
+}
+
 static void intel_iommu_apply_resv_region(struct device *dev,
 					  struct iommu_domain *domain,
 					  struct iommu_resv_region *region)
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index d39fafb..80f58ab 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -293,7 +293,9 @@ static LIST_HEAD(global_svm_list);
 	list_for_each_entry((sdev), &(svm)->devs, list)	\
 		if ((d) != (sdev)->dev) {} else
 
-static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
+static int pasid_to_svm_sdev(struct device *dev,
+			     struct ioasid_set *set,
+			     unsigned int pasid,
 			     struct intel_svm **rsvm,
 			     struct intel_svm_dev **rsdev)
 {
@@ -307,7 +309,7 @@ static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
 	if (pasid == INVALID_IOASID || pasid >= PASID_MAX)
 		return -EINVAL;
 
-	svm = ioasid_find(NULL, pasid, NULL);
+	svm = ioasid_find(set, pasid, NULL);
 	if (IS_ERR(svm))
 		return PTR_ERR(svm);
 
@@ -344,6 +346,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	struct intel_svm_dev *sdev = NULL;
 	struct dmar_domain *dmar_domain;
 	struct intel_svm *svm = NULL;
+	unsigned long flags;
 	int ret = 0;
 
 	if (WARN_ON(!iommu) || !data)
@@ -377,7 +380,9 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	dmar_domain = to_dmar_domain(domain);
 
 	mutex_lock(&pasid_mutex);
-	ret = pasid_to_svm_sdev(dev, data->hpasid, &svm, &sdev);
+	spin_lock_irqsave(&device_domain_lock, flags);
+	ret = pasid_to_svm_sdev(dev, dmar_domain->pasid_set,
+				data->hpasid, &svm, &sdev);
 	if (ret)
 		goto out;
 
@@ -395,7 +400,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 
 	if (!svm) {
 		/* We come here when PASID has never been bond to a device. */
-		svm = kzalloc(sizeof(*svm), GFP_KERNEL);
+		svm = kzalloc(sizeof(*svm), GFP_ATOMIC);
 		if (!svm) {
 			ret = -ENOMEM;
 			goto out;
@@ -415,7 +420,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 		ioasid_attach_data(data->hpasid, svm);
 		INIT_LIST_HEAD_RCU(&svm->devs);
 	}
-	sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
+	sdev = kzalloc(sizeof(*sdev), GFP_ATOMIC);
 	if (!sdev) {
 		ret = -ENOMEM;
 		goto out;
@@ -427,7 +432,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 		sdev->users = 1;
 
 	/* Set up device context entry for PASID if not enabled already */
-	ret = intel_iommu_enable_pasid(iommu, sdev->dev);
+	ret = intel_iommu_enable_pasid_locked(iommu, sdev->dev);
 	if (ret) {
 		dev_err_ratelimited(dev, "Failed to enable PASID capability\n");
 		kfree(sdev);
@@ -462,6 +467,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
 	init_rcu_head(&sdev->rcu);
 	list_add_rcu(&sdev->list, &svm->devs);
  out:
+	spin_unlock_irqrestore(&device_domain_lock, flags);
 	if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
 		ioasid_attach_data(data->hpasid, NULL);
 		kfree(svm);
@@ -480,15 +486,22 @@ int intel_svm_unbind_gpasid(struct iommu_domain *domain,
 			    struct device *dev, u32 pasid)
 {
 	struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL);
+	struct dmar_domain *dmar_domain;
 	struct intel_svm_dev *sdev;
 	struct intel_svm *svm;
+	unsigned long flags;
 	int ret;
 
 	if (WARN_ON(!iommu))
 		return -EINVAL;
 
+	dmar_domain = to_dmar_domain(domain);
+
 	mutex_lock(&pasid_mutex);
-	ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
+	spin_lock_irqsave(&device_domain_lock, flags);
+	ret = pasid_to_svm_sdev(dev, dmar_domain->pasid_set,
+				pasid, &svm, &sdev);
+	spin_unlock_irqrestore(&device_domain_lock, flags);
 	if (ret)
 		goto out;
 
@@ -712,7 +725,8 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
 	if (!iommu)
 		goto out;
 
-	ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
+	ret = pasid_to_svm_sdev(dev, host_pasid_set,
+				pasid, &svm, &sdev);
 	if (ret)
 		goto out;
 
@@ -1204,7 +1218,8 @@ int intel_svm_page_response(struct device *dev,
 		goto out;
 	}
 
-	ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
+	ret = pasid_to_svm_sdev(dev, host_pasid_set,
+				prm->pasid, &svm, &sdev);
 	if (ret || !sdev) {
 		ret = -ENODEV;
 		goto out;
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index ce0b33b..db7fc59 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -730,6 +730,8 @@ struct intel_iommu *domain_get_iommu(struct dmar_domain *domain);
 int for_each_device_domain(int (*fn)(struct device_domain_info *info,
 				     void *data), void *data);
 void iommu_flush_write_buffer(struct intel_iommu *iommu);
+int intel_iommu_enable_pasid_locked(struct intel_iommu *iommu,
+				    struct device *dev);
 int intel_iommu_enable_pasid(struct intel_iommu *iommu, struct device *dev);
 struct dmar_domain *find_domain(struct device *dev);
 struct device_domain_info *get_domain_info(struct device *dev);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (8 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 09/16] iommu/vt-d: Check ownership for PASIDs from user-space Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-11 22:03   ` Alex Williamson
  2020-09-10 10:45 ` [PATCH v7 11/16] vfio/type1: Allow invalidating first-level/stage IOMMU cache Liu Yi L
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

Nesting translation allows two-levels/stages page tables, with 1st level
for guest translations (e.g. GVA->GPA), 2nd level for host translations
(e.g. GPA->HPA). This patch adds interface for binding guest page tables
to a PASID. This PASID must have been allocated by the userspace before
the binding request.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) introduced @user in struct domain_capsule to simplify the code per Eric's
   suggestion.
*) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace.
*) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn().

v5 -> v6:
*) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule().
   per comment from Eric.
*) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
   linux/iommu.h for userspace operation and in-kernel operation.

v3 -> v4:
*) address comments from Alex on v3

v2 -> v3:
*) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
   https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun.pan@linux.intel.com/

v1 -> v2:
*) rename subject from "vfio/type1: Bind guest page tables to host"
*) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/
   unbind guet page table
*) replaced vfio_iommu_for_each_dev() with a group level loop since this
   series enforces one group per container w/ nesting type as start.
*) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
*) vfio_dev_unbind_gpasid() always successful
*) use vfio_mm->pasid_lock to avoid race between PASID free and page table
   bind/unbind
---
 drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_pasid.c       |  26 +++++++
 include/linux/vfio.h            |  20 +++++
 include/uapi/linux/vfio.h       |  36 +++++++++
 4 files changed, 245 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bd4b668..11f1156 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -149,6 +149,39 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+struct domain_capsule {
+	struct vfio_group	*group;
+	struct iommu_domain	*domain;
+	/* set if @data contains a user pointer*/
+	bool			user;
+	void			*data;
+};
+
+/* iommu->lock must be held */
+static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu,
+					       struct domain_capsule *dc)
+{
+	struct vfio_domain *domain = NULL;
+	struct vfio_group *group = NULL;
+
+	if (!iommu->nesting_info)
+		return -EINVAL;
+
+	/*
+	 * Only support singleton container with nesting type. If
+	 * nesting_info is non-NULL, the container is non-empty.
+	 * Also domain is non-empty.
+	 */
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+	group = list_first_entry(&domain->group_list,
+				 struct vfio_group, next);
+	dc->group = group;
+	dc->domain = domain->domain;
+	dc->user = true;
+	return 0;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
@@ -2405,6 +2438,49 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	unsigned long arg = *(unsigned long *)dc->data;
+
+	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
+					  (void __user *)arg);
+}
+
+static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+
+	if (dc->user) {
+		unsigned long arg = *(unsigned long *)dc->data;
+
+		iommu_uapi_sva_unbind_gpasid(dc->domain,
+					     dev, (void __user *)arg);
+	} else {
+		struct iommu_gpasid_bind_data *unbind_data =
+				(struct iommu_gpasid_bind_data *)dc->data;
+
+		iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
+	}
+	return 0;
+}
+
+static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct iommu_gpasid_bind_data unbind_data;
+
+	unbind_data.argsz = sizeof(struct iommu_gpasid_bind_data);
+	unbind_data.flags = 0;
+	unbind_data.hpasid = pasid;
+
+	dc->user = false;
+	dc->data = &unbind_data;
+
+	iommu_group_for_each_dev(dc->group->iommu_group,
+				 dc, vfio_dev_unbind_gpasid_fn);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -2448,6 +2524,20 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		if (!group)
 			continue;
 
+		if (iommu->vmm && (iommu->nesting_info->features &
+					IOMMU_NESTING_FEAT_BIND_PGTBL)) {
+			struct domain_capsule dc = { .group = group,
+						     .domain = domain->domain,
+						     .data = NULL };
+
+			/*
+			 * Unbind page tables bound with system wide PASIDs
+			 * which are allocated to userspace.
+			 */
+			vfio_mm_for_each_pasid(iommu->vmm, &dc,
+					       vfio_group_unbind_gpasid_fn);
+		}
+
 		vfio_iommu_detach_group(domain, group);
 		update_dirty_scope = !group->pinned_page_dirty_scope;
 		list_del(&group->next);
@@ -2982,6 +3072,77 @@ static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu,
+				       bool is_bind, unsigned long arg)
+{
+	struct domain_capsule dc = { .data = &arg };
+	struct iommu_nesting_info *info;
+	int ret;
+
+	mutex_lock(&iommu->lock);
+
+	info = iommu->nesting_info;
+	if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) {
+		ret = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	if (!iommu->vmm) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = vfio_prepare_nesting_domain_capsule(iommu, &dc);
+	if (ret)
+		goto out_unlock;
+
+	/* Avoid race with other containers within the same process */
+	vfio_mm_pasid_lock(iommu->vmm);
+
+	if (is_bind)
+		ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc,
+					       vfio_dev_bind_gpasid_fn);
+	if (ret || !is_bind)
+		iommu_group_for_each_dev(dc.group->iommu_group,
+					 &dc, vfio_dev_unbind_gpasid_fn);
+
+	vfio_mm_pasid_unlock(iommu->vmm);
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
+					unsigned long arg)
+{
+	struct vfio_iommu_type1_nesting_op hdr;
+	unsigned int minsz;
+	int ret;
+
+	minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags);
+
+	if (copy_from_user(&hdr, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz ||
+	    hdr.flags & ~VFIO_NESTING_OP_MASK ||
+	    (hdr.flags & VFIO_NESTING_OP_MASK) >= VFIO_IOMMU_NESTING_OP_NUM)
+		return -EINVAL;
+
+	switch (hdr.flags & VFIO_NESTING_OP_MASK) {
+	case VFIO_IOMMU_NESTING_OP_BIND_PGTBL:
+		ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz);
+		break;
+	case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
+		ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -3000,6 +3161,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		return vfio_iommu_type1_dirty_pages(iommu, arg);
 	case VFIO_IOMMU_PASID_REQUEST:
 		return vfio_iommu_type1_pasid_request(iommu, arg);
+	case VFIO_IOMMU_NESTING_OP:
+		return vfio_iommu_type1_nesting_op(iommu, arg);
 	default:
 		return -ENOTTY;
 	}
diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
index 0ec4660..9e2e4b0 100644
--- a/drivers/vfio/vfio_pasid.c
+++ b/drivers/vfio/vfio_pasid.c
@@ -220,6 +220,8 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
 	 * IOASID core will notify PASID users (e.g. IOMMU driver) to
 	 * teardown necessary structures depending on the to-be-freed
 	 * PASID.
+	 * Hold pasid_lock also avoids race with PASID usages like bind/
+	 * unbind page tables to requested PASID.
 	 */
 	mutex_lock(&vmm->pasid_lock);
 	while ((vid = vfio_find_pasid(vmm, min, max)) != NULL)
@@ -228,6 +230,30 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
 }
 EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
 
+int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
+			   void (*fn)(ioasid_t id, void *data))
+{
+	int ret;
+
+	mutex_lock(&vmm->pasid_lock);
+	ret = ioasid_set_for_each_ioasid(vmm->ioasid_set, fn, data);
+	mutex_unlock(&vmm->pasid_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_for_each_pasid);
+
+void vfio_mm_pasid_lock(struct vfio_mm *vmm)
+{
+	mutex_lock(&vmm->pasid_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_lock);
+
+void vfio_mm_pasid_unlock(struct vfio_mm *vmm)
+{
+	mutex_unlock(&vmm->pasid_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_unlock);
+
 static int __init vfio_pasid_init(void)
 {
 	mutex_init(&vfio_mm_lock);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 5c3d7a8..6a999c3 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -105,6 +105,11 @@ extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm);
 extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
 extern void vfio_pasid_free_range(struct vfio_mm *vmm,
 				  ioasid_t min, ioasid_t max);
+extern int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
+				  void (*fn)(ioasid_t id, void *data));
+extern void vfio_mm_pasid_lock(struct vfio_mm *vmm);
+extern void vfio_mm_pasid_unlock(struct vfio_mm *vmm);
+
 #else
 static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
 {
@@ -129,6 +134,21 @@ static inline void vfio_pasid_free_range(struct vfio_mm *vmm,
 					  ioasid_t min, ioasid_t max)
 {
 }
+
+static inline int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
+					 void (*fn)(ioasid_t id, void *data))
+{
+	return -ENOTTY;
+}
+
+static inline void vfio_mm_pasid_lock(struct vfio_mm *vmm)
+{
+}
+
+static inline void vfio_mm_pasid_unlock(struct vfio_mm *vmm)
+{
+}
+
 #endif /* CONFIG_VFIO_PASID */
 
 /*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a4bc42e..a99bd71 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1215,6 +1215,42 @@ struct vfio_iommu_type1_pasid_request {
 
 #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 18)
 
+/**
+ * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 19,
+ *				struct vfio_iommu_type1_nesting_op)
+ *
+ * This interface allows userspace to utilize the nesting IOMMU
+ * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING
+ * cap through VFIO_IOMMU_GET_INFO. For platforms which require
+ * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID
+ * _REQUEST.
+ *
+ * @data[] types defined for each op:
+ * +=================+===============================================+
+ * | NESTING OP      |      @data[]                                  |
+ * +=================+===============================================+
+ * | BIND_PGTBL      |      struct iommu_gpasid_bind_data            |
+ * +-----------------+-----------------------------------------------+
+ * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
+ * +-----------------+-----------------------------------------------+
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_nesting_op {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_NESTING_OP_MASK	(0xffff) /* lower 16-bits for op */
+	__u8	data[];
+};
+
+enum {
+	VFIO_IOMMU_NESTING_OP_BIND_PGTBL,
+	VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL,
+	VFIO_IOMMU_NESTING_OP_NUM,
+};
+
+#define VFIO_IOMMU_NESTING_OP		_IO(VFIO_TYPE, VFIO_BASE + 19)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 11/16] vfio/type1: Allow invalidating first-level/stage IOMMU cache
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (9 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 12/16] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu Yi L
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch provides an interface allowing the userspace to invalidate
IOMMU cache for first-level page table. It is required when the first
level IOMMU page table is not managed by the host kernel in the nested
translation setup.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v1 -> v2:
*) rename from "vfio/type1: Flush stage-1 IOMMU cache for nesting type"
*) rename vfio_cache_inv_fn() to vfio_dev_cache_invalidate_fn()
*) vfio_dev_cache_inv_fn() always successful
*) remove VFIO_IOMMU_CACHE_INVALIDATE, and reuse VFIO_IOMMU_NESTING_OP
---
 drivers/vfio/vfio_iommu_type1.c | 38 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  3 +++
 2 files changed, 41 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 11f1156..b67ce2d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -3112,6 +3112,41 @@ static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	unsigned long arg = *(unsigned long *)dc->data;
+
+	iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
+	return 0;
+}
+
+static long vfio_iommu_invalidate_cache(struct vfio_iommu *iommu,
+					unsigned long arg)
+{
+	struct domain_capsule dc = { .data = &arg };
+	struct iommu_nesting_info *info;
+	int ret;
+
+	mutex_lock(&iommu->lock);
+	info = iommu->nesting_info;
+	if (!info || !(info->features & IOMMU_NESTING_FEAT_CACHE_INVLD)) {
+		ret = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	ret = vfio_prepare_nesting_domain_capsule(iommu, &dc);
+	if (ret)
+		goto out_unlock;
+
+	iommu_group_for_each_dev(dc.group->iommu_group, &dc,
+				 vfio_dev_cache_invalidate_fn);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
 					unsigned long arg)
 {
@@ -3136,6 +3171,9 @@ static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
 	case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
 		ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
 		break;
+	case VFIO_IOMMU_NESTING_OP_CACHE_INVLD:
+		ret = vfio_iommu_invalidate_cache(iommu, arg + minsz);
+		break;
 	default:
 		ret = -EINVAL;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a99bd71..a09a407 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1233,6 +1233,8 @@ struct vfio_iommu_type1_pasid_request {
  * +-----------------+-----------------------------------------------+
  * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
  * +-----------------+-----------------------------------------------+
+ * | CACHE_INVLD     |      struct iommu_cache_invalidate_info       |
+ * +-----------------+-----------------------------------------------+
  *
  * returns: 0 on success, -errno on failure.
  */
@@ -1246,6 +1248,7 @@ struct vfio_iommu_type1_nesting_op {
 enum {
 	VFIO_IOMMU_NESTING_OP_BIND_PGTBL,
 	VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL,
+	VFIO_IOMMU_NESTING_OP_CACHE_INVLD,
 	VFIO_IOMMU_NESTING_OP_NUM,
 };
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 12/16] vfio/type1: Add vSVA support for IOMMU-backed mdevs
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (10 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 11/16] vfio/type1: Allow invalidating first-level/stage IOMMU cache Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest Liu Yi L
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

Recent years, mediated device pass-through framework (e.g. vfio-mdev)
is used to achieve flexible device sharing across domains (e.g. VMs).
Also there are hardware assisted mediated pass-through solutions from
platform vendors. e.g. Intel VT-d scalable mode which supports Intel
Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-domain
concept, which means mdevs are protected by an iommu domain which is
auxiliary to the domain that the kernel driver primarily uses for DMA
API. Details can be found in the KVM presentation as below:

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf

This patch extends NESTING_IOMMU ops to IOMMU-backed mdev devices. The
main requirement is to use the auxiliary domain associated with mdev.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
CC: Jun Tian <jun.j.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
v5 -> v6:
*) add review-by from Eric Auger.

v1 -> v2:
*) check the iommu_device to ensure the handling mdev is IOMMU-backed
---
 drivers/vfio/vfio_iommu_type1.c | 36 +++++++++++++++++++++++++++++++-----
 1 file changed, 31 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b67ce2d..5cef732 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2438,29 +2438,49 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static struct device *vfio_get_iommu_device(struct vfio_group *group,
+					    struct device *dev)
+{
+	if (group->mdev_group)
+		return vfio_mdev_get_iommu_device(dev);
+	else
+		return dev;
+}
+
 static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
 {
 	struct domain_capsule *dc = (struct domain_capsule *)data;
 	unsigned long arg = *(unsigned long *)dc->data;
+	struct device *iommu_device;
+
+	iommu_device = vfio_get_iommu_device(dc->group, dev);
+	if (!iommu_device)
+		return -EINVAL;
 
-	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
+	return iommu_uapi_sva_bind_gpasid(dc->domain, iommu_device,
 					  (void __user *)arg);
 }
 
 static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
 {
 	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct device *iommu_device;
+
+	iommu_device = vfio_get_iommu_device(dc->group, dev);
+	if (!iommu_device)
+		return -EINVAL;
 
 	if (dc->user) {
 		unsigned long arg = *(unsigned long *)dc->data;
 
-		iommu_uapi_sva_unbind_gpasid(dc->domain,
-					     dev, (void __user *)arg);
+		iommu_uapi_sva_unbind_gpasid(dc->domain, iommu_device,
+					     (void __user *)arg);
 	} else {
 		struct iommu_gpasid_bind_data *unbind_data =
 				(struct iommu_gpasid_bind_data *)dc->data;
 
-		iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
+		iommu_sva_unbind_gpasid(dc->domain,
+					iommu_device, unbind_data);
 	}
 	return 0;
 }
@@ -3116,8 +3136,14 @@ static int vfio_dev_cache_invalidate_fn(struct device *dev, void *data)
 {
 	struct domain_capsule *dc = (struct domain_capsule *)data;
 	unsigned long arg = *(unsigned long *)dc->data;
+	struct device *iommu_device;
+
+	iommu_device = vfio_get_iommu_device(dc->group, dev);
+	if (!iommu_device)
+		return -EINVAL;
 
-	iommu_uapi_cache_invalidate(dc->domain, dev, (void __user *)arg);
+	iommu_uapi_cache_invalidate(dc->domain, iommu_device,
+				    (void __user *)arg);
 	return 0;
 }
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (11 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 12/16] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-11 22:13   ` Alex Williamson
  2020-09-10 10:45 ` [PATCH v7 14/16] vfio: Document dual stage control Liu Yi L
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch exposes PCIe PASID capability to guest for assigned devices.
Existing vfio_pci driver hides it from guest by setting the capability
length as 0 in pci_ext_cap_length[].

And this patch only exposes PASID capability for devices which has PCIe
PASID extended struture in its configuration space. VFs will not expose
the PASID capability as they do not implement the PASID extended structure
in their config space. It is a TODO in future. Related discussion can be
found in below link:

https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
v5 -> v6:
*) add review-by from Eric Auger.

v1 -> v2:
*) added in v2, but it was sent in a separate patchseries before
---
 drivers/vfio/pci/vfio_pci_config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d98843f..07ff2e6 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
 	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
 	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
 	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
-	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
+	[PCI_EXT_CAP_ID_PASID]	=	PCI_EXT_CAP_PASID_SIZEOF,
 };
 
 /*
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 14/16] vfio: Document dual stage control
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (12 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are consistent across iommu units Liu Yi L
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

From: Eric Auger <eric.auger@redhat.com>

The VFIO API was enhanced to support nested stage control: a bunch of
new ioctls and usage guideline.

Let's document the process to follow to set up nested mode.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
---
v6 -> v7:
*) tweak per Eric's comments.

v5 -> v6:
*) tweak per Eric's comments.

v3 -> v4:
*) add review-by from Stefan Hajnoczi

v2 -> v3:
*) address comments from Stefan Hajnoczi

v1 -> v2:
*) new in v2, compared with Eric's original version, pasid table bind
   and fault reporting is removed as this series doesn't cover them.
   Original version from Eric.
   https://lore.kernel.org/kvm/20200320161911.27494-12-eric.auger@redhat.com/
---
 Documentation/driver-api/vfio.rst | 76 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index f1a4d3c..10851dd 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -239,6 +239,82 @@ group and can access them as follows::
 	/* Gratuitous device reset and go... */
 	ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMU Dual Stage Control
+------------------------
+
+Some IOMMUs support 2 stages/levels of translation. Stage corresponds
+to the ARM terminology while level corresponds to Intel's terminology.
+In the following text we use either without distinction.
+
+This is useful when the guest is exposed with a virtual IOMMU and some
+devices are assigned to the guest through VFIO. Then the guest OS can
+use stage-1 (GIOVA -> GPA or GVA->GPA), while the hypervisor uses stage
+2 for VM isolation (GPA -> HPA).
+
+Under dual stage translation, the guest gets ownership of the stage-1
+page tables or both the stage-1 configuration structures and page tables.
+This depends on vendor. e.g. on Intel platform, guest owns stage-1 page
+tables under nesting. While on ARM, guest owns both the stage-1 configuration
+structures and page tables under nesting. The hypervisor owns the root
+configuration structure (for security reason), including stage-2 configuration.
+This works as long as configuration structures and page table formats are
+compatible between the virtual IOMMU and the physical IOMMU.
+
+Assuming the HW supports it, this nested mode is selected by choosing the
+VFIO_TYPE1_NESTING_IOMMU type through:
+
+    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
+
+This forces the hypervisor to use the stage-2, leaving stage-1 available
+for guest usage.
+The stage-1 format and binding method are reported in nesting capability.
+(VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) through VFIO_IOMMU_GET_INFO:
+
+    ioctl(container->fd, VFIO_IOMMU_GET_INFO, &nesting_info);
+
+The nesting cap info is available only after NESTING_IOMMU is selected.
+If the underlying IOMMU doesn't support nesting, VFIO_SET_IOMMU fails and
+userspace should try other IOMMU types. Details of the nesting cap info
+can be found in Documentation/userspace-api/iommu.rst.
+
+Bind stage-1 page table to the IOMMU differs per platform. On Intel,
+the stage1 page table info are mediated by the userspace for each PASID.
+On ARM, the userspace directly passes the GPA of the whole PASID table.
+Currently only Intel's binding is supported (IOMMU_NESTING_FEAT_BIND_PGTBL)
+is supported:
+
+    nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
+    memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
+    ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+When multiple stage-1 page tables are supported on a device, each page
+table is associated with a PASID (Process Address Space ID) to differentiate
+with each other. In such case, userspace should include PASID in the
+bind_data when issuing direct binding request.
+
+PASID could be managed per-device or system-wide which, again, depends on
+IOMMU vendor and is reported in nesting cap info. When system-wide policy
+is reported (IOMMU_NESTING_FEAT_SYSWIDE_PASID), e.g. as by Intel platforms,
+userspace *must* allocate PASID from VFIO before attempting binding of
+stage-1 page table:
+
+    req.flags = VFIO_IOMMU_ALLOC_PASID;
+    ioctl(container, VFIO_IOMMU_PASID_REQUEST, &req);
+
+Once the stage-1 page table is bound to the IOMMU, the guest is allowed to
+fully manage its mapping at its disposal. The IOMMU walks nested stage-1
+and stage-2 page tables when serving DMA requests from assigned device, and
+may cache the stage-1 mapping in the IOTLB. When required (IOMMU_NESTING_
+FEAT_CACHE_INVLD), userspace *must* forward guest stage-1 invalidation to
+the host, so the IOTLB is invalidated:
+
+    nesting_op->flags = VFIO_IOMMU_NESTING_OP_CACHE_INVLD;
+    memcpy(&nesting_op->data, &cache_inv_data, sizeof(cache_inv_data));
+    ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
+
+Forwarded invalidations can happen at various granularity levels (page
+level, context level, etc.)
+
 VFIO User API
 -------------------------------------------------------------------------------
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are consistent across iommu units
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (13 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 14/16] vfio: Document dual stage control Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-10 10:45 ` [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info Liu Yi L
  2020-09-14  4:20 ` [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Jason Wang
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch makes change to only supports the case where all the physical iommu
units have the same CAP/ECAP MASKS for nested translation.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 - > v7:
*) added per comment from Eric.
   https://lore.kernel.org/kvm/7fe337fa-abbc-82be-c8e8-b9e2a6179b90@redhat.com/
---
 drivers/iommu/intel/iommu.c |  8 ++++++--
 include/linux/intel-iommu.h | 16 ++++++++++++++++
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 95740b9..38c6c9b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -5675,12 +5675,16 @@ static inline bool iommu_pasid_support(void)
 static inline bool nested_mode_support(void)
 {
 	struct dmar_drhd_unit *drhd;
-	struct intel_iommu *iommu;
+	struct intel_iommu *iommu, *prev = NULL;
 	bool ret = true;
 
 	rcu_read_lock();
 	for_each_active_iommu(iommu, drhd) {
-		if (!sm_supported(iommu) || !ecap_nest(iommu->ecap)) {
+		if (!prev)
+			prev = iommu;
+		if (!sm_supported(iommu) || !ecap_nest(iommu->ecap) ||
+		    (VTD_CAP_MASK & (iommu->cap ^ prev->cap)) ||
+		    (VTD_ECAP_MASK & (iommu->ecap ^ prev->ecap))) {
 			ret = false;
 			break;
 		}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index db7fc59..e7b7512 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -197,6 +197,22 @@
 #define ecap_max_handle_mask(e) ((e >> 20) & 0xf)
 #define ecap_sc_support(e)	((e >> 7) & 0x1) /* Snooping Control */
 
+/* Nesting Support Capability Alignment */
+#define VTD_CAP_FL1GP		BIT_ULL(56)
+#define VTD_CAP_FL5LP		BIT_ULL(60)
+#define VTD_ECAP_PRS		BIT_ULL(29)
+#define VTD_ECAP_ERS		BIT_ULL(30)
+#define VTD_ECAP_SRS		BIT_ULL(31)
+#define VTD_ECAP_EAFS		BIT_ULL(34)
+#define VTD_ECAP_PASID		BIT_ULL(40)
+
+/* Only capabilities marked in below MASKs are reported */
+#define VTD_CAP_MASK		(VTD_CAP_FL1GP | VTD_CAP_FL5LP)
+
+#define VTD_ECAP_MASK		(VTD_ECAP_PRS | VTD_ECAP_ERS | \
+				 VTD_ECAP_SRS | VTD_ECAP_EAFS | \
+				 VTD_ECAP_PASID)
+
 /* Virtual command interface capability */
 #define vccap_pasid(v)		(((v) & DMA_VCS_PAS)) /* PASID allocation */
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (14 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are consistent across iommu units Liu Yi L
@ 2020-09-10 10:45 ` Liu Yi L
  2020-09-14  4:20 ` [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Jason Wang
  16 siblings, 0 replies; 81+ messages in thread
From: Liu Yi L @ 2020-09-10 10:45 UTC (permalink / raw)
  To: alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, jasowang, hao.wu, stefanha,
	iommu, kvm

This patch reports nesting info when iommu_domain_get_attr() is called with
DOMAIN_ATTR_NESTING and one domain with nesting set.

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
v6 -> v7:
*) split the patch in v6 into two patches:
   [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are consistent across iommu units
   [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info

v2 -> v3:
*) remove cap/ecap_mask in iommu_nesting_info.
---
 drivers/iommu/intel/iommu.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 38c6c9b..e46214e 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -6089,6 +6089,79 @@ intel_iommu_domain_set_attr(struct iommu_domain *domain,
 	return ret;
 }
 
+static int intel_iommu_get_nesting_info(struct iommu_domain *domain,
+					struct iommu_nesting_info *info)
+{
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	u64 cap = VTD_CAP_MASK, ecap = VTD_ECAP_MASK;
+	struct device_domain_info *domain_info;
+	struct iommu_nesting_info_vtd vtd;
+	unsigned int size;
+
+	if (!info)
+		return -EINVAL;
+
+	if (domain->type != IOMMU_DOMAIN_UNMANAGED ||
+	    !(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
+		return -ENODEV;
+
+	size = sizeof(struct iommu_nesting_info);
+	/*
+	 * if provided buffer size is smaller than expected, should
+	 * return 0 and also the expected buffer size to caller.
+	 */
+	if (info->argsz < size) {
+		info->argsz = size;
+		return 0;
+	}
+
+	/*
+	 * arbitrary select the first domain_info as all nesting
+	 * related capabilities should be consistent across iommu
+	 * units.
+	 */
+	domain_info = list_first_entry(&dmar_domain->devices,
+				       struct device_domain_info, link);
+	cap &= domain_info->iommu->cap;
+	ecap &= domain_info->iommu->ecap;
+
+	info->addr_width = dmar_domain->gaw;
+	info->format = IOMMU_PASID_FORMAT_INTEL_VTD;
+	info->features = IOMMU_NESTING_FEAT_SYSWIDE_PASID |
+			 IOMMU_NESTING_FEAT_BIND_PGTBL |
+			 IOMMU_NESTING_FEAT_CACHE_INVLD;
+	info->pasid_bits = ilog2(intel_pasid_max_id);
+	memset(&info->padding, 0x0, 12);
+
+	vtd.flags = 0;
+	vtd.cap_reg = cap;
+	vtd.ecap_reg = ecap;
+
+	memcpy(&info->vendor.vtd, &vtd, sizeof(vtd));
+	return 0;
+}
+
+static int intel_iommu_domain_get_attr(struct iommu_domain *domain,
+				       enum iommu_attr attr, void *data)
+{
+	switch (attr) {
+	case DOMAIN_ATTR_NESTING:
+	{
+		struct iommu_nesting_info *info =
+				(struct iommu_nesting_info *)data;
+		unsigned long flags;
+		int ret;
+
+		spin_lock_irqsave(&device_domain_lock, flags);
+		ret = intel_iommu_get_nesting_info(domain, info);
+		spin_unlock_irqrestore(&device_domain_lock, flags);
+		return ret;
+	}
+	default:
+		return -ENOENT;
+	}
+}
+
 /*
  * Check that the device does not live on an external facing PCI port that is
  * marked as untrusted. Such devices should not be able to apply quirks and
@@ -6111,6 +6184,7 @@ const struct iommu_ops intel_iommu_ops = {
 	.domain_alloc		= intel_iommu_domain_alloc,
 	.domain_free		= intel_iommu_domain_free,
 	.domain_set_attr	= intel_iommu_domain_set_attr,
+	.domain_get_attr	= intel_iommu_domain_get_attr,
 	.attach_dev		= intel_iommu_attach_device,
 	.detach_dev		= intel_iommu_detach_device,
 	.aux_attach_dev		= intel_iommu_aux_attach_device,
-- 
2.7.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 01/16] iommu: Report domain nesting info
  2020-09-10 10:45 ` [PATCH v7 01/16] iommu: Report domain nesting info Liu Yi L
@ 2020-09-11 19:38   ` Alex Williamson
  0 siblings, 0 replies; 81+ messages in thread
From: Alex Williamson @ 2020-09-11 19:38 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, jasowang, hao.wu,
	stefanha, iommu, kvm

On Thu, 10 Sep 2020 03:45:18 -0700
Liu Yi L <yi.l.liu@intel.com> wrote:

> IOMMUs that support nesting translation needs report the capability info
> to userspace. It gives information about requirements the userspace needs
> to implement plus other features characterizing the physical implementation.
> 
> This patch introduces a new IOMMU UAPI struct that gives information about
> the nesting capabilities and features. This struct is supposed to be returned
> by iommu_domain_get_attr() with DOMAIN_ATTR_NESTING attribute parameter, with
> one domain whose type has been set to DOMAIN_ATTR_NESTING.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
> v6 -> v7:
> *) rephrase the commit message, replace the @data[] field in struct
>    iommu_nesting_info with union per comments from Eric Auger.
> 
> v5 -> v6:
> *) rephrase the feature notes per comments from Eric Auger.
> *) rename @size of struct iommu_nesting_info to @argsz.
> 
> v4 -> v5:
> *) address comments from Eric Auger.
> 
> v3 -> v4:
> *) split the SMMU driver changes to be a separate patch
> *) move the @addr_width and @pasid_bits from vendor specific
>    part to generic part.
> *) tweak the description for the @features field of struct
>    iommu_nesting_info.
> *) add description on the @data[] field of struct iommu_nesting_info
> 
> v2 -> v3:
> *) remvoe cap/ecap_mask in iommu_nesting_info.
> *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> *) return an empty iommu_nesting_info for SMMU drivers per Jean'
>    suggestion.
> ---
>  include/uapi/linux/iommu.h | 76 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
> 
> diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> index 1ebc23d..ff987e4 100644
> --- a/include/uapi/linux/iommu.h
> +++ b/include/uapi/linux/iommu.h
> @@ -341,4 +341,80 @@ struct iommu_gpasid_bind_data {
>  	} vendor;
>  };
>  
> +/*
> + * struct iommu_nesting_info_vtd - Intel VT-d specific nesting info.
> + *
> + * @flags:	VT-d specific flags. Currently reserved for future
> + *		extension. must be set to 0.
> + * @cap_reg:	Describe basic capabilities as defined in VT-d capability
> + *		register.
> + * @ecap_reg:	Describe the extended capabilities as defined in VT-d
> + *		extended capability register.
> + */
> +struct iommu_nesting_info_vtd {
> +	__u32	flags;
> +	__u64	cap_reg;
> +	__u64	ecap_reg;
> +};


The vendor union has 8-byte alignment, so flags here will be 8-byte
aligned, followed by a compiler dependent gap before the 8-byte fields.
We should fill that gap with padding to make it deterministic for
userspace.  Thanks,

Alex

> +
> +/*
> + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> + *			       userspace should check it before using
> + *			       nesting capability.
> + *
> + * @argsz:	size of the whole structure.
> + * @flags:	currently reserved for future extension. must set to 0.
> + * @format:	PASID table entry format, the same definition as struct
> + *		iommu_gpasid_bind_data @format.
> + * @features:	supported nesting features.
> + * @addr_width:	The output addr width of first level/stage translation
> + * @pasid_bits:	Maximum supported PASID bits, 0 represents no PASID
> + *		support.
> + * @vendor:	vendor specific data, structure type can be deduced from
> + *		@format field.
> + *
> + * +===============+======================================================+
> + * | feature       |  Notes                                               |
> + * +===============+======================================================+
> + * | SYSWIDE_PASID |  IOMMU vendor driver sets it to mandate userspace    |
> + * |               |  to allocate PASID from kernel. All PASID allocation |
> + * |               |  free must be mediated through the IOMMU UAPI.       |
> + * +---------------+------------------------------------------------------+
> + * | BIND_PGTBL    |  IOMMU vendor driver sets it to mandate userspace to |
> + * |               |  bind the first level/stage page table to associated |
> + * |               |  PASID (either the one specified in bind request or  |
> + * |               |  the default PASID of iommu domain), through IOMMU   |
> + * |               |  UAPI.                                               |
> + * +---------------+------------------------------------------------------+
> + * | CACHE_INVLD   |  IOMMU vendor driver sets it to mandate userspace to |
> + * |               |  explicitly invalidate the IOMMU cache through IOMMU |
> + * |               |  UAPI according to vendor-specific requirement when  |
> + * |               |  changing the 1st level/stage page table.            |
> + * +---------------+------------------------------------------------------+
> + *
> + * data struct types defined for @format:
> + * +================================+=====================================+
> + * | @format                        | data struct                         |
> + * +================================+=====================================+
> + * | IOMMU_PASID_FORMAT_INTEL_VTD   | struct iommu_nesting_info_vtd       |
> + * +--------------------------------+-------------------------------------+
> + *
> + */
> +struct iommu_nesting_info {
> +	__u32	argsz;
> +	__u32	flags;
> +	__u32	format;
> +#define IOMMU_NESTING_FEAT_SYSWIDE_PASID	(1 << 0)
> +#define IOMMU_NESTING_FEAT_BIND_PGTBL		(1 << 1)
> +#define IOMMU_NESTING_FEAT_CACHE_INVLD		(1 << 2)
> +	__u32	features;
> +	__u16	addr_width;
> +	__u16	pasid_bits;
> +	__u8	padding[12];
> +	/* Vendor specific data */
> +	union {
> +		struct iommu_nesting_info_vtd vtd;
> +	} vendor;
> +};
> +
>  #endif /* _UAPI_IOMMU_H */


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace
  2020-09-10 10:45 ` [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace Liu Yi L
@ 2020-09-11 20:16   ` Alex Williamson
  2020-09-12  8:24     ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-11 20:16 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, jasowang, hao.wu,
	stefanha, iommu, kvm

On Thu, 10 Sep 2020 03:45:20 -0700
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch exports iommu nesting capability info to user space through
> VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
> PASID alloc/free, bind page table, and cache invalidation) and the vendor
> specific format information for first level/stage page table that will be
> bound to.
> 
> The nesting info is available only after container set to be NESTED type.
> Current implementation imposes one limitation - one nesting container
> should include at most one iommu group. The philosophy of vfio container
> is having all groups/devices within the container share the same IOMMU
> context. When vSVA is enabled, one IOMMU context could include one 2nd-
> level address space and multiple 1st-level address spaces. While the
> 2nd-level address space is reasonably sharable by multiple groups, blindly
> sharing 1st-level address spaces across all groups within the container
> might instead break the guest expectation. In the future sub/super container
> concept might be introduced to allow partial address space sharing within
> an IOMMU context. But for now let's go with this restriction by requiring
> singleton container for using nesting iommu features. Below link has the
> related discussion about this decision.
> 
> https://lore.kernel.org/kvm/20200515115924.37e6996d@w520.home/
> 
> This patch also changes the NESTING type container behaviour. Something
> that would have succeeded before will now fail: Before this series, if
> user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have succeeded even
> if the SMMU didn't support stage-2, as the driver would have silently
> fallen back on stage-1 mappings (which work exactly the same as stage-2
> only since there was no nesting supported). After the series, we do check
> for DOMAIN_ATTR_NESTING so if user asks for VFIO_IOMMU_TYPE1_NESTING and
> the SMMU doesn't support stage-2, the ioctl fails. But it should be a good
> fix and completely harmless. Detail can be found in below link as well.
> 
> https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
> v6 -> v7:
> *) using vfio_info_add_capability() for adding nesting cap per suggestion
>    from Eric.
> 
> v5 -> v6:
> *) address comments against v5 from Eric Auger.
> *) don't report nesting cap to userspace if the nesting_info->format is
>    invalid.
> 
> v4 -> v5:
> *) address comments from Eric Auger.
> *) return struct iommu_nesting_info for VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
>    cap is much "cheap", if needs extension in future, just define another cap.
>    https://lore.kernel.org/kvm/20200708132947.5b7ee954@x1.home/
> 
> v3 -> v4:
> *) address comments against v3.
> 
> v1 -> v2:
> *) added in v2
> ---
>  drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++------
>  include/uapi/linux/vfio.h       | 19 +++++++++
>  2 files changed, 99 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c992973..3c0048b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
>  		 "Maximum number of user DMA mappings per container (65535).");
>  
>  struct vfio_iommu {
> -	struct list_head	domain_list;
> -	struct list_head	iova_list;
> -	struct vfio_domain	*external_domain; /* domain for external user */
> -	struct mutex		lock;
> -	struct rb_root		dma_list;
> -	struct blocking_notifier_head notifier;
> -	unsigned int		dma_avail;
> -	uint64_t		pgsize_bitmap;
> -	bool			v2;
> -	bool			nesting;
> -	bool			dirty_page_tracking;
> -	bool			pinned_page_dirty_scope;
> +	struct list_head		domain_list;
> +	struct list_head		iova_list;
> +	/* domain for external user */
> +	struct vfio_domain		*external_domain;
> +	struct mutex			lock;
> +	struct rb_root			dma_list;
> +	struct blocking_notifier_head	notifier;
> +	unsigned int			dma_avail;
> +	uint64_t			pgsize_bitmap;
> +	bool				v2;
> +	bool				nesting;
> +	bool				dirty_page_tracking;
> +	bool				pinned_page_dirty_scope;
> +	struct iommu_nesting_info	*nesting_info;

Nit, not as important as the previous alignment, but might as well move
this up with the uint64_t pgsize_bitmap with the bools at the end of
the structure to avoid adding new gaps.

>  };
>  
>  struct vfio_domain {
> @@ -130,6 +132,9 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>  					(!list_empty(&iommu->domain_list))
>  
> +#define CONTAINER_HAS_DOMAIN(iommu)	(((iommu)->external_domain) || \
> +					 (!list_empty(&(iommu)->domain_list)))
> +
>  #define DIRTY_BITMAP_BYTES(n)	(ALIGN(n, BITS_PER_TYPE(u64)) / BITS_PER_BYTE)
>  
>  /*
> @@ -1992,6 +1997,13 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>  
>  	list_splice_tail(iova_copy, iova);
>  }
> +
> +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> +{
> +	kfree(iommu->nesting_info);
> +	iommu->nesting_info = NULL;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
> @@ -2022,6 +2034,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  		}
>  	}
>  
> +	/* Nesting type container can include only one group */
> +	if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
> +		mutex_unlock(&iommu->lock);
> +		return -EINVAL;
> +	}
> +
>  	group = kzalloc(sizeof(*group), GFP_KERNEL);
>  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>  	if (!group || !domain) {
> @@ -2092,6 +2110,25 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_domain;
>  
> +	/* Nesting cap info is available only after attaching */
> +	if (iommu->nesting) {
> +		int size = sizeof(struct iommu_nesting_info);
> +
> +		iommu->nesting_info = kzalloc(size, GFP_KERNEL);
> +		if (!iommu->nesting_info) {
> +			ret = -ENOMEM;
> +			goto out_detach;
> +		}
> +
> +		/* Now get the nesting info */
> +		iommu->nesting_info->argsz = size;
> +		ret = iommu_domain_get_attr(domain->domain,
> +					    DOMAIN_ATTR_NESTING,
> +					    iommu->nesting_info);
> +		if (ret)
> +			goto out_detach;
> +	}
> +
>  	/* Get aperture info */
>  	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY, &geo);
>  
> @@ -2201,6 +2238,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	return 0;
>  
>  out_detach:
> +	vfio_iommu_release_nesting_info(iommu);
>  	vfio_iommu_detach_group(domain, group);
>  out_domain:
>  	iommu_domain_free(domain->domain);
> @@ -2401,6 +2439,8 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					vfio_iommu_unmap_unpin_all(iommu);
>  				else
>  					vfio_iommu_unmap_unpin_reaccount(iommu);
> +
> +				vfio_iommu_release_nesting_info(iommu);
>  			}
>  			iommu_domain_free(domain->domain);
>  			list_del(&domain->next);
> @@ -2609,6 +2649,32 @@ static int vfio_iommu_migration_build_caps(struct vfio_iommu *iommu,
>  	return vfio_info_add_capability(caps, &cap_mig.header, sizeof(cap_mig));
>  }
>  
> +static int vfio_iommu_add_nesting_cap(struct vfio_iommu *iommu,
> +				      struct vfio_info_cap *caps)
> +{
> +	struct vfio_iommu_type1_info_cap_nesting nesting_cap;
> +	size_t size;
> +
> +	/* when nesting_info is null, no need to go further */
> +	if (!iommu->nesting_info)
> +		return 0;
> +
> +	/* when @format of nesting_info is 0, fail the call */
> +	if (iommu->nesting_info->format == 0)
> +		return -ENOENT;


Should we fail this in the attach_group?  Seems the user would be in a
bad situation here if they successfully created a nesting container but
can't get info.  Is there backwards compatibility we're trying to
maintain with this?

> +
> +	size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
> +	       iommu->nesting_info->argsz;
> +
> +	nesting_cap.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_NESTING;
> +	nesting_cap.header.version = 1;
> +
> +	memcpy(&nesting_cap.info, iommu->nesting_info,
> +	       iommu->nesting_info->argsz);
> +
> +	return vfio_info_add_capability(caps, &nesting_cap.header, size);
> +}
> +
>  static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
>  				     unsigned long arg)
>  {
> @@ -2644,6 +2710,8 @@ static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
>  	if (!ret)
>  		ret = vfio_iommu_iova_build_caps(iommu, &caps);
>  
> +	ret = vfio_iommu_add_nesting_cap(iommu, &caps);

Why don't we follow either the naming scheme or the error handling
scheme of the previous caps?  Seems like this should be:

if (!ret)
	ret = vfio_iommu_nesting_build_caps(...);

Thanks,

Alex


> +
>  	mutex_unlock(&iommu->lock);
>  
>  	if (ret)
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9204705..ff40f9e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/ioctl.h>
> +#include <linux/iommu.h>
>  
>  #define VFIO_API_VERSION	0
>  
> @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
>  	__u64	max_dirty_bitmap_size;		/* in bytes */
>  };
>  
> +/*
> + * The nesting capability allows to report the related capability
> + * and info for nesting iommu type.
> + *
> + * The structures below define version 1 of this capability.
> + *
> + * Nested capabilities should be checked by the userspace after
> + * setting VFIO_TYPE1_NESTING_IOMMU.
> + *
> + * @info: the nesting info provided by IOMMU driver.
> + */
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  3
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> +	struct	vfio_info_cap_header header;
> +	struct	iommu_nesting_info info;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>  
>  /**


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 04/16] vfio: Add PASID allocation/free support
  2020-09-10 10:45 ` [PATCH v7 04/16] vfio: Add PASID allocation/free support Liu Yi L
@ 2020-09-11 20:54   ` Alex Williamson
  2020-09-15  4:03     ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-11 20:54 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, jasowang, hao.wu,
	stefanha, iommu, kvm

On Thu, 10 Sep 2020 03:45:21 -0700
Liu Yi L <yi.l.liu@intel.com> wrote:

> Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
> multiple process virtual address spaces with the device for simplified
> programming model. PASID is used to tag an virtual address space in DMA
> requests and to identify the related translation structure in IOMMU. When
> a PASID-capable device is assigned to a VM, we want the same capability
> of using PASID to tag guest process virtual address spaces to achieve
> virtual SVA (vSVA).
> 
> PASID management for guest is vendor specific. Some vendors (e.g. Intel
> VT-d) requires system-wide managed PASIDs across all devices, regardless
> of whether a device is used by host or assigned to guest. Other vendors
> (e.g. ARM SMMU) may allow PASIDs managed per-device thus could be fully
> delegated to the guest for assigned devices.
> 
> For system-wide managed PASIDs, this patch introduces a vfio module to
> handle explicit PASID alloc/free requests from guest. Allocated PASIDs
> are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
> object is introduced to track mm_struct. Multiple VFIO containers within
> a process share the same vfio_mm object.
> 
> A quota mechanism is provided to prevent malicious user from exhausting
> available PASIDs. Currently the quota is a global parameter applied to
> all VFIO devices. In the future per-device quota might be supported too.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
> v6 -> v7:
> *) remove "#include <linux/eventfd.h>" and add r-b from Eric Auger.
> 
> v5 -> v6:
> *) address comments from Eric. Add vfio_unlink_pasid() to be consistent
>    with vfio_unlink_dma(). Add a comment in vfio_pasid_exit().
> 
> v4 -> v5:
> *) address comments from Eric Auger.
> *) address the comments from Alex on the pasid free range support. Added
>    per vfio_mm pasid r-b tree.
>    https://lore.kernel.org/kvm/20200709082751.320742ab@x1.home/
> 
> v3 -> v4:
> *) fix lock leam in vfio_mm_get_from_task()
> *) drop pasid_quota field in struct vfio_mm
> *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when !CONFIG_VFIO_PASID
> 
> v1 -> v2:
> *) added in v2, split from the pasid alloc/free support of v1
> ---
>  drivers/vfio/Kconfig      |   5 +
>  drivers/vfio/Makefile     |   1 +
>  drivers/vfio/vfio_pasid.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h      |  28 ++++++
>  4 files changed, 281 insertions(+)
>  create mode 100644 drivers/vfio/vfio_pasid.c
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index fd17db9..3d8a108 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -19,6 +19,11 @@ config VFIO_VIRQFD
>  	depends on VFIO && EVENTFD
>  	default n
>  
> +config VFIO_PASID
> +	tristate
> +	depends on IOASID && VFIO
> +	default n
> +
>  menuconfig VFIO
>  	tristate "VFIO Non-Privileged userspace driver framework"
>  	depends on IOMMU_API
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index de67c47..bb836a3 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
>  
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
> diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> new file mode 100644
> index 0000000..44ecdd5
> --- /dev/null
> +++ b/drivers/vfio/vfio_pasid.c
> @@ -0,0 +1,247 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Intel Corporation.
> + *     Author: Liu Yi L <yi.l.liu@intel.com>
> + *
> + */
> +
> +#include <linux/vfio.h>
> +#include <linux/file.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/sched/mm.h>
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> +#define DRIVER_DESC     "PASID management for VFIO bus drivers"
> +
> +#define VFIO_DEFAULT_PASID_QUOTA	1000

I'm not sure we really need a macro to define this since it's only used
once, but a comment discussing the basis for this default value would be
useful.  Also, since Matthew Rosato is finding it necessary to expose
the available DMA mapping counter to userspace, is this also a
limitation that userspace might be interested in knowing such that we
should plumb it through an IOMMU info capability?

> +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +module_param_named(pasid_quota, pasid_quota, uint, 0444);
> +MODULE_PARM_DESC(pasid_quota,
> +		 "Set the quota for max number of PASIDs that an application is allowed to request (default 1000)");
> +
> +struct vfio_mm_token {
> +	unsigned long long val;
> +};
> +
> +struct vfio_mm {
> +	struct kref		kref;
> +	struct ioasid_set	*ioasid_set;
> +	struct mutex		pasid_lock;
> +	struct rb_root		pasid_list;
> +	struct list_head	next;
> +	struct vfio_mm_token	token;
> +};
> +
> +static struct mutex		vfio_mm_lock;
> +static struct list_head		vfio_mm_list;
> +
> +struct vfio_pasid {
> +	struct rb_node		node;
> +	ioasid_t		pasid;
> +};
> +
> +static void vfio_remove_all_pasids(struct vfio_mm *vmm);
> +
> +/* called with vfio.vfio_mm_lock held */


s/vfio.//


> +static void vfio_mm_release(struct kref *kref)
> +{
> +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> +
> +	list_del(&vmm->next);
> +	mutex_unlock(&vfio_mm_lock);
> +	vfio_remove_all_pasids(vmm);
> +	ioasid_set_put(vmm->ioasid_set);//FIXME: should vfio_pasid get ioasid_set after allocation?


Is the question whether each pasid should hold a reference to the set?
That really seems like a question internal to the ioasid_alloc/free,
but this FIXME needs to be resolved.


> +	kfree(vmm);
> +}
> +
> +void vfio_mm_put(struct vfio_mm *vmm)
> +{
> +	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock);
> +}
> +
> +static void vfio_mm_get(struct vfio_mm *vmm)
> +{
> +	kref_get(&vmm->kref);
> +}
> +
> +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> +{
> +	struct mm_struct *mm = get_task_mm(task);
> +	struct vfio_mm *vmm;
> +	unsigned long long val = (unsigned long long)mm;
> +	int ret;
> +
> +	mutex_lock(&vfio_mm_lock);
> +	/* Search existing vfio_mm with current mm pointer */
> +	list_for_each_entry(vmm, &vfio_mm_list, next) {
> +		if (vmm->token.val == val) {
> +			vfio_mm_get(vmm);
> +			goto out;
> +		}
> +	}
> +
> +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> +	if (!vmm) {
> +		vmm = ERR_PTR(-ENOMEM);
> +		goto out;
> +	}
> +
> +	/*
> +	 * IOASID core provides a 'IOASID set' concept to track all
> +	 * PASIDs associated with a token. Here we use mm_struct as
> +	 * the token and create a IOASID set per mm_struct. All the
> +	 * containers of the process share the same IOASID set.
> +	 */
> +	vmm->ioasid_set = ioasid_alloc_set(mm, pasid_quota, IOASID_SET_TYPE_MM);
> +	if (IS_ERR(vmm->ioasid_set)) {
> +		ret = PTR_ERR(vmm->ioasid_set);
> +		kfree(vmm);
> +		vmm = ERR_PTR(ret);
> +		goto out;

This would be a little less convoluted if we had a separate variable to
store ioasid_set so that we could free vmm without stashing the error
in a temporary variable.  Or at least make the stash more obvious by
defining the stash variable as something like "tmp" within the scope
of this branch.

> +	}
> +
> +	kref_init(&vmm->kref);
> +	vmm->token.val = val;
> +	mutex_init(&vmm->pasid_lock);
> +	vmm->pasid_list = RB_ROOT;
> +
> +	list_add(&vmm->next, &vfio_mm_list);
> +out:
> +	mutex_unlock(&vfio_mm_lock);
> +	mmput(mm);
> +	return vmm;
> +}
> +
> +/*
> + * Find PASID within @min and @max
> + */
> +static struct vfio_pasid *vfio_find_pasid(struct vfio_mm *vmm,
> +					  ioasid_t min, ioasid_t max)
> +{
> +	struct rb_node *node = vmm->pasid_list.rb_node;
> +
> +	while (node) {
> +		struct vfio_pasid *vid = rb_entry(node,
> +						struct vfio_pasid, node);
> +
> +		if (max < vid->pasid)
> +			node = node->rb_left;
> +		else if (min > vid->pasid)
> +			node = node->rb_right;
> +		else
> +			return vid;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void vfio_link_pasid(struct vfio_mm *vmm, struct vfio_pasid *new)
> +{
> +	struct rb_node **link = &vmm->pasid_list.rb_node, *parent = NULL;
> +	struct vfio_pasid *vid;
> +
> +	while (*link) {
> +		parent = *link;
> +		vid = rb_entry(parent, struct vfio_pasid, node);
> +
> +		if (new->pasid <= vid->pasid)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &vmm->pasid_list);
> +}
> +
> +static void vfio_unlink_pasid(struct vfio_mm *vmm, struct vfio_pasid *old)
> +{
> +	rb_erase(&old->node, &vmm->pasid_list);
> +}
> +
> +static void vfio_remove_pasid(struct vfio_mm *vmm, struct vfio_pasid *vid)
> +{
> +	vfio_unlink_pasid(vmm, vid);
> +	ioasid_free(vmm->ioasid_set, vid->pasid);
> +	kfree(vid);
> +}
> +
> +static void vfio_remove_all_pasids(struct vfio_mm *vmm)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +	while ((node = rb_first(&vmm->pasid_list)))
> +		vfio_remove_pasid(vmm, rb_entry(node, struct vfio_pasid, node));
> +	mutex_unlock(&vmm->pasid_lock);
> +}
> +
> +int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)

I might have asked before, but why doesn't this return an ioasid_t and
require ioasid_t args?  Our free function below uses an ioasid_t
range, seems rather inconsistent.  We can use a BUILD_BUG_ON if we need
to test that an ioasid_t fits within our uapi.

> +{
> +	ioasid_t pasid;
> +	struct vfio_pasid *vid;
> +
> +	pasid = ioasid_alloc(vmm->ioasid_set, min, max, NULL);
> +	if (pasid == INVALID_IOASID)
> +		return -ENOSPC;
> +
> +	vid = kzalloc(sizeof(*vid), GFP_KERNEL);
> +	if (!vid) {
> +		ioasid_free(vmm->ioasid_set, pasid);
> +		return -ENOMEM;
> +	}
> +
> +	vid->pasid = pasid;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +	vfio_link_pasid(vmm, vid);
> +	mutex_unlock(&vmm->pasid_lock);
> +
> +	return pasid;
> +}
> +
> +void vfio_pasid_free_range(struct vfio_mm *vmm,
> +			   ioasid_t min, ioasid_t max)
> +{
> +	struct vfio_pasid *vid = NULL;
> +
> +	/*
> +	 * IOASID core will notify PASID users (e.g. IOMMU driver) to
> +	 * teardown necessary structures depending on the to-be-freed
> +	 * PASID.
> +	 */
> +	mutex_lock(&vmm->pasid_lock);
> +	while ((vid = vfio_find_pasid(vmm, min, max)) != NULL)

!= NULL is not necessary and isn't consistent with the same time of
test in the above rb_first() loop.

> +		vfio_remove_pasid(vmm, vid);
> +	mutex_unlock(&vmm->pasid_lock);
> +}
> +
> +static int __init vfio_pasid_init(void)
> +{
> +	mutex_init(&vfio_mm_lock);
> +	INIT_LIST_HEAD(&vfio_mm_list);
> +	return 0;
> +}
> +
> +static void __exit vfio_pasid_exit(void)
> +{
> +	/*
> +	 * VFIO_PASID is supposed to be referenced by VFIO_IOMMU_TYPE1
> +	 * and may be other module. once vfio_pasid_exit() is triggered,
> +	 * that means its user (e.g. VFIO_IOMMU_TYPE1) has been removed.
> +	 * All the vfio_mm instances should have been released. If not,
> +	 * means there is vfio_mm leak, should be a bug of user module.
> +	 * So just warn here.
> +	 */
> +	WARN_ON(!list_empty(&vfio_mm_list));

Do we need to be using try_module_get/module_put to enforce that we
cannot be removed while in use or does that already work correctly via
the function references and this is just paranoia?  If we do exit, I'm
not sure what good it does to keep the remaining list entries.  Thanks,

Alex

> +}
> +
> +module_init(vfio_pasid_init);
> +module_exit(vfio_pasid_exit);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 38d3c6a..31472a9 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -97,6 +97,34 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
>  extern void vfio_unregister_iommu_driver(
>  				const struct vfio_iommu_driver_ops *ops);
>  
> +struct vfio_mm;
> +#if IS_ENABLED(CONFIG_VFIO_PASID)
> +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> +extern void vfio_mm_put(struct vfio_mm *vmm);
> +extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern void vfio_pasid_free_range(struct vfio_mm *vmm,
> +				  ioasid_t min, ioasid_t max);
> +#else
> +static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> +{
> +	return ERR_PTR(-ENOTTY);
> +}
> +
> +static inline void vfio_mm_put(struct vfio_mm *vmm)
> +{
> +}
> +
> +static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +{
> +	return -ENOTTY;
> +}
> +
> +static inline void vfio_pasid_free_range(struct vfio_mm *vmm,
> +					  ioasid_t min, ioasid_t max)
> +{
> +}
> +#endif /* CONFIG_VFIO_PASID */
> +
>  /*
>   * External user API
>   */


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 07/16] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free)
  2020-09-10 10:45 ` [PATCH v7 07/16] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free) Liu Yi L
@ 2020-09-11 21:38   ` Alex Williamson
  0 siblings, 0 replies; 81+ messages in thread
From: Alex Williamson @ 2020-09-11 21:38 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, jasowang, hao.wu,
	stefanha, iommu, kvm

On Thu, 10 Sep 2020 03:45:24 -0700
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch allows userspace to request PASID allocation/free, e.g. when
> serving the request from the guest.
> 
> PASIDs that are not freed by userspace are automatically freed when the
> IOASID set is destroyed when process exits.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
> v6 -> v7:
> *) current VFIO returns allocated pasid via signed int, thus VFIO UAPI
>    can only support 31 bits pasid. If user space gives min,max which is
>    wider than 31 bits, should fail the allocation or free request.
> 
> v5 -> v6:
> *) address comments from Eric against v5. remove the alloc/free helper.
> 
> v4 -> v5:
> *) address comments from Eric Auger.
> *) the comments for the PASID_FREE request is addressed in patch 5/15 of
>    this series.
> 
> v3 -> v4:
> *) address comments from v3, except the below comment against the range
>    of PASID_FREE request. needs more help on it.
>     "> +if (req.range.min > req.range.max)  
> 
>      Is it exploitable that a user can spin the kernel for a long time in
>      the case of a free by calling this with [0, MAX_UINT] regardless of
>      their actual allocations?"
>     https://lore.kernel.org/linux-iommu/20200702151832.048b44d1@x1.home/
> 
> v1 -> v2:
> *) move the vfio_mm related code to be a seprate module
> *) use a single structure for alloc/free, could support a range of PASIDs
> *) fetch vfio_mm at group_attach time instead of at iommu driver open time
> ---
>  drivers/vfio/Kconfig            |  1 +
>  drivers/vfio/vfio_iommu_type1.c | 76 +++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_pasid.c       | 10 ++++++
>  include/linux/vfio.h            |  6 ++++
>  include/uapi/linux/vfio.h       | 43 +++++++++++++++++++++++
>  5 files changed, 136 insertions(+)
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 3d8a108..95d90c6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -2,6 +2,7 @@
>  config VFIO_IOMMU_TYPE1
>  	tristate
>  	depends on VFIO
> +	select VFIO_PASID if (X86)
>  	default n
>  
>  config VFIO_IOMMU_SPAPR_TCE
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3c0048b..bd4b668 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -76,6 +76,7 @@ struct vfio_iommu {
>  	bool				dirty_page_tracking;
>  	bool				pinned_page_dirty_scope;
>  	struct iommu_nesting_info	*nesting_info;
> +	struct vfio_mm			*vmm;
>  };
>  
>  struct vfio_domain {
> @@ -2000,6 +2001,11 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu,
>  
>  static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
>  {
> +	if (iommu->vmm) {
> +		vfio_mm_put(iommu->vmm);
> +		iommu->vmm = NULL;
> +	}
> +
>  	kfree(iommu->nesting_info);
>  	iommu->nesting_info = NULL;
>  }
> @@ -2127,6 +2133,26 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					    iommu->nesting_info);
>  		if (ret)
>  			goto out_detach;
> +
> +		if (iommu->nesting_info->features &
> +					IOMMU_NESTING_FEAT_SYSWIDE_PASID) {
> +			struct vfio_mm *vmm;
> +			struct ioasid_set *set;
> +
> +			vmm = vfio_mm_get_from_task(current);
> +			if (IS_ERR(vmm)) {
> +				ret = PTR_ERR(vmm);
> +				goto out_detach;
> +			}
> +			iommu->vmm = vmm;
> +
> +			set = vfio_mm_ioasid_set(vmm);
> +			ret = iommu_domain_set_attr(domain->domain,
> +						    DOMAIN_ATTR_IOASID_SET,
> +						    set);
> +			if (ret)
> +				goto out_detach;
> +		}
>  	}
>  
>  	/* Get aperture info */
> @@ -2908,6 +2934,54 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
>  	return -EINVAL;
>  }
>  
> +static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
> +					  unsigned long arg)
> +{
> +	struct vfio_iommu_type1_pasid_request req;
> +	unsigned long minsz;
> +	int ret;
> +
> +	minsz = offsetofend(struct vfio_iommu_type1_pasid_request, range);
> +
> +	if (copy_from_user(&req, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (req.argsz < minsz || (req.flags & ~VFIO_PASID_REQUEST_MASK))
> +		return -EINVAL;
> +
> +	/*
> +	 * Current VFIO_IOMMU_PASID_REQUEST only supports at most
> +	 * 31 bits PASID. The min,max value from userspace should
> +	 * not exceed 31 bits.

Please describe the source of this restriction.  I think it's due to
using the ioctl return value to return the PASID, thus excluding the
negative values, but aren't we actually restricted to pasid_bits
exposed in the nesting_info?  If this is just a sanity test for the API
then why are we defining VFIO_IOMMU_PASID_BITS in the uapi header,
which causes conflicting information to the user... which do they
honor?  Should we instead verify that pasid_bits matches our API scheme
when configuring the nested domain and then let the ioasid allocator
reject requests outside of the range?

> +	 */
> +	if (req.range.min > req.range.max ||
> +	    req.range.min > (1 << VFIO_IOMMU_PASID_BITS) ||
> +	    req.range.max > (1 << VFIO_IOMMU_PASID_BITS))

Off by one, >= for the bit test.

> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!iommu->vmm) {
> +		mutex_unlock(&iommu->lock);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> +	case VFIO_IOMMU_FLAG_ALLOC_PASID:
> +		ret = vfio_pasid_alloc(iommu->vmm, req.range.min,
> +				       req.range.max);
> +		break;
> +	case VFIO_IOMMU_FLAG_FREE_PASID:
> +		vfio_pasid_free_range(iommu->vmm, req.range.min,
> +				      req.range.max);
> +		ret = 0;

Set the initial value when it's declared?

> +		break;
> +	default:
> +		ret = -EINVAL;
> +	}
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2924,6 +2998,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		return vfio_iommu_type1_unmap_dma(iommu, arg);
>  	case VFIO_IOMMU_DIRTY_PAGES:
>  		return vfio_iommu_type1_dirty_pages(iommu, arg);
> +	case VFIO_IOMMU_PASID_REQUEST:
> +		return vfio_iommu_type1_pasid_request(iommu, arg);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> index 44ecdd5..0ec4660 100644
> --- a/drivers/vfio/vfio_pasid.c
> +++ b/drivers/vfio/vfio_pasid.c
> @@ -60,6 +60,7 @@ void vfio_mm_put(struct vfio_mm *vmm)
>  {
>  	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock);
>  }
> +EXPORT_SYMBOL_GPL(vfio_mm_put);
>  
>  static void vfio_mm_get(struct vfio_mm *vmm)
>  {
> @@ -113,6 +114,13 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
>  	mmput(mm);
>  	return vmm;
>  }
> +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> +
> +struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm)
> +{
> +	return vmm->ioasid_set;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_ioasid_set);
>  
>  /*
>   * Find PASID within @min and @max
> @@ -201,6 +209,7 @@ int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
>  
>  	return pasid;
>  }
> +EXPORT_SYMBOL_GPL(vfio_pasid_alloc);
>  
>  void vfio_pasid_free_range(struct vfio_mm *vmm,
>  			   ioasid_t min, ioasid_t max)
> @@ -217,6 +226,7 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
>  		vfio_remove_pasid(vmm, vid);
>  	mutex_unlock(&vmm->pasid_lock);
>  }
> +EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
>  
>  static int __init vfio_pasid_init(void)
>  {
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 31472a9..5c3d7a8 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -101,6 +101,7 @@ struct vfio_mm;
>  #if IS_ENABLED(CONFIG_VFIO_PASID)
>  extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
>  extern void vfio_mm_put(struct vfio_mm *vmm);
> +extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm);
>  extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
>  extern void vfio_pasid_free_range(struct vfio_mm *vmm,
>  				  ioasid_t min, ioasid_t max);
> @@ -114,6 +115,11 @@ static inline void vfio_mm_put(struct vfio_mm *vmm)
>  {
>  }
>  
> +static inline struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm)
> +{
> +	return -ENOTTY;
> +}
> +
>  static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
>  {
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ff40f9e..a4bc42e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1172,6 +1172,49 @@ struct vfio_iommu_type1_dirty_bitmap_get {
>  
>  #define VFIO_IOMMU_DIRTY_PAGES             _IO(VFIO_TYPE, VFIO_BASE + 17)
>  
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
> + *				struct vfio_iommu_type1_pasid_request)
> + *
> + * PASID (Processor Address Space ID) is a PCIe concept for tagging
> + * address spaces in DMA requests. When system-wide PASID allocation
> + * is required by the underlying iommu driver (e.g. Intel VT-d), this
> + * provides an interface for userspace to request pasid alloc/free
> + * for its assigned devices. Userspace should check the availability
> + * of this API by checking VFIO_IOMMU_TYPE1_INFO_CAP_NESTING through
> + * VFIO_IOMMU_GET_INFO.
> + *
> + * @flags=VFIO_IOMMU_FLAG_ALLOC_PASID, allocate a single PASID within @range.
> + * @flags=VFIO_IOMMU_FLAG_FREE_PASID, free the PASIDs within @range.
> + * @range is [min, max], which means both @min and @max are inclusive.
> + * ALLOC_PASID and FREE_PASID are mutually exclusive.
> + *
> + * Current interface supports at most 31 bits PASID bits as returning
> + * PASID allocation result via signed int. PCIe spec defines 20 bits
> + * for PASID width, so 31 bits is enough. As a result user space should
> + * provide min, max no more than 31 bits.

Perhaps this is the description I was looking for, but this still
conflicts with what I think the user is supposed to do, which is to
provide a range within nesting_info.pasid_bits.  These seem like
implementation details, not uapi.  Thanks,

Alex

> + * returns: allocated PASID value on success, -errno on failure for
> + *	    ALLOC_PASID;
> + *	    0 for FREE_PASID operation;
> + */
> +struct vfio_iommu_type1_pasid_request {
> +	__u32	argsz;
> +#define VFIO_IOMMU_FLAG_ALLOC_PASID	(1 << 0)
> +#define VFIO_IOMMU_FLAG_FREE_PASID	(1 << 1)
> +	__u32	flags;
> +	struct {
> +		__u32	min;
> +		__u32	max;
> +	} range;
> +};
> +
> +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_FLAG_ALLOC_PASID | \
> +					 VFIO_IOMMU_FLAG_FREE_PASID)
> +
> +#define VFIO_IOMMU_PASID_BITS		31
> +
> +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 18)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID
  2020-09-10 10:45 ` [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID Liu Yi L
@ 2020-09-11 22:03   ` Alex Williamson
  2020-09-12  6:02     ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-11 22:03 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, jasowang, hao.wu,
	stefanha, iommu, kvm

On Thu, 10 Sep 2020 03:45:27 -0700
Liu Yi L <yi.l.liu@intel.com> wrote:

> Nesting translation allows two-levels/stages page tables, with 1st level
> for guest translations (e.g. GVA->GPA), 2nd level for host translations
> (e.g. GPA->HPA). This patch adds interface for binding guest page tables
> to a PASID. This PASID must have been allocated by the userspace before
> the binding request.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
> v6 -> v7:
> *) introduced @user in struct domain_capsule to simplify the code per Eric's
>    suggestion.
> *) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace.
> *) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn().
> 
> v5 -> v6:
> *) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule().
>    per comment from Eric.
> *) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
>    linux/iommu.h for userspace operation and in-kernel operation.
> 
> v3 -> v4:
> *) address comments from Alex on v3
> 
> v2 -> v3:
> *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
>    https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-jacob.jun.pan@linux.intel.com/
> 
> v1 -> v2:
> *) rename subject from "vfio/type1: Bind guest page tables to host"
> *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support bind/
>    unbind guet page table
> *) replaced vfio_iommu_for_each_dev() with a group level loop since this
>    series enforces one group per container w/ nesting type as start.
> *) rename vfio_bind/unbind_gpasid_fn() to vfio_dev_bind/unbind_gpasid_fn()
> *) vfio_dev_unbind_gpasid() always successful
> *) use vfio_mm->pasid_lock to avoid race between PASID free and page table
>    bind/unbind
> ---
>  drivers/vfio/vfio_iommu_type1.c | 163 ++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_pasid.c       |  26 +++++++
>  include/linux/vfio.h            |  20 +++++
>  include/uapi/linux/vfio.h       |  36 +++++++++
>  4 files changed, 245 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index bd4b668..11f1156 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -149,6 +149,39 @@ struct vfio_regions {
>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>  
> +struct domain_capsule {
> +	struct vfio_group	*group;
> +	struct iommu_domain	*domain;
> +	/* set if @data contains a user pointer*/
> +	bool			user;
> +	void			*data;
> +};

Put the hole in the structure at the end, but I suspect we might lose
the user field when the internal api drops the unnecessary structure
for unbind anyway.

> +
> +/* iommu->lock must be held */
> +static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu,
> +					       struct domain_capsule *dc)
> +{
> +	struct vfio_domain *domain = NULL;
> +	struct vfio_group *group = NULL;

Unnecessary initialization.

> +
> +	if (!iommu->nesting_info)
> +		return -EINVAL;
> +
> +	/*
> +	 * Only support singleton container with nesting type. If
> +	 * nesting_info is non-NULL, the container is non-empty.
> +	 * Also domain is non-empty.
> +	 */
> +	domain = list_first_entry(&iommu->domain_list,
> +				  struct vfio_domain, next);
> +	group = list_first_entry(&domain->group_list,
> +				 struct vfio_group, next);
> +	dc->group = group;
> +	dc->domain = domain->domain;
> +	dc->user = true;
> +	return 0;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
>  
>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> @@ -2405,6 +2438,49 @@ static int vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	unsigned long arg = *(unsigned long *)dc->data;
> +
> +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> +					  (void __user *)arg);
> +}
> +
> +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +
> +	if (dc->user) {
> +		unsigned long arg = *(unsigned long *)dc->data;
> +
> +		iommu_uapi_sva_unbind_gpasid(dc->domain,
> +					     dev, (void __user *)arg);
> +	} else {
> +		struct iommu_gpasid_bind_data *unbind_data =
> +				(struct iommu_gpasid_bind_data *)dc->data;
> +
> +		iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
> +	}
> +	return 0;
> +}
> +
> +static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data unbind_data;
> +
> +	unbind_data.argsz = sizeof(struct iommu_gpasid_bind_data);
> +	unbind_data.flags = 0;
> +	unbind_data.hpasid = pasid;


As in thread with Jacob, this all seems a little excessive for an
internal api callback that requires one arg.


> +
> +	dc->user = false;
> +	dc->data = &unbind_data;
> +
> +	iommu_group_for_each_dev(dc->group->iommu_group,
> +				 dc, vfio_dev_unbind_gpasid_fn);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -2448,6 +2524,20 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  		if (!group)
>  			continue;
>  
> +		if (iommu->vmm && (iommu->nesting_info->features &
> +					IOMMU_NESTING_FEAT_BIND_PGTBL)) {
> +			struct domain_capsule dc = { .group = group,
> +						     .domain = domain->domain,
> +						     .data = NULL };
> +
> +			/*
> +			 * Unbind page tables bound with system wide PASIDs
> +			 * which are allocated to userspace.
> +			 */
> +			vfio_mm_for_each_pasid(iommu->vmm, &dc,
> +					       vfio_group_unbind_gpasid_fn);
> +		}
> +
>  		vfio_iommu_detach_group(domain, group);
>  		update_dirty_scope = !group->pinned_page_dirty_scope;
>  		list_del(&group->next);
> @@ -2982,6 +3072,77 @@ static int vfio_iommu_type1_pasid_request(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu,
> +				       bool is_bind, unsigned long arg)
> +{
> +	struct domain_capsule dc = { .data = &arg };
> +	struct iommu_nesting_info *info;
> +	int ret;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	info = iommu->nesting_info;
> +	if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) {
> +		ret = -EOPNOTSUPP;
> +		goto out_unlock;
> +	}
> +
> +	if (!iommu->vmm) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_prepare_nesting_domain_capsule(iommu, &dc);
> +	if (ret)
> +		goto out_unlock;
> +
> +	/* Avoid race with other containers within the same process */
> +	vfio_mm_pasid_lock(iommu->vmm);
> +
> +	if (is_bind)
> +		ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc,
> +					       vfio_dev_bind_gpasid_fn);
> +	if (ret || !is_bind)
> +		iommu_group_for_each_dev(dc.group->iommu_group,
> +					 &dc, vfio_dev_unbind_gpasid_fn);
> +
> +	vfio_mm_pasid_unlock(iommu->vmm);
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
> +					unsigned long arg)
> +{
> +	struct vfio_iommu_type1_nesting_op hdr;
> +	unsigned int minsz;
> +	int ret;
> +
> +	minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags);
> +
> +	if (copy_from_user(&hdr, (void __user *)arg, minsz))
> +		return -EFAULT;
> +
> +	if (hdr.argsz < minsz ||
> +	    hdr.flags & ~VFIO_NESTING_OP_MASK ||
> +	    (hdr.flags & VFIO_NESTING_OP_MASK) >= VFIO_IOMMU_NESTING_OP_NUM)


Isn't this redundant to the default switch case?


> +		return -EINVAL;
> +
> +	switch (hdr.flags & VFIO_NESTING_OP_MASK) {
> +	case VFIO_IOMMU_NESTING_OP_BIND_PGTBL:
> +		ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz);
> +		break;
> +	case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
> +		ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +	}
> +
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -3000,6 +3161,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		return vfio_iommu_type1_dirty_pages(iommu, arg);
>  	case VFIO_IOMMU_PASID_REQUEST:
>  		return vfio_iommu_type1_pasid_request(iommu, arg);
> +	case VFIO_IOMMU_NESTING_OP:
> +		return vfio_iommu_type1_nesting_op(iommu, arg);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> index 0ec4660..9e2e4b0 100644
> --- a/drivers/vfio/vfio_pasid.c
> +++ b/drivers/vfio/vfio_pasid.c
> @@ -220,6 +220,8 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
>  	 * IOASID core will notify PASID users (e.g. IOMMU driver) to
>  	 * teardown necessary structures depending on the to-be-freed
>  	 * PASID.
> +	 * Hold pasid_lock also avoids race with PASID usages like bind/
> +	 * unbind page tables to requested PASID.
>  	 */
>  	mutex_lock(&vmm->pasid_lock);
>  	while ((vid = vfio_find_pasid(vmm, min, max)) != NULL)
> @@ -228,6 +230,30 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
>  
> +int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
> +			   void (*fn)(ioasid_t id, void *data))
> +{
> +	int ret;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +	ret = ioasid_set_for_each_ioasid(vmm->ioasid_set, fn, data);
> +	mutex_unlock(&vmm->pasid_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_for_each_pasid);
> +
> +void vfio_mm_pasid_lock(struct vfio_mm *vmm)
> +{
> +	mutex_lock(&vmm->pasid_lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_lock);
> +
> +void vfio_mm_pasid_unlock(struct vfio_mm *vmm)
> +{
> +	mutex_unlock(&vmm->pasid_lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_unlock);
> +
>  static int __init vfio_pasid_init(void)
>  {
>  	mutex_init(&vfio_mm_lock);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 5c3d7a8..6a999c3 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -105,6 +105,11 @@ extern struct ioasid_set *vfio_mm_ioasid_set(struct vfio_mm *vmm);
>  extern int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);
>  extern void vfio_pasid_free_range(struct vfio_mm *vmm,
>  				  ioasid_t min, ioasid_t max);
> +extern int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
> +				  void (*fn)(ioasid_t id, void *data));
> +extern void vfio_mm_pasid_lock(struct vfio_mm *vmm);
> +extern void vfio_mm_pasid_unlock(struct vfio_mm *vmm);
> +
>  #else
>  static inline struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
>  {
> @@ -129,6 +134,21 @@ static inline void vfio_pasid_free_range(struct vfio_mm *vmm,
>  					  ioasid_t min, ioasid_t max)
>  {
>  }
> +
> +static inline int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
> +					 void (*fn)(ioasid_t id, void *data))
> +{
> +	return -ENOTTY;
> +}
> +
> +static inline void vfio_mm_pasid_lock(struct vfio_mm *vmm)
> +{
> +}
> +
> +static inline void vfio_mm_pasid_unlock(struct vfio_mm *vmm)
> +{
> +}
> +
>  #endif /* CONFIG_VFIO_PASID */
>  
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index a4bc42e..a99bd71 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1215,6 +1215,42 @@ struct vfio_iommu_type1_pasid_request {
>  
>  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 18)
>  
> +/**
> + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 19,
> + *				struct vfio_iommu_type1_nesting_op)
> + *
> + * This interface allows userspace to utilize the nesting IOMMU
> + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING
> + * cap through VFIO_IOMMU_GET_INFO. For platforms which require
> + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID
> + * _REQUEST.
> + *
> + * @data[] types defined for each op:
> + * +=================+===============================================+
> + * | NESTING OP      |      @data[]                                  |
> + * +=================+===============================================+
> + * | BIND_PGTBL      |      struct iommu_gpasid_bind_data            |
> + * +-----------------+-----------------------------------------------+
> + * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
> + * +-----------------+-----------------------------------------------+
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +struct vfio_iommu_type1_nesting_op {
> +	__u32	argsz;
> +	__u32	flags;
> +#define VFIO_NESTING_OP_MASK	(0xffff) /* lower 16-bits for op */
> +	__u8	data[];
> +};
> +
> +enum {
> +	VFIO_IOMMU_NESTING_OP_BIND_PGTBL,
> +	VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL,
> +	VFIO_IOMMU_NESTING_OP_NUM,
> +};

"VFIO_IOMMU_NESTING_NUM_OPS" would be more consistent with the vfio
uapi.  Thanks,

Alex

> +
> +#define VFIO_IOMMU_NESTING_OP		_IO(VFIO_TYPE, VFIO_BASE + 19)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest
  2020-09-10 10:45 ` [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest Liu Yi L
@ 2020-09-11 22:13   ` Alex Williamson
  2020-09-12  7:17     ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-11 22:13 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, jasowang, hao.wu,
	stefanha, iommu, kvm

On Thu, 10 Sep 2020 03:45:30 -0700
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch exposes PCIe PASID capability to guest for assigned devices.
> Existing vfio_pci driver hides it from guest by setting the capability
> length as 0 in pci_ext_cap_length[].

This exposes the PASID capability, but it's still read-only, so this
largely just helps userspace know where to emulate the capability,
right?  Thanks,

Alex
 
> And this patch only exposes PASID capability for devices which has PCIe
> PASID extended struture in its configuration space. VFs will not expose
> the PASID capability as they do not implement the PASID extended structure
> in their config space. It is a TODO in future. Related discussion can be
> found in below link:
> 
> https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
> v5 -> v6:
> *) add review-by from Eric Auger.
> 
> v1 -> v2:
> *) added in v2, but it was sent in a separate patchseries before
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index d98843f..07ff2e6 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX + 1] = {
>  	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
>  	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
>  	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
> -	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
> +	[PCI_EXT_CAP_ID_PASID]	=	PCI_EXT_CAP_PASID_SIZEOF,
>  };
>  
>  /*


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID
  2020-09-11 22:03   ` Alex Williamson
@ 2020-09-12  6:02     ` Liu, Yi L
  0 siblings, 0 replies; 81+ messages in thread
From: Liu, Yi L @ 2020-09-12  6:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, baolu.lu, joro, Tian, Kevin, jacob.jun.pan, Raj,
	Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, jasowang,
	Wu, Hao, stefanha, iommu, kvm

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, September 12, 2020 6:04 AM
> 
> On Thu, 10 Sep 2020 03:45:27 -0700
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > Nesting translation allows two-levels/stages page tables, with 1st
> > level for guest translations (e.g. GVA->GPA), 2nd level for host
> > translations (e.g. GPA->HPA). This patch adds interface for binding
> > guest page tables to a PASID. This PASID must have been allocated by
> > the userspace before the binding request.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> > v6 -> v7:
> > *) introduced @user in struct domain_capsule to simplify the code per Eric's
> >    suggestion.
> > *) introduced VFIO_IOMMU_NESTING_OP_NUM for sanitizing op from userspace.
> > *) corrected the @argsz value of unbind_data in vfio_group_unbind_gpasid_fn().
> >
> > v5 -> v6:
> > *) dropped vfio_find_nesting_group() and add vfio_get_nesting_domain_capsule().
> >    per comment from Eric.
> > *) use iommu_uapi_sva_bind/unbind_gpasid() and iommu_sva_unbind_gpasid() in
> >    linux/iommu.h for userspace operation and in-kernel operation.
> >
> > v3 -> v4:
> > *) address comments from Alex on v3
> >
> > v2 -> v3:
> > *) use __iommu_sva_unbind_gpasid() for unbind call issued by VFIO
> >
> > https://lore.kernel.org/linux-iommu/1592931837-58223-6-git-send-email-
> > jacob.jun.pan@linux.intel.com/
> >
> > v1 -> v2:
> > *) rename subject from "vfio/type1: Bind guest page tables to host"
> > *) remove VFIO_IOMMU_BIND, introduce VFIO_IOMMU_NESTING_OP to support
> bind/
> >    unbind guet page table
> > *) replaced vfio_iommu_for_each_dev() with a group level loop since this
> >    series enforces one group per container w/ nesting type as start.
> > *) rename vfio_bind/unbind_gpasid_fn() to
> > vfio_dev_bind/unbind_gpasid_fn()
> > *) vfio_dev_unbind_gpasid() always successful
> > *) use vfio_mm->pasid_lock to avoid race between PASID free and page table
> >    bind/unbind
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 163
> ++++++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_pasid.c       |  26 +++++++
> >  include/linux/vfio.h            |  20 +++++
> >  include/uapi/linux/vfio.h       |  36 +++++++++
> >  4 files changed, 245 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index bd4b668..11f1156 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -149,6 +149,39 @@ struct vfio_regions {
> >  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
> >  #define DIRTY_BITMAP_SIZE_MAX
> DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> >
> > +struct domain_capsule {
> > +	struct vfio_group	*group;
> > +	struct iommu_domain	*domain;
> > +	/* set if @data contains a user pointer*/
> > +	bool			user;
> > +	void			*data;
> > +};
> 
> Put the hole in the structure at the end, but I suspect we might lose the user field
> when the internal api drops the unnecessary structure for unbind anyway.

I see. will move @user and its comment to the end of this struct. As it's
used to imply the @data field user pointer or not, I guess it's still useful
to keep it. The difference would be the @data is a pasid not a bind_data
struct.

> > +
> > +/* iommu->lock must be held */
> > +static int vfio_prepare_nesting_domain_capsule(struct vfio_iommu *iommu,
> > +					       struct domain_capsule *dc) {
> > +	struct vfio_domain *domain = NULL;
> > +	struct vfio_group *group = NULL;
> 
> Unnecessary initialization.

will remove them. :-)

> > +
> > +	if (!iommu->nesting_info)
> > +		return -EINVAL;
> > +
> > +	/*
> > +	 * Only support singleton container with nesting type. If
> > +	 * nesting_info is non-NULL, the container is non-empty.
> > +	 * Also domain is non-empty.
> > +	 */
> > +	domain = list_first_entry(&iommu->domain_list,
> > +				  struct vfio_domain, next);
> > +	group = list_first_entry(&domain->group_list,
> > +				 struct vfio_group, next);
> > +	dc->group = group;
> > +	dc->domain = domain->domain;
> > +	dc->user = true;
> > +	return 0;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> >
> >  static struct vfio_group *vfio_iommu_find_iommu_group(struct
> > vfio_iommu *iommu, @@ -2405,6 +2438,49 @@ static int
> vfio_iommu_resv_refresh(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_dev_bind_gpasid_fn(struct device *dev, void *data) {
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +	return iommu_uapi_sva_bind_gpasid(dc->domain, dev,
> > +					  (void __user *)arg);
> > +}
> > +
> > +static int vfio_dev_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +
> > +	if (dc->user) {
> > +		unsigned long arg = *(unsigned long *)dc->data;
> > +
> > +		iommu_uapi_sva_unbind_gpasid(dc->domain,
> > +					     dev, (void __user *)arg);
> > +	} else {
> > +		struct iommu_gpasid_bind_data *unbind_data =
> > +				(struct iommu_gpasid_bind_data *)dc->data;
> > +
> > +		iommu_sva_unbind_gpasid(dc->domain, dev, unbind_data);
> > +	}
> > +	return 0;
> > +}
> > +
> > +static void vfio_group_unbind_gpasid_fn(ioasid_t pasid, void *data) {
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data unbind_data;
> > +
> > +	unbind_data.argsz = sizeof(struct iommu_gpasid_bind_data);
> > +	unbind_data.flags = 0;
> > +	unbind_data.hpasid = pasid;
> 
> 
> As in thread with Jacob, this all seems a little excessive for an internal api callback
> that requires one arg.

yep, Jacob informed me about that change.

> 
> > +
> > +	dc->user = false;
> > +	dc->data = &unbind_data;
> > +
> > +	iommu_group_for_each_dev(dc->group->iommu_group,
> > +				 dc, vfio_dev_unbind_gpasid_fn);
> > +}
> > +
> >  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  					  struct iommu_group *iommu_group)
> { @@ -2448,6 +2524,20 @@
> > static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  		if (!group)
> >  			continue;
> >
> > +		if (iommu->vmm && (iommu->nesting_info->features &
> > +					IOMMU_NESTING_FEAT_BIND_PGTBL)) {
> > +			struct domain_capsule dc = { .group = group,
> > +						     .domain = domain->domain,
> > +						     .data = NULL };
> > +
> > +			/*
> > +			 * Unbind page tables bound with system wide PASIDs
> > +			 * which are allocated to userspace.
> > +			 */
> > +			vfio_mm_for_each_pasid(iommu->vmm, &dc,
> > +					       vfio_group_unbind_gpasid_fn);
> > +		}
> > +
> >  		vfio_iommu_detach_group(domain, group);
> >  		update_dirty_scope = !group->pinned_page_dirty_scope;
> >  		list_del(&group->next);
> > @@ -2982,6 +3072,77 @@ static int vfio_iommu_type1_pasid_request(struct
> vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static long vfio_iommu_handle_pgtbl_op(struct vfio_iommu *iommu,
> > +				       bool is_bind, unsigned long arg) {
> > +	struct domain_capsule dc = { .data = &arg };
> > +	struct iommu_nesting_info *info;
> > +	int ret;
> > +
> > +	mutex_lock(&iommu->lock);
> > +
> > +	info = iommu->nesting_info;
> > +	if (!info || !(info->features & IOMMU_NESTING_FEAT_BIND_PGTBL)) {
> > +		ret = -EOPNOTSUPP;
> > +		goto out_unlock;
> > +	}
> > +
> > +	if (!iommu->vmm) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_prepare_nesting_domain_capsule(iommu, &dc);
> > +	if (ret)
> > +		goto out_unlock;
> > +
> > +	/* Avoid race with other containers within the same process */
> > +	vfio_mm_pasid_lock(iommu->vmm);
> > +
> > +	if (is_bind)
> > +		ret = iommu_group_for_each_dev(dc.group->iommu_group, &dc,
> > +					       vfio_dev_bind_gpasid_fn);
> > +	if (ret || !is_bind)
> > +		iommu_group_for_each_dev(dc.group->iommu_group,
> > +					 &dc, vfio_dev_unbind_gpasid_fn);
> > +
> > +	vfio_mm_pasid_unlock(iommu->vmm);
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_iommu_type1_nesting_op(struct vfio_iommu *iommu,
> > +					unsigned long arg)
> > +{
> > +	struct vfio_iommu_type1_nesting_op hdr;
> > +	unsigned int minsz;
> > +	int ret;
> > +
> > +	minsz = offsetofend(struct vfio_iommu_type1_nesting_op, flags);
> > +
> > +	if (copy_from_user(&hdr, (void __user *)arg, minsz))
> > +		return -EFAULT;
> > +
> > +	if (hdr.argsz < minsz ||
> > +	    hdr.flags & ~VFIO_NESTING_OP_MASK ||
> > +	    (hdr.flags & VFIO_NESTING_OP_MASK) >=
> VFIO_IOMMU_NESTING_OP_NUM)
> 
> 
> Isn't this redundant to the default switch case?

oh, yes. From sanity chek p.o.v, it looks to be necessary to put
the flags check here. but it also makes the default switch case
to be a dead code. perhaps, I could remove the check against the
OP_NUM and keep the switch case. how about your opinion?

> 
> > +		return -EINVAL;
> > +
> > +	switch (hdr.flags & VFIO_NESTING_OP_MASK) {
> > +	case VFIO_IOMMU_NESTING_OP_BIND_PGTBL:
> > +		ret = vfio_iommu_handle_pgtbl_op(iommu, true, arg + minsz);
> > +		break;
> > +	case VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL:
> > +		ret = vfio_iommu_handle_pgtbl_op(iommu, false, arg + minsz);
> > +		break;
> > +	default:
> > +		ret = -EINVAL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)  { @@ -
> 3000,6 +3161,8 @@
> > static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		return vfio_iommu_type1_dirty_pages(iommu, arg);
> >  	case VFIO_IOMMU_PASID_REQUEST:
> >  		return vfio_iommu_type1_pasid_request(iommu, arg);
> > +	case VFIO_IOMMU_NESTING_OP:
> > +		return vfio_iommu_type1_nesting_op(iommu, arg);
> >  	default:
> >  		return -ENOTTY;
> >  	}
> > diff --git a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c
> > index 0ec4660..9e2e4b0 100644
> > --- a/drivers/vfio/vfio_pasid.c
> > +++ b/drivers/vfio/vfio_pasid.c
> > @@ -220,6 +220,8 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,
> >  	 * IOASID core will notify PASID users (e.g. IOMMU driver) to
> >  	 * teardown necessary structures depending on the to-be-freed
> >  	 * PASID.
> > +	 * Hold pasid_lock also avoids race with PASID usages like bind/
> > +	 * unbind page tables to requested PASID.
> >  	 */
> >  	mutex_lock(&vmm->pasid_lock);
> >  	while ((vid = vfio_find_pasid(vmm, min, max)) != NULL) @@ -228,6
> > +230,30 @@ void vfio_pasid_free_range(struct vfio_mm *vmm,  }
> > EXPORT_SYMBOL_GPL(vfio_pasid_free_range);
> >
> > +int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
> > +			   void (*fn)(ioasid_t id, void *data)) {
> > +	int ret;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +	ret = ioasid_set_for_each_ioasid(vmm->ioasid_set, fn, data);
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_for_each_pasid);
> > +
> > +void vfio_mm_pasid_lock(struct vfio_mm *vmm) {
> > +	mutex_lock(&vmm->pasid_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_lock);
> > +
> > +void vfio_mm_pasid_unlock(struct vfio_mm *vmm) {
> > +	mutex_unlock(&vmm->pasid_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_unlock);
> > +
> >  static int __init vfio_pasid_init(void)  {
> >  	mutex_init(&vfio_mm_lock);
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h index
> > 5c3d7a8..6a999c3 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -105,6 +105,11 @@ extern struct ioasid_set
> > *vfio_mm_ioasid_set(struct vfio_mm *vmm);  extern int
> > vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max);  extern void
> vfio_pasid_free_range(struct vfio_mm *vmm,
> >  				  ioasid_t min, ioasid_t max);
> > +extern int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
> > +				  void (*fn)(ioasid_t id, void *data)); extern void
> > +vfio_mm_pasid_lock(struct vfio_mm *vmm); extern void
> > +vfio_mm_pasid_unlock(struct vfio_mm *vmm);
> > +
> >  #else
> >  static inline struct vfio_mm *vfio_mm_get_from_task(struct
> > task_struct *task)  { @@ -129,6 +134,21 @@ static inline void
> > vfio_pasid_free_range(struct vfio_mm *vmm,
> >  					  ioasid_t min, ioasid_t max)
> >  {
> >  }
> > +
> > +static inline int vfio_mm_for_each_pasid(struct vfio_mm *vmm, void *data,
> > +					 void (*fn)(ioasid_t id, void *data)) {
> > +	return -ENOTTY;
> > +}
> > +
> > +static inline void vfio_mm_pasid_lock(struct vfio_mm *vmm) { }
> > +
> > +static inline void vfio_mm_pasid_unlock(struct vfio_mm *vmm) { }
> > +
> >  #endif /* CONFIG_VFIO_PASID */
> >
> >  /*
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index a4bc42e..a99bd71 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1215,6 +1215,42 @@ struct vfio_iommu_type1_pasid_request {
> >
> >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 18)
> >
> > +/**
> > + * VFIO_IOMMU_NESTING_OP - _IOW(VFIO_TYPE, VFIO_BASE + 19,
> > + *				struct vfio_iommu_type1_nesting_op)
> > + *
> > + * This interface allows userspace to utilize the nesting IOMMU
> > + * capabilities as reported in VFIO_IOMMU_TYPE1_INFO_CAP_NESTING
> > + * cap through VFIO_IOMMU_GET_INFO. For platforms which require
> > + * system wide PASID, PASID will be allocated by VFIO_IOMMU_PASID
> > + * _REQUEST.
> > + *
> > + * @data[] types defined for each op:
> > + *
> +=================+===============================================+
> > + * | NESTING OP      |      @data[]                                  |
> > + *
> +=================+===============================================+
> > + * | BIND_PGTBL      |      struct iommu_gpasid_bind_data            |
> > + * +-----------------+-----------------------------------------------+
> > + * | UNBIND_PGTBL    |      struct iommu_gpasid_bind_data            |
> > + *
> > ++-----------------+-----------------------------------------------+
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +struct vfio_iommu_type1_nesting_op {
> > +	__u32	argsz;
> > +	__u32	flags;
> > +#define VFIO_NESTING_OP_MASK	(0xffff) /* lower 16-bits for op */
> > +	__u8	data[];
> > +};
> > +
> > +enum {
> > +	VFIO_IOMMU_NESTING_OP_BIND_PGTBL,
> > +	VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL,
> > +	VFIO_IOMMU_NESTING_OP_NUM,
> > +};
> 
> "VFIO_IOMMU_NESTING_NUM_OPS" would be more consistent with the vfio uapi.

I see. will rename it if we decide to keep it.

Regards,
Yi Liu

> Thanks,
> 
> Alex
> 
> > +
> > +#define VFIO_IOMMU_NESTING_OP		_IO(VFIO_TYPE, VFIO_BASE + 19)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU
> > -------- */
> >
> >  /*


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest
  2020-09-11 22:13   ` Alex Williamson
@ 2020-09-12  7:17     ` Liu, Yi L
  0 siblings, 0 replies; 81+ messages in thread
From: Liu, Yi L @ 2020-09-12  7:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, baolu.lu, joro, Tian, Kevin, jacob.jun.pan, Raj,
	Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, jasowang,
	Wu, Hao, stefanha, iommu, kvm

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, September 12, 2020 6:13 AM
> 
> On Thu, 10 Sep 2020 03:45:30 -0700
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch exposes PCIe PASID capability to guest for assigned devices.
> > Existing vfio_pci driver hides it from guest by setting the capability
> > length as 0 in pci_ext_cap_length[].
> 
> This exposes the PASID capability, but it's still read-only, so this largely just helps
> userspace know where to emulate the capability, right?  Thanks,

oh, yes. This path only makes it visible to userspace. perhaps, I should
refine the commit message and the patch name. right?

Regards,
Yi Liu

> Alex
> 
> > And this patch only exposes PASID capability for devices which has
> > PCIe PASID extended struture in its configuration space. VFs will not
> > expose the PASID capability as they do not implement the PASID
> > extended structure in their config space. It is a TODO in future.
> > Related discussion can be found in below link:
> >
> > https://lore.kernel.org/kvm/20200407095801.648b1371@w520.home/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > ---
> > v5 -> v6:
> > *) add review-by from Eric Auger.
> >
> > v1 -> v2:
> > *) added in v2, but it was sent in a separate patchseries before
> > ---
> >  drivers/vfio/pci/vfio_pci_config.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_config.c
> > b/drivers/vfio/pci/vfio_pci_config.c
> > index d98843f..07ff2e6 100644
> > --- a/drivers/vfio/pci/vfio_pci_config.c
> > +++ b/drivers/vfio/pci/vfio_pci_config.c
> > @@ -95,7 +95,7 @@ static const u16 pci_ext_cap_length[PCI_EXT_CAP_ID_MAX +
> 1] = {
> >  	[PCI_EXT_CAP_ID_LTR]	=	PCI_EXT_CAP_LTR_SIZEOF,
> >  	[PCI_EXT_CAP_ID_SECPCI]	=	0,	/* not yet */
> >  	[PCI_EXT_CAP_ID_PMUX]	=	0,	/* not yet */
> > -	[PCI_EXT_CAP_ID_PASID]	=	0,	/* not yet */
> > +	[PCI_EXT_CAP_ID_PASID]	=	PCI_EXT_CAP_PASID_SIZEOF,
> >  };
> >
> >  /*


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace
  2020-09-11 20:16   ` Alex Williamson
@ 2020-09-12  8:24     ` Liu, Yi L
  0 siblings, 0 replies; 81+ messages in thread
From: Liu, Yi L @ 2020-09-12  8:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, baolu.lu, joro, Tian, Kevin, jacob.jun.pan, Raj,
	Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, jasowang,
	Wu, Hao, stefanha, iommu, kvm

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, September 12, 2020 4:17 AM
> 
> On Thu, 10 Sep 2020 03:45:20 -0700
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch exports iommu nesting capability info to user space through
> > VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
> > PASID alloc/free, bind page table, and cache invalidation) and the
> > vendor specific format information for first level/stage page table
> > that will be bound to.
> >
> > The nesting info is available only after container set to be NESTED type.
> > Current implementation imposes one limitation - one nesting container
> > should include at most one iommu group. The philosophy of vfio
> > container is having all groups/devices within the container share the
> > same IOMMU context. When vSVA is enabled, one IOMMU context could
> > include one 2nd- level address space and multiple 1st-level address
> > spaces. While the 2nd-level address space is reasonably sharable by
> > multiple groups, blindly sharing 1st-level address spaces across all
> > groups within the container might instead break the guest expectation.
> > In the future sub/super container concept might be introduced to allow
> > partial address space sharing within an IOMMU context. But for now
> > let's go with this restriction by requiring singleton container for
> > using nesting iommu features. Below link has the related discussion about this
> decision.
> >
> > https://lore.kernel.org/kvm/20200515115924.37e6996d@w520.home/
> >
> > This patch also changes the NESTING type container behaviour.
> > Something that would have succeeded before will now fail: Before this
> > series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have
> > succeeded even if the SMMU didn't support stage-2, as the driver would
> > have silently fallen back on stage-1 mappings (which work exactly the
> > same as stage-2 only since there was no nesting supported). After the
> > series, we do check for DOMAIN_ATTR_NESTING so if user asks for
> > VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the
> > ioctl fails. But it should be a good fix and completely harmless. Detail can be found
> in below link as well.
> >
> > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> > v6 -> v7:
> > *) using vfio_info_add_capability() for adding nesting cap per suggestion
> >    from Eric.
> >
> > v5 -> v6:
> > *) address comments against v5 from Eric Auger.
> > *) don't report nesting cap to userspace if the nesting_info->format is
> >    invalid.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) return struct iommu_nesting_info for
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >    cap is much "cheap", if needs extension in future, just define another cap.
> >    https://lore.kernel.org/kvm/20200708132947.5b7ee954@x1.home/
> >
> > v3 -> v4:
> > *) address comments against v3.
> >
> > v1 -> v2:
> > *) added in v2
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 92 +++++++++++++++++++++++++++++++++++-
> -----
> >  include/uapi/linux/vfio.h       | 19 +++++++++
> >  2 files changed, 99 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index c992973..3c0048b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >  		 "Maximum number of user DMA mappings per container (65535).");
> >
> >  struct vfio_iommu {
> > -	struct list_head	domain_list;
> > -	struct list_head	iova_list;
> > -	struct vfio_domain	*external_domain; /* domain for external user */
> > -	struct mutex		lock;
> > -	struct rb_root		dma_list;
> > -	struct blocking_notifier_head notifier;
> > -	unsigned int		dma_avail;
> > -	uint64_t		pgsize_bitmap;
> > -	bool			v2;
> > -	bool			nesting;
> > -	bool			dirty_page_tracking;
> > -	bool			pinned_page_dirty_scope;
> > +	struct list_head		domain_list;
> > +	struct list_head		iova_list;
> > +	/* domain for external user */
> > +	struct vfio_domain		*external_domain;
> > +	struct mutex			lock;
> > +	struct rb_root			dma_list;
> > +	struct blocking_notifier_head	notifier;
> > +	unsigned int			dma_avail;
> > +	uint64_t			pgsize_bitmap;
> > +	bool				v2;
> > +	bool				nesting;
> > +	bool				dirty_page_tracking;
> > +	bool				pinned_page_dirty_scope;
> > +	struct iommu_nesting_info	*nesting_info;
> 
> Nit, not as important as the previous alignment, but might as well move this up with
> the uint64_t pgsize_bitmap with the bools at the end of the structure to avoid adding
> new gaps.

got it. :-)

> 
> >  };
> >
> >  struct vfio_domain {
> > @@ -130,6 +132,9 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> >  					(!list_empty(&iommu->domain_list))
> >
> > +#define CONTAINER_HAS_DOMAIN(iommu)	(((iommu)->external_domain) || \
> > +					 (!list_empty(&(iommu)->domain_list)))
> > +
> >  #define DIRTY_BITMAP_BYTES(n)	(ALIGN(n, BITS_PER_TYPE(u64)) /
> BITS_PER_BYTE)
> >
> >  /*
> > @@ -1992,6 +1997,13 @@ static void vfio_iommu_iova_insert_copy(struct
> > vfio_iommu *iommu,
> >
> >  	list_splice_tail(iova_copy, iova);
> >  }
> > +
> > +static void vfio_iommu_release_nesting_info(struct vfio_iommu *iommu)
> > +{
> > +	kfree(iommu->nesting_info);
> > +	iommu->nesting_info = NULL;
> > +}
> > +
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  					 struct iommu_group *iommu_group)
> { @@ -2022,6 +2034,12 @@
> > static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  		}
> >  	}
> >
> > +	/* Nesting type container can include only one group */
> > +	if (iommu->nesting && CONTAINER_HAS_DOMAIN(iommu)) {
> > +		mutex_unlock(&iommu->lock);
> > +		return -EINVAL;
> > +	}
> > +
> >  	group = kzalloc(sizeof(*group), GFP_KERNEL);
> >  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> >  	if (!group || !domain) {
> > @@ -2092,6 +2110,25 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> >  	if (ret)
> >  		goto out_domain;
> >
> > +	/* Nesting cap info is available only after attaching */
> > +	if (iommu->nesting) {
> > +		int size = sizeof(struct iommu_nesting_info);
> > +
> > +		iommu->nesting_info = kzalloc(size, GFP_KERNEL);
> > +		if (!iommu->nesting_info) {
> > +			ret = -ENOMEM;
> > +			goto out_detach;
> > +		}
> > +
> > +		/* Now get the nesting info */
> > +		iommu->nesting_info->argsz = size;
> > +		ret = iommu_domain_get_attr(domain->domain,
> > +					    DOMAIN_ATTR_NESTING,
> > +					    iommu->nesting_info);
> > +		if (ret)
> > +			goto out_detach;
> > +	}
> > +
> >  	/* Get aperture info */
> >  	iommu_domain_get_attr(domain->domain, DOMAIN_ATTR_GEOMETRY,
> &geo);
> >
> > @@ -2201,6 +2238,7 @@ static int vfio_iommu_type1_attach_group(void
> *iommu_data,
> >  	return 0;
> >
> >  out_detach:
> > +	vfio_iommu_release_nesting_info(iommu);
> >  	vfio_iommu_detach_group(domain, group);
> >  out_domain:
> >  	iommu_domain_free(domain->domain);
> > @@ -2401,6 +2439,8 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
> >  					vfio_iommu_unmap_unpin_all(iommu);
> >  				else
> >
> 	vfio_iommu_unmap_unpin_reaccount(iommu);
> > +
> > +				vfio_iommu_release_nesting_info(iommu);
> >  			}
> >  			iommu_domain_free(domain->domain);
> >  			list_del(&domain->next);
> > @@ -2609,6 +2649,32 @@ static int vfio_iommu_migration_build_caps(struct
> vfio_iommu *iommu,
> >  	return vfio_info_add_capability(caps, &cap_mig.header,
> > sizeof(cap_mig));  }
> >
> > +static int vfio_iommu_add_nesting_cap(struct vfio_iommu *iommu,
> > +				      struct vfio_info_cap *caps) {
> > +	struct vfio_iommu_type1_info_cap_nesting nesting_cap;
> > +	size_t size;
> > +
> > +	/* when nesting_info is null, no need to go further */
> > +	if (!iommu->nesting_info)
> > +		return 0;
> > +
> > +	/* when @format of nesting_info is 0, fail the call */
> > +	if (iommu->nesting_info->format == 0)
> > +		return -ENOENT;
> 
> 
> Should we fail this in the attach_group?  Seems the user would be in a bad situation
> here if they successfully created a nesting container but can't get info.  Is there
> backwards compatibility we're trying to maintain with this?

agreed. fail it in attach_group would be better.

> > +
> > +	size = offsetof(struct vfio_iommu_type1_info_cap_nesting, info) +
> > +	       iommu->nesting_info->argsz;
> > +
> > +	nesting_cap.header.id = VFIO_IOMMU_TYPE1_INFO_CAP_NESTING;
> > +	nesting_cap.header.version = 1;
> > +
> > +	memcpy(&nesting_cap.info, iommu->nesting_info,
> > +	       iommu->nesting_info->argsz);
> > +
> > +	return vfio_info_add_capability(caps, &nesting_cap.header, size); }
> > +
> >  static int vfio_iommu_type1_get_info(struct vfio_iommu *iommu,
> >  				     unsigned long arg)
> >  {
> > @@ -2644,6 +2710,8 @@ static int vfio_iommu_type1_get_info(struct
> vfio_iommu *iommu,
> >  	if (!ret)
> >  		ret = vfio_iommu_iova_build_caps(iommu, &caps);
> >
> > +	ret = vfio_iommu_add_nesting_cap(iommu, &caps);
> 
> Why don't we follow either the naming scheme or the error handling scheme of the
> previous caps?  Seems like this should be:
> 
> if (!ret)
> 	ret = vfio_iommu_nesting_build_caps(...);

got it. should follow the error handling scheme and also the naming. will
do it.

Regards,
Yi Liu

> Thanks,
> 
> Alex
> 
> 
> > +
> >  	mutex_unlock(&iommu->lock);
> >
> >  	if (ret)
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9204705..ff40f9e 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -14,6 +14,7 @@
> >
> >  #include <linux/types.h>
> >  #include <linux/ioctl.h>
> > +#include <linux/iommu.h>
> >
> >  #define VFIO_API_VERSION	0
> >
> > @@ -1039,6 +1040,24 @@ struct vfio_iommu_type1_info_cap_migration {
> >  	__u64	max_dirty_bitmap_size;		/* in bytes */
> >  };
> >
> > +/*
> > + * The nesting capability allows to report the related capability
> > + * and info for nesting iommu type.
> > + *
> > + * The structures below define version 1 of this capability.
> > + *
> > + * Nested capabilities should be checked by the userspace after
> > + * setting VFIO_TYPE1_NESTING_IOMMU.
> > + *
> > + * @info: the nesting info provided by IOMMU driver.
> > + */
> > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  3
> > +
> > +struct vfio_iommu_type1_info_cap_nesting {
> > +	struct	vfio_info_cap_header header;
> > +	struct	iommu_nesting_info info;
> > +};
> > +
> >  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> >
> >  /**


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
                   ` (15 preceding siblings ...)
  2020-09-10 10:45 ` [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info Liu Yi L
@ 2020-09-14  4:20 ` Jason Wang
  2020-09-14  8:01   ` Tian, Kevin
  2020-09-14 13:31   ` Jean-Philippe Brucker
  16 siblings, 2 replies; 81+ messages in thread
From: Jason Wang @ 2020-09-14  4:20 UTC (permalink / raw)
  To: Liu Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: kevin.tian, jacob.jun.pan, ashok.raj, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, hao.wu, stefanha, iommu, kvm,
	Jason Gunthorpe, Michael S. Tsirkin


On 2020/9/10 下午6:45, Liu Yi L wrote:
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
>
> This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> guest application address space with passthru devices. This is called
> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> changes. For IOMMU and QEMU changes, they are in separate series (listed
> in the "Related series").
>
> The high-level architecture for SVA virtualization is as below, the key
> design of vSVA support is to utilize the dual-stage IOMMU translation (
> also known as IOMMU nesting translation) capability in host IOMMU.
>
>
>      .-------------.  .---------------------------.
>      |   vIOMMU    |  | Guest process CR3, FL only|
>      |             |  '---------------------------'
>      .----------------/
>      | PASID Entry |--- PASID cache flush -
>      '-------------'                       |
>      |             |                       V
>      |             |                CR3 in GPA
>      '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>        v        v                          v
> Host
>      .-------------.  .----------------------.
>      |   pIOMMU    |  | Bind FL for GVA-GPA  |
>      |             |  '----------------------'
>      .----------------/  |
>      | PASID Entry |     V (Nested xlate)
>      '----------------\.------------------------------.
>      |             ||SL for GPA-HPA, default domain|
>      |             |   '------------------------------'
>      '-------------'
> Where:
>   - FL = First level/stage one page tables
>   - SL = Second level/stage two page tables
>
> Patch Overview:
>   1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003, 0015 , 0016)
>   2. vfio support for PASID allocation and free for VMs (patch 0004, 0005, 0007)
>   3. a fix to a revisit in intel iommu driver (patch 0006)
>   4. vfio support for binding guest page table to host (patch 0008, 0009, 0010)
>   5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
>   6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
>   7. expose PASID capability to VM (patch 0013)
>   8. add doc for VFIO dual stage control (patch 0014)


If it's possible, I would suggest a generic uAPI instead of a VFIO 
specific one.

Jason suggest something like /dev/sva. There will be a lot of other 
subsystems that could benefit from this (e.g vDPA).

Have you ever considered this approach?

Thanks


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14  4:20 ` [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Jason Wang
@ 2020-09-14  8:01   ` Tian, Kevin
  2020-09-14  8:57     ` Jason Wang
  2020-09-14 13:31   ` Jean-Philippe Brucker
  1 sibling, 1 reply; 81+ messages in thread
From: Tian, Kevin @ 2020-09-14  8:01 UTC (permalink / raw)
  To: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, September 14, 2020 12:20 PM
> 
> On 2020/9/10 下午6:45, Liu Yi L wrote:
> > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > Intel platforms allows address space sharing between device DMA and
> > applications. SVA can reduce programming complexity and enhance
> security.
> >
> > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > guest application address space with passthru devices. This is called
> > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > in the "Related series").
> >
> > The high-level architecture for SVA virtualization is as below, the key
> > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > also known as IOMMU nesting translation) capability in host IOMMU.
> >
> >
> >      .-------------.  .---------------------------.
> >      |   vIOMMU    |  | Guest process CR3, FL only|
> >      |             |  '---------------------------'
> >      .----------------/
> >      | PASID Entry |--- PASID cache flush -
> >      '-------------'                       |
> >      |             |                       V
> >      |             |                CR3 in GPA
> >      '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >        v        v                          v
> > Host
> >      .-------------.  .----------------------.
> >      |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >      |             |  '----------------------'
> >      .----------------/  |
> >      | PASID Entry |     V (Nested xlate)
> >      '----------------\.------------------------------.
> >      |             ||SL for GPA-HPA, default domain|
> >      |             |   '------------------------------'
> >      '-------------'
> > Where:
> >   - FL = First level/stage one page tables
> >   - SL = Second level/stage two page tables
> >
> > Patch Overview:
> >   1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003,
> 0015 , 0016)
> >   2. vfio support for PASID allocation and free for VMs (patch 0004, 0005,
> 0007)
> >   3. a fix to a revisit in intel iommu driver (patch 0006)
> >   4. vfio support for binding guest page table to host (patch 0008, 0009,
> 0010)
> >   5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
> >   6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
> >   7. expose PASID capability to VM (patch 0013)
> >   8. add doc for VFIO dual stage control (patch 0014)
> 
> 
> If it's possible, I would suggest a generic uAPI instead of a VFIO
> specific one.
> 
> Jason suggest something like /dev/sva. There will be a lot of other
> subsystems that could benefit from this (e.g vDPA).
> 

Just be curious. When does vDPA subsystem plan to support vSVA and 
when could one expect a SVA-capable vDPA device in market?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14  8:01   ` Tian, Kevin
@ 2020-09-14  8:57     ` Jason Wang
  2020-09-14 10:38       ` Tian, Kevin
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Wang @ 2020-09-14  8:57 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin


On 2020/9/14 下午4:01, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Monday, September 14, 2020 12:20 PM
>>
>> On 2020/9/10 下午6:45, Liu Yi L wrote:
>>> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
>>> Intel platforms allows address space sharing between device DMA and
>>> applications. SVA can reduce programming complexity and enhance
>> security.
>>> This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
>>> guest application address space with passthru devices. This is called
>>> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
>>> changes. For IOMMU and QEMU changes, they are in separate series (listed
>>> in the "Related series").
>>>
>>> The high-level architecture for SVA virtualization is as below, the key
>>> design of vSVA support is to utilize the dual-stage IOMMU translation (
>>> also known as IOMMU nesting translation) capability in host IOMMU.
>>>
>>>
>>>       .-------------.  .---------------------------.
>>>       |   vIOMMU    |  | Guest process CR3, FL only|
>>>       |             |  '---------------------------'
>>>       .----------------/
>>>       | PASID Entry |--- PASID cache flush -
>>>       '-------------'                       |
>>>       |             |                       V
>>>       |             |                CR3 in GPA
>>>       '-------------'
>>> Guest
>>> ------| Shadow |--------------------------|--------
>>>         v        v                          v
>>> Host
>>>       .-------------.  .----------------------.
>>>       |   pIOMMU    |  | Bind FL for GVA-GPA  |
>>>       |             |  '----------------------'
>>>       .----------------/  |
>>>       | PASID Entry |     V (Nested xlate)
>>>       '----------------\.------------------------------.
>>>       |             ||SL for GPA-HPA, default domain|
>>>       |             |   '------------------------------'
>>>       '-------------'
>>> Where:
>>>    - FL = First level/stage one page tables
>>>    - SL = Second level/stage two page tables
>>>
>>> Patch Overview:
>>>    1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003,
>> 0015 , 0016)
>>>    2. vfio support for PASID allocation and free for VMs (patch 0004, 0005,
>> 0007)
>>>    3. a fix to a revisit in intel iommu driver (patch 0006)
>>>    4. vfio support for binding guest page table to host (patch 0008, 0009,
>> 0010)
>>>    5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
>>>    6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
>>>    7. expose PASID capability to VM (patch 0013)
>>>    8. add doc for VFIO dual stage control (patch 0014)
>>
>> If it's possible, I would suggest a generic uAPI instead of a VFIO
>> specific one.
>>
>> Jason suggest something like /dev/sva. There will be a lot of other
>> subsystems that could benefit from this (e.g vDPA).
>>
> Just be curious. When does vDPA subsystem plan to support vSVA and
> when could one expect a SVA-capable vDPA device in market?
>
> Thanks
> Kevin


vSVA is in the plan but there's no ETA. I think we might start the work 
after control vq support.  It will probably start from SVA first and 
then vSVA (since it might require platform support).

For the device part, it really depends on the chipset and other device 
vendors. We plan to do the prototype in virtio by introducing PASID 
support in the spec.

Thanks



^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14  8:57     ` Jason Wang
@ 2020-09-14 10:38       ` Tian, Kevin
  2020-09-14 11:38         ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Tian, Kevin @ 2020-09-14 10:38 UTC (permalink / raw)
  To: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jean-philippe, Raj, Ashok, kvm, Michael S. Tsirkin, stefanha,
	Tian, Jun J, iommu, Sun, Yi Y, Jason Gunthorpe, Wu, Hao

> From: Jason Wang
> Sent: Monday, September 14, 2020 4:57 PM
> 
> On 2020/9/14 下午4:01, Tian, Kevin wrote:
> >> From: Jason Wang <jasowang@redhat.com>
> >> Sent: Monday, September 14, 2020 12:20 PM
> >>
> >> On 2020/9/10 下午6:45, Liu Yi L wrote:
> >>> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> >>> Intel platforms allows address space sharing between device DMA and
> >>> applications. SVA can reduce programming complexity and enhance
> >> security.
> >>> This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> >>> guest application address space with passthru devices. This is called
> >>> vSVA in this series. The whole vSVA enabling requires
> QEMU/VFIO/IOMMU
> >>> changes. For IOMMU and QEMU changes, they are in separate series
> (listed
> >>> in the "Related series").
> >>>
> >>> The high-level architecture for SVA virtualization is as below, the key
> >>> design of vSVA support is to utilize the dual-stage IOMMU translation (
> >>> also known as IOMMU nesting translation) capability in host IOMMU.
> >>>
> >>>
> >>>       .-------------.  .---------------------------.
> >>>       |   vIOMMU    |  | Guest process CR3, FL only|
> >>>       |             |  '---------------------------'
> >>>       .----------------/
> >>>       | PASID Entry |--- PASID cache flush -
> >>>       '-------------'                       |
> >>>       |             |                       V
> >>>       |             |                CR3 in GPA
> >>>       '-------------'
> >>> Guest
> >>> ------| Shadow |--------------------------|--------
> >>>         v        v                          v
> >>> Host
> >>>       .-------------.  .----------------------.
> >>>       |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >>>       |             |  '----------------------'
> >>>       .----------------/  |
> >>>       | PASID Entry |     V (Nested xlate)
> >>>       '----------------\.------------------------------.
> >>>       |             ||SL for GPA-HPA, default domain|
> >>>       |             |   '------------------------------'
> >>>       '-------------'
> >>> Where:
> >>>    - FL = First level/stage one page tables
> >>>    - SL = Second level/stage two page tables
> >>>
> >>> Patch Overview:
> >>>    1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003,
> >> 0015 , 0016)
> >>>    2. vfio support for PASID allocation and free for VMs (patch 0004, 0005,
> >> 0007)
> >>>    3. a fix to a revisit in intel iommu driver (patch 0006)
> >>>    4. vfio support for binding guest page table to host (patch 0008, 0009,
> >> 0010)
> >>>    5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
> >>>    6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
> >>>    7. expose PASID capability to VM (patch 0013)
> >>>    8. add doc for VFIO dual stage control (patch 0014)
> >>
> >> If it's possible, I would suggest a generic uAPI instead of a VFIO
> >> specific one.
> >>
> >> Jason suggest something like /dev/sva. There will be a lot of other
> >> subsystems that could benefit from this (e.g vDPA).
> >>
> > Just be curious. When does vDPA subsystem plan to support vSVA and
> > when could one expect a SVA-capable vDPA device in market?
> >
> > Thanks
> > Kevin
> 
> 
> vSVA is in the plan but there's no ETA. I think we might start the work
> after control vq support.  It will probably start from SVA first and
> then vSVA (since it might require platform support).
> 
> For the device part, it really depends on the chipset and other device
> vendors. We plan to do the prototype in virtio by introducing PASID
> support in the spec.
> 

Thanks for the info. Then here is my thought.

First, I don't think /dev/sva is the right interface. Once we start 
considering such generic uAPI, it better behaves as the one interface
for all kinds of DMA requirements on device/subdevice passthrough.
Nested page table thru vSVA is one way. Manual map/unmap is
another way. It doesn't make sense to have one through generic
uAPI and the other through subsystem specific uAPI. In the end
the interface might become /dev/iommu, for delegating certain
IOMMU operations to userspace. 

In addition, delegated IOMMU operations have different scopes.
PASID allocation is per process/VM. pgtbl-bind/unbind, map/unmap 
and cache invalidation are per iommu domain. page request/
response are per device/subdevice. This requires the uAPI to also
understand and manage the association between domain/group/
device/subdevice (such as group attach/detach), instead of doing 
it separately in VFIO or vDPA as today. 

Based on above, I feel a more reasonable way is to first make a 
/dev/iommu uAPI supporting DMA map/unmap usages and then 
introduce vSVA to it. Doing this order is because DMA map/unmap 
is widely used thus can better help verify the core logic with 
many existing devices. For vSVA, vDPA support has not be started
while VFIO support is close to be accepted. It doesn't make much 
sense by blocking the VFIO part until vDPA is ready for wide 
verification and /dev/iommu is mature enough. Yes, the newly-
added uAPIs will be finally deprecated when /dev/iommu starts 
to support vSVA. But using /dev/iommu will anyway deprecate 
some existing VFIO IOMMU uAPIs at that time...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 10:38       ` Tian, Kevin
@ 2020-09-14 11:38         ` Jason Gunthorpe
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 11:38 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu,
	joro, jean-philippe, Raj, Ashok, kvm, Michael S. Tsirkin,
	stefanha, Tian, Jun J, iommu, Sun, Yi Y, Wu, Hao

On Mon, Sep 14, 2020 at 10:38:10AM +0000, Tian, Kevin wrote:

> is widely used thus can better help verify the core logic with
> many existing devices. For vSVA, vDPA support has not be started
> while VFIO support is close to be accepted. It doesn't make much
> sense by blocking the VFIO part until vDPA is ready for wide

You keep saying that, but if we keep ignoring the right architecture
we end up with a mess inside VFIO just to save some development
time. That is usually not how the kernel process works.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14  4:20 ` [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Jason Wang
  2020-09-14  8:01   ` Tian, Kevin
@ 2020-09-14 13:31   ` Jean-Philippe Brucker
  2020-09-14 13:47     ` Jason Gunthorpe
  2020-09-16  2:29     ` Jason Wang
  1 sibling, 2 replies; 81+ messages in thread
From: Jean-Philippe Brucker @ 2020-09-14 13:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Liu Yi L, alex.williamson, eric.auger, baolu.lu, joro,
	kevin.tian, jacob.jun.pan, ashok.raj, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

On Mon, Sep 14, 2020 at 12:20:10PM +0800, Jason Wang wrote:
> 
> On 2020/9/10 下午6:45, Liu Yi L wrote:
> > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > Intel platforms allows address space sharing between device DMA and
> > applications. SVA can reduce programming complexity and enhance security.
> > 
> > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > guest application address space with passthru devices. This is called
> > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > in the "Related series").
> > 
> > The high-level architecture for SVA virtualization is as below, the key
> > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > also known as IOMMU nesting translation) capability in host IOMMU.
> > 
> > 
> >      .-------------.  .---------------------------.
> >      |   vIOMMU    |  | Guest process CR3, FL only|
> >      |             |  '---------------------------'
> >      .----------------/
> >      | PASID Entry |--- PASID cache flush -
> >      '-------------'                       |
> >      |             |                       V
> >      |             |                CR3 in GPA
> >      '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >        v        v                          v
> > Host
> >      .-------------.  .----------------------.
> >      |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >      |             |  '----------------------'
> >      .----------------/  |
> >      | PASID Entry |     V (Nested xlate)
> >      '----------------\.------------------------------.
> >      |             ||SL for GPA-HPA, default domain|
> >      |             |   '------------------------------'
> >      '-------------'
> > Where:
> >   - FL = First level/stage one page tables
> >   - SL = Second level/stage two page tables
> > 
> > Patch Overview:
> >   1. reports IOMMU nesting info to userspace ( patch 0001, 0002, 0003, 0015 , 0016)
> >   2. vfio support for PASID allocation and free for VMs (patch 0004, 0005, 0007)
> >   3. a fix to a revisit in intel iommu driver (patch 0006)
> >   4. vfio support for binding guest page table to host (patch 0008, 0009, 0010)
> >   5. vfio support for IOMMU cache invalidation from VMs (patch 0011)
> >   6. vfio support for vSVA usage on IOMMU-backed mdevs (patch 0012)
> >   7. expose PASID capability to VM (patch 0013)
> >   8. add doc for VFIO dual stage control (patch 0014)
> 
> 
> If it's possible, I would suggest a generic uAPI instead of a VFIO specific
> one.

A large part of this work is already generic uAPI, in
include/uapi/linux/iommu.h. This patchset connects that generic interface
to the pre-existing VFIO uAPI that deals with IOMMU mappings of an
assigned device. But the bulk of the work is done by the IOMMU subsystem,
and is available to all device drivers.

> Jason suggest something like /dev/sva. There will be a lot of other
> subsystems that could benefit from this (e.g vDPA).

Do you have a more precise idea of the interface /dev/sva would provide,
how it would interact with VFIO and others?  vDPA could transport the
generic iommu.h structures via its own uAPI, and call the IOMMU API
directly without going through an intermediate /dev/sva handle.

Thanks,
Jean

> Have you ever considered this approach?
> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 13:31   ` Jean-Philippe Brucker
@ 2020-09-14 13:47     ` Jason Gunthorpe
  2020-09-14 16:22       ` Raj, Ashok
  2020-09-16  2:29     ` Jason Wang
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 13:47 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jason Wang, Liu Yi L, alex.williamson, eric.auger, baolu.lu,
	joro, kevin.tian, jacob.jun.pan, ashok.raj, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Michael S. Tsirkin

On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:

> > Jason suggest something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g vDPA).
> 
> Do you have a more precise idea of the interface /dev/sva would provide,
> how it would interact with VFIO and others?  vDPA could transport the
> generic iommu.h structures via its own uAPI, and call the IOMMU API
> directly without going through an intermediate /dev/sva handle.

Prior to PASID IOMMU really only makes sense as part of vfio-pci
because the iommu can only key on the BDF. That can't work unless the
whole PCI function can be assigned. It is hard to see how a shared PCI
device can work with IOMMU like this, so may as well use vfio.

SVA and various vIOMMU models change this, a shared PCI driver can
absoultely work with a PASID that is assigned to a VM safely, and
actually don't need to know if their PASID maps a mm_struct or
something else.

So, some /dev/sva is another way to allocate a PASID that is not 1:1
with mm_struct, as the existing SVA stuff enforces. ie it is a way to
program the DMA address map of the PASID.

This new PASID allocator would match the guest memory layout and
support the IOMMU nesting stuff needed for vPASID.

This is the common code for the complex cases of virtualization with
PASID, shared by all user DMA drivers, including VFIO.

It doesn't make a lot of sense to build a uAPI exclusive to VFIO just
for PASID and vPASID. We already know everything doing user DMA will
eventually need this stuff.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 13:47     ` Jason Gunthorpe
@ 2020-09-14 16:22       ` Raj, Ashok
  2020-09-14 16:33         ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Raj, Ashok @ 2020-09-14 16:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jean-Philippe Brucker, Jason Wang, Liu Yi L, alex.williamson,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Ashok Raj

Hi Jason,

On Mon, Sep 14, 2020 at 10:47:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:
> 
> > > Jason suggest something like /dev/sva. There will be a lot of other
> > > subsystems that could benefit from this (e.g vDPA).
> > 
> > Do you have a more precise idea of the interface /dev/sva would provide,
> > how it would interact with VFIO and others?  vDPA could transport the
> > generic iommu.h structures via its own uAPI, and call the IOMMU API
> > directly without going through an intermediate /dev/sva handle.
> 
> Prior to PASID IOMMU really only makes sense as part of vfio-pci
> because the iommu can only key on the BDF. That can't work unless the
> whole PCI function can be assigned. It is hard to see how a shared PCI
> device can work with IOMMU like this, so may as well use vfio.
> 
> SVA and various vIOMMU models change this, a shared PCI driver can
> absoultely work with a PASID that is assigned to a VM safely, and
> actually don't need to know if their PASID maps a mm_struct or
> something else.

Well, IOMMU does care if its a native mm_struct or something that belongs
to guest. Because you need ability to forward page-requests and pickup
page-responses from guest. Since there is just one PRQ on the IOMMU and
responses can't be sent directly. You have to depend on vIOMMU type
interface in guest to make all of this magic work right?

> 
> So, some /dev/sva is another way to allocate a PASID that is not 1:1
> with mm_struct, as the existing SVA stuff enforces. ie it is a way to
> program the DMA address map of the PASID.
> 
> This new PASID allocator would match the guest memory layout and

Not sure what you mean by "match guest memory layout"? 
Probably, meaning first level is gVA or gIOVA? 

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 16:22       ` Raj, Ashok
@ 2020-09-14 16:33         ` Jason Gunthorpe
  2020-09-14 16:58           ` Alex Williamson
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 16:33 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Jean-Philippe Brucker, Jason Wang, Liu Yi L, alex.williamson,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, Sep 14, 2020 at 09:22:47AM -0700, Raj, Ashok wrote:
> Hi Jason,
> 
> On Mon, Sep 14, 2020 at 10:47:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:
> > 
> > > > Jason suggest something like /dev/sva. There will be a lot of other
> > > > subsystems that could benefit from this (e.g vDPA).
> > > 
> > > Do you have a more precise idea of the interface /dev/sva would provide,
> > > how it would interact with VFIO and others?  vDPA could transport the
> > > generic iommu.h structures via its own uAPI, and call the IOMMU API
> > > directly without going through an intermediate /dev/sva handle.
> > 
> > Prior to PASID IOMMU really only makes sense as part of vfio-pci
> > because the iommu can only key on the BDF. That can't work unless the
> > whole PCI function can be assigned. It is hard to see how a shared PCI
> > device can work with IOMMU like this, so may as well use vfio.
> > 
> > SVA and various vIOMMU models change this, a shared PCI driver can
> > absoultely work with a PASID that is assigned to a VM safely, and
> > actually don't need to know if their PASID maps a mm_struct or
> > something else.
> 
> Well, IOMMU does care if its a native mm_struct or something that belongs
> to guest. Because you need ability to forward page-requests and pickup
> page-responses from guest. Since there is just one PRQ on the IOMMU and
> responses can't be sent directly. You have to depend on vIOMMU type
> interface in guest to make all of this magic work right?

Yes, IOMMU cares, but not the PCI Driver. It just knows it has a
PASID. Details on how page faultings is handled or how the mapping is
setup is abstracted by the PASID.

> > This new PASID allocator would match the guest memory layout and
> 
> Not sure what you mean by "match guest memory layout"? 
> Probably, meaning first level is gVA or gIOVA? 

It means whatever the qemu/viommu/guest/etc needs across all the
IOMMU/arch implementations.

Basically, there should only be two ways to get a PASID
 - From mm_struct that mirrors the creating process
 - Via '/dev/sva' which has an complete interface to create and
   control a PASID suitable for virtualization and more

VFIO should not have its own special way to get a PASID.

Jason
 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 16:33         ` Jason Gunthorpe
@ 2020-09-14 16:58           ` Alex Williamson
  2020-09-14 17:41             ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-14 16:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, 14 Sep 2020 13:33:54 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Sep 14, 2020 at 09:22:47AM -0700, Raj, Ashok wrote:
> > Hi Jason,
> > 
> > On Mon, Sep 14, 2020 at 10:47:38AM -0300, Jason Gunthorpe wrote:  
> > > On Mon, Sep 14, 2020 at 03:31:13PM +0200, Jean-Philippe Brucker wrote:
> > >   
> > > > > Jason suggest something like /dev/sva. There will be a lot of other
> > > > > subsystems that could benefit from this (e.g vDPA).  
> > > > 
> > > > Do you have a more precise idea of the interface /dev/sva would provide,
> > > > how it would interact with VFIO and others?  vDPA could transport the
> > > > generic iommu.h structures via its own uAPI, and call the IOMMU API
> > > > directly without going through an intermediate /dev/sva handle.  
> > > 
> > > Prior to PASID IOMMU really only makes sense as part of vfio-pci
> > > because the iommu can only key on the BDF. That can't work unless the
> > > whole PCI function can be assigned. It is hard to see how a shared PCI
> > > device can work with IOMMU like this, so may as well use vfio.
> > > 
> > > SVA and various vIOMMU models change this, a shared PCI driver can
> > > absoultely work with a PASID that is assigned to a VM safely, and
> > > actually don't need to know if their PASID maps a mm_struct or
> > > something else.  
> > 
> > Well, IOMMU does care if its a native mm_struct or something that belongs
> > to guest. Because you need ability to forward page-requests and pickup
> > page-responses from guest. Since there is just one PRQ on the IOMMU and
> > responses can't be sent directly. You have to depend on vIOMMU type
> > interface in guest to make all of this magic work right?  
> 
> Yes, IOMMU cares, but not the PCI Driver. It just knows it has a
> PASID. Details on how page faultings is handled or how the mapping is
> setup is abstracted by the PASID.
> 
> > > This new PASID allocator would match the guest memory layout and  
> > 
> > Not sure what you mean by "match guest memory layout"? 
> > Probably, meaning first level is gVA or gIOVA?   
> 
> It means whatever the qemu/viommu/guest/etc needs across all the
> IOMMU/arch implementations.
> 
> Basically, there should only be two ways to get a PASID
>  - From mm_struct that mirrors the creating process
>  - Via '/dev/sva' which has an complete interface to create and
>    control a PASID suitable for virtualization and more
> 
> VFIO should not have its own special way to get a PASID.

"its own special way" is arguable, VFIO is just making use of what's
being proposed as the uapi via its existing IOMMU interface.  PASIDs
are also a system resource, so we require some degree of access control
and quotas for management of PASIDs.  Does libvirt now get involved to
know whether an assigned device requires PASIDs such that access to
this dev file is provided to QEMU?  How does the kernel validate usage
or implement quotas when disconnected from device ownership?  PASIDs
would be an obvious DoS path if any user can create arbitrary
allocations.  If we can move code out of VFIO, I'm all for it, but I
think it needs to be better defined than "implement magic universal sva
uapi interface" before we can really consider it.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 16:58           ` Alex Williamson
@ 2020-09-14 17:41             ` Jason Gunthorpe
  2020-09-14 18:23               ` Alex Williamson
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 17:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
 
> "its own special way" is arguable, VFIO is just making use of what's
> being proposed as the uapi via its existing IOMMU interface.

I mean, if we have a /dev/sva then it makes no sense to extend the
VFIO interfaces with the same stuff. VFIO should simply accept a PASID
created from /dev/sva and use it just like any other user-DMA driver
would.

> are also a system resource, so we require some degree of access control
> and quotas for management of PASIDs.  

This has already happened, the SVA patches generally allow unpriv user
space to allocate a PASID for their process.

If a device implements a mdev shared with a kernel driver (like IDXD)
then it will be sharing that PASID pool across both drivers. In this
case it makes no sense that VFIO has PASID quota logic because it has
an incomplete view. It could only make sense if VFIO is the exclusive
owner of the bus/device/function.

The tracking logic needs to be global.. Most probably in some kind of
PASID cgroup controller?

> know whether an assigned device requires PASIDs such that access to
> this dev file is provided to QEMU?

Wouldn't QEMU just open /dev/sva if it needs it? Like other dev files?
Why would it need something special?

> would be an obvious DoS path if any user can create arbitrary
> allocations.  If we can move code out of VFIO, I'm all for it, but I
> think it needs to be better defined than "implement magic universal sva
> uapi interface" before we can really consider it.  Thanks,

Jason began by saying VDPA will need this too, I agree with him.

I'm not sure why it would be "magic"? This series already gives a
pretty solid blueprint for what the interface would need to
have. Interested folks need to sit down and talk about it not just
default everything to being built inside VFIO.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 17:41             ` Jason Gunthorpe
@ 2020-09-14 18:23               ` Alex Williamson
  2020-09-14 19:00                 ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-14 18:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, 14 Sep 2020 14:41:21 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
>  
> > "its own special way" is arguable, VFIO is just making use of what's
> > being proposed as the uapi via its existing IOMMU interface.  
> 
> I mean, if we have a /dev/sva then it makes no sense to extend the
> VFIO interfaces with the same stuff. VFIO should simply accept a PASID
> created from /dev/sva and use it just like any other user-DMA driver
> would.

I don't think that's absolutely true.  By the same logic, we could say
that pci-sysfs provides access to PCI BAR and config space resources,
the VFIO device interface duplicates part of that interface therefore it
should be abandoned.  But in reality, VFIO providing access to those
resources puts those accesses within the scope and control of the VFIO
interface.  Ownership of a device through vfio is provided by allowing
the user access to the vfio group dev file, not by the group file, plus
some number of resource files, and the config file, and running with
admin permissions to see the full extent of config space.  Reserved
ranges for the IOMMU are also provided via sysfs, but VFIO includes a
capability on the IOMMU get_info ioctl for the user to learn about
available IOVA ranges w/o scraping through sysfs.

> > are also a system resource, so we require some degree of access control
> > and quotas for management of PASIDs.    
> 
> This has already happened, the SVA patches generally allow unpriv user
> space to allocate a PASID for their process.
> 
> If a device implements a mdev shared with a kernel driver (like IDXD)
> then it will be sharing that PASID pool across both drivers. In this
> case it makes no sense that VFIO has PASID quota logic because it has
> an incomplete view. It could only make sense if VFIO is the exclusive
> owner of the bus/device/function.
> 
> The tracking logic needs to be global.. Most probably in some kind of
> PASID cgroup controller?

AIUI, that doesn't exist yet, so it makes sense that VFIO, as the
mechanism through which a user would allocate a PASID, implements a
reasonable quota to avoid an unprivileged user exhausting the address
space.  Also, "unprivileged user" is a bit of a misnomer in this
context as the VFIO user must be privileged with ownership of a device
before they can even participate in PASID allocation.  Is truly
unprivileged access reasonable for a limited resource?
 
> > know whether an assigned device requires PASIDs such that access to
> > this dev file is provided to QEMU?  
> 
> Wouldn't QEMU just open /dev/sva if it needs it? Like other dev files?
> Why would it need something special?

QEMU typically runs in a sandbox with limited access, when a device or
mdev is assigned to a VM, file permissions are configured to allow that
access.  QEMU doesn't get to poke at any random dev file it likes,
that's part of how userspace reduces the potential attack surface.
 
> > would be an obvious DoS path if any user can create arbitrary
> > allocations.  If we can move code out of VFIO, I'm all for it, but I
> > think it needs to be better defined than "implement magic universal sva
> > uapi interface" before we can really consider it.  Thanks,  
> 
> Jason began by saying VDPA will need this too, I agree with him.
> 
> I'm not sure why it would be "magic"? This series already gives a
> pretty solid blueprint for what the interface would need to
> have. Interested folks need to sit down and talk about it not just
> default everything to being built inside VFIO.

This series is a blueprint within the context of the ownership and
permission model that VFIO already provides.  It doesn't seem like we
can pluck that out on its own, nor is it necessarily the case that VFIO
wouldn't want to provide PASID services within its own API even if we
did have this undefined /dev/sva interface.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 18:23               ` Alex Williamson
@ 2020-09-14 19:00                 ` Jason Gunthorpe
  2020-09-14 22:33                   ` Alex Williamson
       [not found]                   ` <20200914224438.GA65940@otc-nc-03>
  0 siblings, 2 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 19:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, Sep 14, 2020 at 12:23:28PM -0600, Alex Williamson wrote:
> On Mon, 14 Sep 2020 14:41:21 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
> >  
> > > "its own special way" is arguable, VFIO is just making use of what's
> > > being proposed as the uapi via its existing IOMMU interface.  
> > 
> > I mean, if we have a /dev/sva then it makes no sense to extend the
> > VFIO interfaces with the same stuff. VFIO should simply accept a PASID
> > created from /dev/sva and use it just like any other user-DMA driver
> > would.
> 
> I don't think that's absolutely true.  By the same logic, we could say
> that pci-sysfs provides access to PCI BAR and config space
> resources,

No, it is the reverse, VFIO is a better version of pci-sysfs, so
pci-sysfs is the one that is obsoleted by VFIO. Similarly a /dev/sva
would be the superset interface for PASID, so whatver VFIO has would
be obsoleted.

It would be very unusual for the kernel to have to 'preferred'
interfaces for the same thing, IMHO. The review process for uAPI
should really prevent that by allowing all interests to be served
while the uAPI is designed.

> the VFIO device interface duplicates part of that interface therefore it
> should be abandoned.  But in reality, VFIO providing access to those
> resources puts those accesses within the scope and control of the VFIO
> interface.

Not clear to my why VFIO needs that. PASID seems quite orthogonal from
VFIO to me.

> > This has already happened, the SVA patches generally allow unpriv user
> > space to allocate a PASID for their process.
> > 
> > If a device implements a mdev shared with a kernel driver (like IDXD)
> > then it will be sharing that PASID pool across both drivers. In this
> > case it makes no sense that VFIO has PASID quota logic because it has
> > an incomplete view. It could only make sense if VFIO is the exclusive
> > owner of the bus/device/function.
> > 
> > The tracking logic needs to be global.. Most probably in some kind of
> > PASID cgroup controller?
> 
> AIUI, that doesn't exist yet, so it makes sense that VFIO, as the
> mechanism through which a user would allocate a PASID, 

VFIO is not the exclusive user interface for PASID. Other SVA drivers
will allocate PASIDs. Any quota has to be implemented by the IOMMU
layer, and shared across all drivers.

> space.  Also, "unprivileged user" is a bit of a misnomer in this
> context as the VFIO user must be privileged with ownership of a device
> before they can even participate in PASID allocation.  Is truly
> unprivileged access reasonable for a limited resource?

I'm not talking about VFIO, I'm talking about the other SVA drivers. I
expect some of them will be unpriv safe, like IDXD, for
instance.

Some way to manage the limited PASID resource will be necessary beyond
just VFIO.

> QEMU typically runs in a sandbox with limited access, when a device or
> mdev is assigned to a VM, file permissions are configured to allow that
> access.  QEMU doesn't get to poke at any random dev file it likes,
> that's part of how userspace reduces the potential attack surface.

Plumbing the exact same APIs through VFIO's uAPI vs /dev/sva doesn't
reduce the attack surface. qemu can simply include /dev/sva in the
sandbox when using VFIO with no increase in attack surface from this
proposed series.

> This series is a blueprint within the context of the ownership and
> permission model that VFIO already provides.  It doesn't seem like we
> can pluck that out on its own, nor is it necessarily the case that VFIO
> wouldn't want to provide PASID services within its own API even if we
> did have this undefined /dev/sva interface.

I don't see what you do - VFIO does not own PASID, and in this
vfio-mdev mode it does not own the PCI device/IOMMU either. So why
would this need to be part of the VFIO owernship and permission model?

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 19:00                 ` Jason Gunthorpe
@ 2020-09-14 22:33                   ` Alex Williamson
  2020-09-15 14:29                     ` Jason Gunthorpe
       [not found]                   ` <20200914224438.GA65940@otc-nc-03>
  1 sibling, 1 reply; 81+ messages in thread
From: Alex Williamson @ 2020-09-14 22:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, 14 Sep 2020 16:00:57 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Sep 14, 2020 at 12:23:28PM -0600, Alex Williamson wrote:
> > On Mon, 14 Sep 2020 14:41:21 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Sep 14, 2020 at 10:58:57AM -0600, Alex Williamson wrote:
> > >    
> > > > "its own special way" is arguable, VFIO is just making use of what's
> > > > being proposed as the uapi via its existing IOMMU interface.    
> > > 
> > > I mean, if we have a /dev/sva then it makes no sense to extend the
> > > VFIO interfaces with the same stuff. VFIO should simply accept a PASID
> > > created from /dev/sva and use it just like any other user-DMA driver
> > > would.  
> > 
> > I don't think that's absolutely true.  By the same logic, we could say
> > that pci-sysfs provides access to PCI BAR and config space
> > resources,  
> 
> No, it is the reverse, VFIO is a better version of pci-sysfs, so
> pci-sysfs is the one that is obsoleted by VFIO. Similarly a /dev/sva
> would be the superset interface for PASID, so whatver VFIO has would
> be obsoleted.
> 
> It would be very unusual for the kernel to have to 'preferred'
> interfaces for the same thing, IMHO. The review process for uAPI
> should really prevent that by allowing all interests to be served
> while the uAPI is designed.
> 
> > the VFIO device interface duplicates part of that interface therefore it
> > should be abandoned.  But in reality, VFIO providing access to those
> > resources puts those accesses within the scope and control of the VFIO
> > interface.  
> 
> Not clear to my why VFIO needs that. PASID seems quite orthogonal from
> VFIO to me.


Can you explain that further, or spit-ball what you think this /dev/sva
interface looks like and how a user might interact between vfio and
this new interface?  The interface proposed here definitely does not
seem orthogonal to the vfio IOMMU interface, ie. selecting a specific
IOMMU domain mode during vfio setup, allocating pasids and associating
them with page tables for that two-stage IOMMU setup, performing cache
invalidations based on page table updates, etc.  How does it make more
sense for a vIOMMU to setup some aspects of the IOMMU through vfio and
others through a TBD interface?


> > > This has already happened, the SVA patches generally allow unpriv user
> > > space to allocate a PASID for their process.
> > > 
> > > If a device implements a mdev shared with a kernel driver (like IDXD)
> > > then it will be sharing that PASID pool across both drivers. In this
> > > case it makes no sense that VFIO has PASID quota logic because it has
> > > an incomplete view. It could only make sense if VFIO is the exclusive
> > > owner of the bus/device/function.
> > > 
> > > The tracking logic needs to be global.. Most probably in some kind of
> > > PASID cgroup controller?  
> > 
> > AIUI, that doesn't exist yet, so it makes sense that VFIO, as the
> > mechanism through which a user would allocate a PASID,   
> 
> VFIO is not the exclusive user interface for PASID. Other SVA drivers
> will allocate PASIDs. Any quota has to be implemented by the IOMMU
> layer, and shared across all drivers.


The IOMMU needs to allocate PASIDs, so in that sense it enforces a
quota via the architectural limits, but is the IOMMU layer going to
distinguish in-kernel versus user limits?  A cgroup limit seems like a
good idea, but that's not really at the IOMMU layer either and I don't
see that a /dev/sva and vfio interface couldn't both support a cgroup
type quota.

 
> > space.  Also, "unprivileged user" is a bit of a misnomer in this
> > context as the VFIO user must be privileged with ownership of a device
> > before they can even participate in PASID allocation.  Is truly
> > unprivileged access reasonable for a limited resource?  
> 
> I'm not talking about VFIO, I'm talking about the other SVA drivers. I
> expect some of them will be unpriv safe, like IDXD, for
> instance.
> 
> Some way to manage the limited PASID resource will be necessary beyond
> just VFIO.

And it's not clear that they'll have compatible requirements.  A
userspace idxd driver might have limited needs versus a vIOMMU backend.
Does a single quota model adequately support both or are we back to the
differences between access to a device and ownership of a device?
Maybe a single pasid per user makes sense in the former.  If we could
bring this discussion to some sort of more concrete proposal it might
be easier to weigh the choices.
 
> > QEMU typically runs in a sandbox with limited access, when a device or
> > mdev is assigned to a VM, file permissions are configured to allow that
> > access.  QEMU doesn't get to poke at any random dev file it likes,
> > that's part of how userspace reduces the potential attack surface.  
> 
> Plumbing the exact same APIs through VFIO's uAPI vs /dev/sva doesn't
> reduce the attack surface. qemu can simply include /dev/sva in the
> sandbox when using VFIO with no increase in attack surface from this
> proposed series.

APIs confined to the ownership model that vfio already enforces might
absolutely present a more limited attack surface than some new
interface intended to provide universal sva resource access.  We don't
know until we see it.  The real argument would be whether we have a
more hardened interface due to more review from more users.

 
> > This series is a blueprint within the context of the ownership and
> > permission model that VFIO already provides.  It doesn't seem like we
> > can pluck that out on its own, nor is it necessarily the case that VFIO
> > wouldn't want to provide PASID services within its own API even if we
> > did have this undefined /dev/sva interface.  
> 
> I don't see what you do - VFIO does not own PASID, and in this
> vfio-mdev mode it does not own the PCI device/IOMMU either. So why
> would this need to be part of the VFIO owernship and permission model?

Doesn't the PASID model essentially just augment the requester ID IOMMU
model so as to manage the IOVAs for a subdevice of a RID?  The vfio
model builds on a user's access to a vfio group to entitle them to
allocate IOMMU resources, or in this case PASIDs.  What elevates a user
to be able to allocate such resources in this new proposal?  Do they
need a device at all?  It's not clear to me why RID based IOMMU
management fits within vfio's scope, but PASID based does not.  Seems
like that would chip away at aux domains in general.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 04/16] vfio: Add PASID allocation/free support
  2020-09-11 20:54   ` Alex Williamson
@ 2020-09-15  4:03     ` Liu, Yi L
  0 siblings, 0 replies; 81+ messages in thread
From: Liu, Yi L @ 2020-09-15  4:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, baolu.lu, joro, Tian, Kevin, jacob.jun.pan, Raj,
	Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, jasowang,
	Wu, Hao, stefanha, iommu, kvm

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, September 12, 2020 4:55 AM
> 
> On Thu, 10 Sep 2020 03:45:21 -0700
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
> > multiple process virtual address spaces with the device for simplified
> > programming model. PASID is used to tag an virtual address space in
> > DMA requests and to identify the related translation structure in
> > IOMMU. When a PASID-capable device is assigned to a VM, we want the
> > same capability of using PASID to tag guest process virtual address
> > spaces to achieve virtual SVA (vSVA).
> >
> > PASID management for guest is vendor specific. Some vendors (e.g.
> > Intel
> > VT-d) requires system-wide managed PASIDs across all devices,
> > regardless of whether a device is used by host or assigned to guest.
> > Other vendors (e.g. ARM SMMU) may allow PASIDs managed per-device thus
> > could be fully delegated to the guest for assigned devices.
> >
> > For system-wide managed PASIDs, this patch introduces a vfio module to
> > handle explicit PASID alloc/free requests from guest. Allocated PASIDs
> > are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
> > object is introduced to track mm_struct. Multiple VFIO containers
> > within a process share the same vfio_mm object.
> >
> > A quota mechanism is provided to prevent malicious user from
> > exhausting available PASIDs. Currently the quota is a global parameter
> > applied to all VFIO devices. In the future per-device quota might be supported too.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Cc: Joerg Roedel <joro@8bytes.org>
> > Cc: Lu Baolu <baolu.lu@linux.intel.com>
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > ---
> > v6 -> v7:
> > *) remove "#include <linux/eventfd.h>" and add r-b from Eric Auger.
> >
> > v5 -> v6:
> > *) address comments from Eric. Add vfio_unlink_pasid() to be consistent
> >    with vfio_unlink_dma(). Add a comment in vfio_pasid_exit().
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) address the comments from Alex on the pasid free range support. Added
> >    per vfio_mm pasid r-b tree.
> >    https://lore.kernel.org/kvm/20200709082751.320742ab@x1.home/
> >
> > v3 -> v4:
> > *) fix lock leam in vfio_mm_get_from_task()
> > *) drop pasid_quota field in struct vfio_mm
> > *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when
> > !CONFIG_VFIO_PASID
> >
> > v1 -> v2:
> > *) added in v2, split from the pasid alloc/free support of v1
> > ---
> >  drivers/vfio/Kconfig      |   5 +
> >  drivers/vfio/Makefile     |   1 +
> >  drivers/vfio/vfio_pasid.c | 247
> ++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h      |  28 ++++++
> >  4 files changed, 281 insertions(+)
> >  create mode 100644 drivers/vfio/vfio_pasid.c
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > fd17db9..3d8a108 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -19,6 +19,11 @@ config VFIO_VIRQFD
> >  	depends on VFIO && EVENTFD
> >  	default n
> >
> > +config VFIO_PASID
> > +	tristate
> > +	depends on IOASID && VFIO
> > +	default n
> > +
> >  menuconfig VFIO
> >  	tristate "VFIO Non-Privileged userspace driver framework"
> >  	depends on IOMMU_API
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
> > de67c47..bb836a3 100644
> > --- a/drivers/vfio/Makefile
> > +++ b/drivers/vfio/Makefile
> > @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
> >
> >  obj-$(CONFIG_VFIO) += vfio.o
> >  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> > +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
> >  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git
> > a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c new file mode
> > 100644 index 0000000..44ecdd5
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_pasid.c
> > @@ -0,0 +1,247 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Intel Corporation.
> > + *     Author: Liu Yi L <yi.l.liu@intel.com>
> > + *
> > + */
> > +
> > +#include <linux/vfio.h>
> > +#include <linux/file.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/sched/mm.h>
> > +
> > +#define DRIVER_VERSION  "0.1"
> > +#define DRIVER_AUTHOR   "Liu Yi L <yi.l.liu@intel.com>"
> > +#define DRIVER_DESC     "PASID management for VFIO bus drivers"
> > +
> > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> 
> I'm not sure we really need a macro to define this since it's only used once, but a
> comment discussing the basis for this default value would be useful.

yep, may remove the macro. 1000 is actually a value come from an offline
discussion with Jacob. And was first mentioned in below link. Since we
don't have much data to decide a default quota today, so we'd like to
make 1000 be default quota as a start. In future we would give administrator
the ability to tune the quota.

https://lore.kernel.org/kvm/A2975661238FB949B60364EF0F2C25743A0F8CB4@SHSMSX104.ccr.corp.intel.com/

> Also, since
> Matthew Rosato is finding it necessary to expose the available DMA mapping
> counter to userspace, is this also a limitation that userspace might be
> interested in knowing such that we should plumb it through an IOMMU info
> capability?

agreed. it would be helpful. I'll add it.

> > +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > +module_param_named(pasid_quota, pasid_quota, uint, 0444);
> > +MODULE_PARM_DESC(pasid_quota,
> > +		 "Set the quota for max number of PASIDs that an application is
> > +allowed to request (default 1000)");
> > +
> > +struct vfio_mm_token {
> > +	unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > +	struct kref		kref;
> > +	struct ioasid_set	*ioasid_set;
> > +	struct mutex		pasid_lock;
> > +	struct rb_root		pasid_list;
> > +	struct list_head	next;
> > +	struct vfio_mm_token	token;
> > +};
> > +
> > +static struct mutex		vfio_mm_lock;
> > +static struct list_head		vfio_mm_list;
> > +
> > +struct vfio_pasid {
> > +	struct rb_node		node;
> > +	ioasid_t		pasid;
> > +};
> > +
> > +static void vfio_remove_all_pasids(struct vfio_mm *vmm);
> > +
> > +/* called with vfio.vfio_mm_lock held */
> 
> 
> s/vfio.//

got it. thanks for spotting it.

> 
> 
> > +static void vfio_mm_release(struct kref *kref) {
> > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > +	list_del(&vmm->next);
> > +	mutex_unlock(&vfio_mm_lock);
> > +	vfio_remove_all_pasids(vmm);
> > +	ioasid_set_put(vmm->ioasid_set);//FIXME: should vfio_pasid get ioasid_set
> after allocation?
> 
> 
> Is the question whether each pasid should hold a reference to the set?

no, I was considering whether vfio_pasid needs to hold a reference on
the ioasid_set. But after checking ioasid_alloc_set(), the answer is
"no" since a successful ioasid_alloc_set() calling will atomically
increase the refcnt of the returned set. So no need to take another
reference in vfio_pasid.

> That really seems like a question internal to the ioasid_alloc/free, but
> this FIXME needs to be resolved.

I should have removed it before sending it out. sorry for the confusion. :-(

> 
> 
> > +	kfree(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm) {
> > +	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio_mm_lock); }
> > +
> > +static void vfio_mm_get(struct vfio_mm *vmm) {
> > +	kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) {
> > +	struct mm_struct *mm = get_task_mm(task);
> > +	struct vfio_mm *vmm;
> > +	unsigned long long val = (unsigned long long)mm;
> > +	int ret;
> > +
> > +	mutex_lock(&vfio_mm_lock);
> > +	/* Search existing vfio_mm with current mm pointer */
> > +	list_for_each_entry(vmm, &vfio_mm_list, next) {
> > +		if (vmm->token.val == val) {
> > +			vfio_mm_get(vmm);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > +	if (!vmm) {
> > +		vmm = ERR_PTR(-ENOMEM);
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * IOASID core provides a 'IOASID set' concept to track all
> > +	 * PASIDs associated with a token. Here we use mm_struct as
> > +	 * the token and create a IOASID set per mm_struct. All the
> > +	 * containers of the process share the same IOASID set.
> > +	 */
> > +	vmm->ioasid_set = ioasid_alloc_set(mm, pasid_quota,
> IOASID_SET_TYPE_MM);
> > +	if (IS_ERR(vmm->ioasid_set)) {
> > +		ret = PTR_ERR(vmm->ioasid_set);
> > +		kfree(vmm);
> > +		vmm = ERR_PTR(ret);
> > +		goto out;
> 
> This would be a little less convoluted if we had a separate variable to store
> ioasid_set so that we could free vmm without stashing the error in a temporary
> variable.  Or at least make the stash more obvious by defining the stash variable as
> something like "tmp" within the scope of this branch.

I see. also the "ret" is not necessary as only used only once. so it would
be like below:

tmp = ioasid_alloc_set(mm, pasid_quota, IOASID_SET_TYPE_MM);
if (IS_ERR(tmp)) {
	kfree(vmm);
	vmm = ERR_PTR(ret);
	goto out;
}

> > +	}
> > +
> > +	kref_init(&vmm->kref);
> > +	vmm->token.val = val;
> > +	mutex_init(&vmm->pasid_lock);
> > +	vmm->pasid_list = RB_ROOT;
> > +
> > +	list_add(&vmm->next, &vfio_mm_list);
> > +out:
> > +	mutex_unlock(&vfio_mm_lock);
> > +	mmput(mm);
> > +	return vmm;
> > +}
> > +
> > +/*
> > + * Find PASID within @min and @max
> > + */
> > +static struct vfio_pasid *vfio_find_pasid(struct vfio_mm *vmm,
> > +					  ioasid_t min, ioasid_t max)
> > +{
> > +	struct rb_node *node = vmm->pasid_list.rb_node;
> > +
> > +	while (node) {
> > +		struct vfio_pasid *vid = rb_entry(node,
> > +						struct vfio_pasid, node);
> > +
> > +		if (max < vid->pasid)
> > +			node = node->rb_left;
> > +		else if (min > vid->pasid)
> > +			node = node->rb_right;
> > +		else
> > +			return vid;
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +static void vfio_link_pasid(struct vfio_mm *vmm, struct vfio_pasid
> > +*new) {
> > +	struct rb_node **link = &vmm->pasid_list.rb_node, *parent = NULL;
> > +	struct vfio_pasid *vid;
> > +
> > +	while (*link) {
> > +		parent = *link;
> > +		vid = rb_entry(parent, struct vfio_pasid, node);
> > +
> > +		if (new->pasid <= vid->pasid)
> > +			link = &(*link)->rb_left;
> > +		else
> > +			link = &(*link)->rb_right;
> > +	}
> > +
> > +	rb_link_node(&new->node, parent, link);
> > +	rb_insert_color(&new->node, &vmm->pasid_list); }
> > +
> > +static void vfio_unlink_pasid(struct vfio_mm *vmm, struct vfio_pasid
> > +*old) {
> > +	rb_erase(&old->node, &vmm->pasid_list); }
> > +
> > +static void vfio_remove_pasid(struct vfio_mm *vmm, struct vfio_pasid
> > +*vid) {
> > +	vfio_unlink_pasid(vmm, vid);
> > +	ioasid_free(vmm->ioasid_set, vid->pasid);
> > +	kfree(vid);
> > +}
> > +
> > +static void vfio_remove_all_pasids(struct vfio_mm *vmm) {
> > +	struct rb_node *node;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +	while ((node = rb_first(&vmm->pasid_list)))
> > +		vfio_remove_pasid(vmm, rb_entry(node, struct vfio_pasid, node));
> > +	mutex_unlock(&vmm->pasid_lock);
> > +}
> > +
> > +int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> 
> I might have asked before, but why doesn't this return an ioasid_t and require
> ioasid_t args?  Our free function below uses an ioasid_t range, seems rather
> inconsistent. 

yep, will fix it.

> We can use a BUILD_BUG_ON if we need to test that an ioasid_t fits
> within our uapi.

perhaps not. vfio_pasid_alloc() should return INVALID_IOASID for allocation
failure. vfio_iommu_type1 should check the return value of vfio_pasid_alloc()
and return a proper result to userspace.

> 
> > +{
> > +	ioasid_t pasid;
> > +	struct vfio_pasid *vid;
> > +
> > +	pasid = ioasid_alloc(vmm->ioasid_set, min, max, NULL);
> > +	if (pasid == INVALID_IOASID)
> > +		return -ENOSPC;
> > +
> > +	vid = kzalloc(sizeof(*vid), GFP_KERNEL);
> > +	if (!vid) {
> > +		ioasid_free(vmm->ioasid_set, pasid);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	vid->pasid = pasid;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +	vfio_link_pasid(vmm, vid);
> > +	mutex_unlock(&vmm->pasid_lock);
> > +
> > +	return pasid;
> > +}
> > +
> > +void vfio_pasid_free_range(struct vfio_mm *vmm,
> > +			   ioasid_t min, ioasid_t max)
> > +{
> > +	struct vfio_pasid *vid = NULL;
> > +
> > +	/*
> > +	 * IOASID core will notify PASID users (e.g. IOMMU driver) to
> > +	 * teardown necessary structures depending on the to-be-freed
> > +	 * PASID.
> > +	 */
> > +	mutex_lock(&vmm->pasid_lock);
> > +	while ((vid = vfio_find_pasid(vmm, min, max)) != NULL)
> 
> != NULL is not necessary and isn't consistent with the same time of test in the above
> rb_first() loop.

got it. thanks for the guiding.

> 
> > +		vfio_remove_pasid(vmm, vid);
> > +	mutex_unlock(&vmm->pasid_lock);
> > +}
> > +
> > +static int __init vfio_pasid_init(void) {
> > +	mutex_init(&vfio_mm_lock);
> > +	INIT_LIST_HEAD(&vfio_mm_list);
> > +	return 0;
> > +}
> > +
> > +static void __exit vfio_pasid_exit(void) {
> > +	/*
> > +	 * VFIO_PASID is supposed to be referenced by VFIO_IOMMU_TYPE1
> > +	 * and may be other module. once vfio_pasid_exit() is triggered,
> > +	 * that means its user (e.g. VFIO_IOMMU_TYPE1) has been removed.
> > +	 * All the vfio_mm instances should have been released. If not,
> > +	 * means there is vfio_mm leak, should be a bug of user module.
> > +	 * So just warn here.
> > +	 */
> > +	WARN_ON(!list_empty(&vfio_mm_list));
> 
> Do we need to be using try_module_get/module_put to enforce that we cannot be
> removed while in use or does that already work correctly via the function references
> and this is just paranoia?  If we do exit, I'm not sure what good it does to keep the
> remaining list entries.  Thanks,

I did a test before, and it's true the module dependency is enforced
via function references and cannot remove a module before removing
other modules which have referred its function. BTW., for the WARN_ON,
I referred the handling against vfio.group_list in vfio_cleanup(). :-)

Regards,
Yi Liu

> Alex
> 
> > +}
> > +
> > +module_init(vfio_pasid_init);
> > +module_exit(vfio_pasid_exit);
> > +
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(DRIVER_DESC);
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h index
> > 38d3c6a..31472a9 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -97,6 +97,34 @@ extern int vfio_register_iommu_driver(const struct
> > vfio_iommu_driver_ops *ops);  extern void vfio_unregister_iommu_driver(
> >  				const struct vfio_iommu_driver_ops *ops);
> >
> > +struct vfio_mm;
> > +#if IS_ENABLED(CONFIG_VFIO_PASID)
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> > +*task); extern void vfio_mm_put(struct vfio_mm *vmm); extern int
> > +vfio_pasid_alloc(struct vfio_mm *vmm, int min, int max); extern void
> > +vfio_pasid_free_range(struct vfio_mm *vmm,
> > +				  ioasid_t min, ioasid_t max);
> > +#else
> > +static inline struct vfio_mm *vfio_mm_get_from_task(struct
> > +task_struct *task) {
> > +	return ERR_PTR(-ENOTTY);
> > +}
> > +
> > +static inline void vfio_mm_put(struct vfio_mm *vmm) { }
> > +
> > +static inline int vfio_pasid_alloc(struct vfio_mm *vmm, int min, int
> > +max) {
> > +	return -ENOTTY;
> > +}
> > +
> > +static inline void vfio_pasid_free_range(struct vfio_mm *vmm,
> > +					  ioasid_t min, ioasid_t max)
> > +{
> > +}
> > +#endif /* CONFIG_VFIO_PASID */
> > +
> >  /*
> >   * External user API
> >   */


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
       [not found]                   ` <20200914224438.GA65940@otc-nc-03>
@ 2020-09-15 11:33                     ` Jason Gunthorpe
  2020-09-15 18:11                       ` Raj, Ashok
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 11:33 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, Sep 14, 2020 at 03:44:38PM -0700, Raj, Ashok wrote:
> Hi Jason,
> 
> I thought we discussed this at LPC, but still seems to be going in
> circles :-(.

We discused mdev at LPC, not PASID.

PASID applies widely to many device and needs to be introduced with a
wide community agreement so all scenarios will be supportable.

> As you had suggested earlier in the mail thread could Jason Wang maybe
> build out what it takes to have a full fledged /dev/sva interface for vDPA
> and figure out how the interfaces should emerge? otherwise it appears
> everyone is talking very high level and with that limited understanding of
> how things work at the moment. 

You want Jason Wang to do the work to get Intel PASID support merged?
Seems a bit of strange request.

> This has to move ahead of these email discussions, hoping somone with the
> right ideas would help move this forward.

Why not try yourself to come up with a proposal?

Jason 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 22:33                   ` Alex Williamson
@ 2020-09-15 14:29                     ` Jason Gunthorpe
  2020-09-16  1:19                       ` Tian, Kevin
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 14:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, Sep 14, 2020 at 04:33:10PM -0600, Alex Williamson wrote:

> Can you explain that further, or spit-ball what you think this /dev/sva
> interface looks like and how a user might interact between vfio and
> this new interface? 

When you open it you get some container, inside the container the
user can create PASIDs. PASIDs outside that container cannot be
reached.

Creating a PASID, or the guest PASID range would be the entry point
for doing all the operations against a PASID or range that this patch
series imagines:
 - Map process VA mappings to the PASID's DMA virtual address space
 - Catch faults
 - Setup any special HW stuff like Intel's two level thing, ARM stuff, etc
 - Expose resource controls, cgroup, whatever
 - Migration special stuff (allocate fixed PASIDs)

A PASID is a handle for an IOMMU page table, and the tools to
manipulate it. Within /dev/sva the page table is just 'floating' and
not linked to any PCI functions

The open /dev/sva FD holding the allocated PASIDs would be passed to a
kernel driver. This is a security authorization that the specified
PASID can be assigned to a PCI device by the kernel.

At this point the kernel driver would have the IOMMU permit its
bus/device/function to use the PASID. The PASID can be passed to
multiple drivers of any driver flavour so table re-use is
possible. Now the IOMMU page table is linked to a device.

The kernel device driver would also do the device specific programming
to setup the PASID in the device, attach it to some device object and
expose the device for user DMA.

For instance IDXD's char dev would map the queue memory and associate
the PASID with that queue and setup the HW to be ready for the new
enque instruction. The IDXD mdev would link to its emulated PCI BAR
and ensure the guest can only use PASID's included in the /dev/sva
container.

The qemu control plane for vIOMMU related to PASID would run over
/dev/sva.

I think the design could go further where a 'PASID' is just an
abstract idea of a page table, then vfio-pci could consume it too as a
IOMMU page table handle even though there is no actual PASID. So qemu
could end up with one API to universally control the vIOMMU, an API
that can be shared between subsystems and is not tied to VFIO.

> allocating pasids and associating them with page tables for that
> two-stage IOMMU setup, performing cache invalidations based on page
> table updates, etc.  How does it make more sense for a vIOMMU to
> setup some aspects of the IOMMU through vfio and others through a
> TBD interface?

vfio's IOMMU interface is about RID based full device ownership,
and fixed mappings.

PASID is about mediation, shared ownership and page faulting.

Does PASID overlap with the existing IOMMU RID interface beyond both
are using the IOMMU?

> The IOMMU needs to allocate PASIDs, so in that sense it enforces a
> quota via the architectural limits, but is the IOMMU layer going to
> distinguish in-kernel versus user limits?  A cgroup limit seems like a
> good idea, but that's not really at the IOMMU layer either and I don't
> see that a /dev/sva and vfio interface couldn't both support a cgroup
> type quota.

It is all good questions. PASID is new, this stuff needs to be
sketched out more. A lot of in-kernel users of IOMMU PASID are
probably going to be triggered by userspace actions.

I think a cgroup quota would end up near the IOMMU layer, so vfio,
sva, and any other driver char devs would all be restricted by the
cgroup as peers.

> And it's not clear that they'll have compatible requirements.  A
> userspace idxd driver might have limited needs versus a vIOMMU backend.
> Does a single quota model adequately support both or are we back to the
> differences between access to a device and ownership of a device?

At the end of the day a PASID is just a number and the drivers only
use of it is to program it into HW.

All these other differences deal with the IOMMU side of the PASID, how
pages are mapped into it, how page fault works, etc, etc. Keeping the
two concerns seperated seems very clean. A device driver shouldn't
care how the PASID is setup.

> > > This series is a blueprint within the context of the ownership and
> > > permission model that VFIO already provides.  It doesn't seem like we
> > > can pluck that out on its own, nor is it necessarily the case that VFIO
> > > wouldn't want to provide PASID services within its own API even if we
> > > did have this undefined /dev/sva interface.  
> > 
> > I don't see what you do - VFIO does not own PASID, and in this
> > vfio-mdev mode it does not own the PCI device/IOMMU either. So why
> > would this need to be part of the VFIO owernship and permission model?
> 
> Doesn't the PASID model essentially just augment the requester ID IOMMU
> model so as to manage the IOVAs for a subdevice of a RID?  

I'd say not really.. PASID is very different from RID because PASID
must always be mediated by the kernel. vfio-pci doesn't know how to
use PASID because it doesn't know how to program the PASID into
a specific device. While RID is fully self contained with vfio-pci.

Further, with the SVA models, the mediated devices are highly likely
to be shared between a vfio-mdev and a normal driver, as IDXD
shows. Userspace will get PASID's for SVA and share the device equally
with vfio-mdev.

> What elevates a user to be able to allocate such resources in this
> new proposal?

AFAIK the target for the current SVA model is no limitation. User
processes can open their devices, establish SVA and go ahead with
their workload.

If you are asking about iommu groups.. For PASID the PCI
bus/device/function that is the 'control point' for PASID must be
secure and owned by the kernel. ie only the kernel can progam the
device to use a given PASID. P2P access from other devices under
non-kernel control must not be allowed, as they could program a device
to use a PASID the kernel would not authorize.

All of this has to be done regardless of VFIO's involvement..

> Do they need a device at all?  It's not clear to me why RID based
> IOMMU management fits within vfio's scope, but PASID based does not.

In RID mode vfio-pci completely owns the PCI function, so it is more
natural that VFIO, as the sole device owner, would own the DMA mapping
machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
so there is not much reason to try and disaggregate the API.

PASID on the other hand, is shared. vfio-mdev drivers will share the
device with other kernel drivers. PASID and DMA will be concurrent
with VFIO and other kernel drivers/etc.

Thus it makes more sense here to have the control plane for PASID also
be shared and not tied exclusively to VFIO.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 11:33                     ` Jason Gunthorpe
@ 2020-09-15 18:11                       ` Raj, Ashok
  2020-09-15 18:45                         ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Raj, Ashok @ 2020-09-15 18:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Ashok Raj, Jacon Jun Pan

On Tue, Sep 15, 2020 at 08:33:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 03:44:38PM -0700, Raj, Ashok wrote:
> > Hi Jason,
> > 
> > I thought we discussed this at LPC, but still seems to be going in
> > circles :-(.
> 
> We discused mdev at LPC, not PASID.
> 
> PASID applies widely to many device and needs to be introduced with a
> wide community agreement so all scenarios will be supportable.

True, reading some of the earlier replies I was clearly confused as I
thought you were talking about mdev again. But now that you stay it, you
have moved past mdev and its the PASID interfaces correct?

> 
> > As you had suggested earlier in the mail thread could Jason Wang maybe
> > build out what it takes to have a full fledged /dev/sva interface for vDPA
> > and figure out how the interfaces should emerge? otherwise it appears
> > everyone is talking very high level and with that limited understanding of
> > how things work at the moment. 
> 
> You want Jason Wang to do the work to get Intel PASID support merged?
> Seems a bit of strange request.

I was reading mdev in my head. Not PASID, sorry.

For the native user applications have just 1 PASID per process. There is no
need for a quota management. VFIO being the one used for guest where there
is more PASID's per guest is where this is enforced today. 

IIUC, you are asking that part of the interface to move to a API interface
that potentially the new /dev/sva and VFIO could share? I think the API's
for PASID management themselves are generic (Jean's patchset + Jacob's
ioasid set management). 

Possibly what you need is already available, but not in a specific way that
you expect maybe? 

Let me check with Jacob and let him/Jean pick that up.

Cheers,
Ashok


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 18:11                       ` Raj, Ashok
@ 2020-09-15 18:45                         ` Jason Gunthorpe
  2020-09-15 19:26                           ` Raj, Ashok
  2020-09-15 22:08                           ` Jacob Pan
  0 siblings, 2 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 18:45 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Jacon Jun Pan

On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > PASID applies widely to many device and needs to be introduced with a
> > wide community agreement so all scenarios will be supportable.
> 
> True, reading some of the earlier replies I was clearly confused as I
> thought you were talking about mdev again. But now that you stay it, you
> have moved past mdev and its the PASID interfaces correct?

Yes, we agreed mdev for IDXD at LPC, didn't talk about PASID.

> For the native user applications have just 1 PASID per
> process. There is no need for a quota management.

Yes, there is. There is a limited pool of HW PASID's. If one user fork
bombs it can easially claim an unreasonable number from that pool as
each process will claim a PASID. That can DOS the rest of the system.

If PASID DOS is a worry then it must be solved at the IOMMU level for
all user applications that might trigger a PASID allocation. VFIO is
not special.

> IIUC, you are asking that part of the interface to move to a API interface
> that potentially the new /dev/sva and VFIO could share? I think the API's
> for PASID management themselves are generic (Jean's patchset + Jacob's
> ioasid set management).

Yes, the in kernel APIs are pretty generic now, and can be used by
many types of drivers.

As JasonW kicked this off, VDPA will need all this identical stuff
too. We already know this, and I think Intel VDPA HW will need it, so
it should concern you too :)

A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
reasonable starting point for discussion.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 18:45                         ` Jason Gunthorpe
@ 2020-09-15 19:26                           ` Raj, Ashok
  2020-09-15 23:45                             ` Jason Gunthorpe
  2020-09-16  2:33                             ` Jason Wang
  2020-09-15 22:08                           ` Jacob Pan
  1 sibling, 2 replies; 81+ messages in thread
From: Raj, Ashok @ 2020-09-15 19:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Jacon Jun Pan, Ashok Raj

On Tue, Sep 15, 2020 at 03:45:10PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > > PASID applies widely to many device and needs to be introduced with a
> > > wide community agreement so all scenarios will be supportable.
> > 
> > True, reading some of the earlier replies I was clearly confused as I
> > thought you were talking about mdev again. But now that you stay it, you
> > have moved past mdev and its the PASID interfaces correct?
> 
> Yes, we agreed mdev for IDXD at LPC, didn't talk about PASID.
> 
> > For the native user applications have just 1 PASID per
> > process. There is no need for a quota management.
> 
> Yes, there is. There is a limited pool of HW PASID's. If one user fork
> bombs it can easially claim an unreasonable number from that pool as
> each process will claim a PASID. That can DOS the rest of the system.

Not sure how you had this played out.. For PASID used in ENQCMD today for
our SVM usages, we *DO* not automatically propagate or allocate new PASIDs. 

The new process needs to bind to get a PASID for its own use. For threads
of same process the PASID is inherited. For forks(), we do not
auto-allocate them. Since PASID isn't a sharable resource much like how you
would not pass mmio mmap's to forked processes that cannot be shared correct?
Such as your doorbell space for e.g. 

> 
> If PASID DOS is a worry then it must be solved at the IOMMU level for
> all user applications that might trigger a PASID allocation. VFIO is
> not special.

Feels like you can simply avoid the PASID DOS rather than permit it to
happen. 
> 
> > IIUC, you are asking that part of the interface to move to a API interface
> > that potentially the new /dev/sva and VFIO could share? I think the API's
> > for PASID management themselves are generic (Jean's patchset + Jacob's
> > ioasid set management).
> 
> Yes, the in kernel APIs are pretty generic now, and can be used by
> many types of drivers.

Good, so there is no new requirements here I suppose.
> 
> As JasonW kicked this off, VDPA will need all this identical stuff
> too. We already know this, and I think Intel VDPA HW will need it, so
> it should concern you too :)

This is one of those things that I would disagree and commit :-).. 

> 
> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> reasonable starting point for discussion.

Looks like now we are getting closer to what we need. :-)

Given that PASID api's are general purpose today and any driver can use it
to take advantage. VFIO fortunately or unfortunately has the IOMMU things
abstracted. I suppose that support is also mostly built on top of the
generic iommu* api abstractions in a vendor neutral way? 

I'm still lost on what is missing that vDPA can't build on top of what is
available?

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 18:45                         ` Jason Gunthorpe
  2020-09-15 19:26                           ` Raj, Ashok
@ 2020-09-15 22:08                           ` Jacob Pan
  2020-09-15 23:51                             ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Jacob Pan @ 2020-09-15 22:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Jason Wang,
	Liu Yi L, eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Jacon Jun Pan, jacob.jun.pan

Hi Jason,

On Tue, 15 Sep 2020 15:45:10 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Sep 15, 2020 at 11:11:54AM -0700, Raj, Ashok wrote:
> > > PASID applies widely to many device and needs to be introduced with a
> > > wide community agreement so all scenarios will be supportable.  
> > 
> > True, reading some of the earlier replies I was clearly confused as I
> > thought you were talking about mdev again. But now that you stay it, you
> > have moved past mdev and its the PASID interfaces correct?  
> 
> Yes, we agreed mdev for IDXD at LPC, didn't talk about PASID.
> 
> > For the native user applications have just 1 PASID per
> > process. There is no need for a quota management.  
> 
> Yes, there is. There is a limited pool of HW PASID's. If one user fork
> bombs it can easially claim an unreasonable number from that pool as
> each process will claim a PASID. That can DOS the rest of the system.
> 
> If PASID DOS is a worry then it must be solved at the IOMMU level for
> all user applications that might trigger a PASID allocation. VFIO is
> not special.
> 
> > IIUC, you are asking that part of the interface to move to a API
> > interface that potentially the new /dev/sva and VFIO could share? I
> > think the API's for PASID management themselves are generic (Jean's
> > patchset + Jacob's ioasid set management).  
> 
> Yes, the in kernel APIs are pretty generic now, and can be used by
> many types of drivers.
> 
Right, IOMMU UAPIs are not VFIO specific, we pass user pointer to the IOMMU
layer to process.

Similarly for PASID management, the IOASID extensions we are proposing
will handle ioasid_set (groups/pools), quota, permissions, and notifications
in the IOASID core. There is nothing VFIO specific.
https://lkml.org/lkml/2020/8/22/12

> As JasonW kicked this off, VDPA will need all this identical stuff
> too. We already know this, and I think Intel VDPA HW will need it, so
> it should concern you too :)
> 
> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> reasonable starting point for discussion.
> 
I am not sure what can really be consolidated in /dev/sva. VFIO and VDPA
will have their own kerne-user interfaces anyway for their usage models.
They are just providing the specific transport while sharing generic IOMMU
UAPIs and IOASID management.

As I mentioned PASID management is already consolidated in the IOASID layer,
so for VDPA or other users, it just matter of create its own ioasid_set,
doing allocation.

IOASID is also available to the in-kernel users which does not
need /dev/sva AFAICT. For bare metal SVA, I don't see a need to create this
'floating' state of the PASID when created by /dev/sva. PASID allocation
could happen behind the scene when users need to bind page tables to a
device DMA stream. Security authorization of the PASID is natively enforced
when user try to bind page table, there is no need to pass the FD handle of
the PASID back to the kernel as you suggested earlier.

Thanks,

Jacob

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 19:26                           ` Raj, Ashok
@ 2020-09-15 23:45                             ` Jason Gunthorpe
  2020-09-16  2:33                             ` Jason Wang
  1 sibling, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 23:45 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jacob.jun.pan,
	jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Jacon Jun Pan

On Tue, Sep 15, 2020 at 12:26:32PM -0700, Raj, Ashok wrote:

> > Yes, there is. There is a limited pool of HW PASID's. If one user fork
> > bombs it can easially claim an unreasonable number from that pool as
> > each process will claim a PASID. That can DOS the rest of the system.
> 
> Not sure how you had this played out.. For PASID used in ENQCMD today for
> our SVM usages, we *DO* not automatically propagate or allocate new PASIDs. 
> 
> The new process needs to bind to get a PASID for its own use. For threads
> of same process the PASID is inherited. For forks(), we do not
> auto-allocate them.

Auto-allocate doesn't matter, the PASID is tied to the mm_struct,
after fork the program will get a new mm_struct, and it can manually
re-trigger PASID allocation for that mm_struct from any SVA kernel
driver.

64k processes, each with their own mm_struct, all triggering SVA, will
allocate 64k PASID's and use up the whole 16 bit space.

> Given that PASID api's are general purpose today and any driver can use it
> to take advantage. VFIO fortunately or unfortunately has the IOMMU things
> abstracted. I suppose that support is also mostly built on top of the
> generic iommu* api abstractions in a vendor neutral way? 
> 
> I'm still lost on what is missing that vDPA can't build on top of what is
> available?

I think it is basically everything in this patch.. Why duplicate all
this uAPI?

Jason 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 22:08                           ` Jacob Pan
@ 2020-09-15 23:51                             ` Jason Gunthorpe
       [not found]                               ` <20200915171319.00003f59@linux.intel.com>
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 23:51 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Jason Wang,
	Liu Yi L, eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, Jacon Jun Pan

On Tue, Sep 15, 2020 at 03:08:51PM -0700, Jacob Pan wrote:
> > A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
> > control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
> > reasonable starting point for discussion.
> 
> I am not sure what can really be consolidated in /dev/sva. 

More or less, everything in this patch. All the manipulations of PASID
that are required for vIOMMU use case/etc. Basically all PASID control
that is not just a 1:1 mapping of the mm_struct.

> will have their own kerne-user interfaces anyway for their usage models.
> They are just providing the specific transport while sharing generic IOMMU
> UAPIs and IOASID management.

> As I mentioned PASID management is already consolidated in the IOASID layer,
> so for VDPA or other users, it just matter of create its own ioasid_set,
> doing allocation.

Creating the PASID is not the problem, managing what the PASID maps to
is the issue. That is all uAPI that we don't really have today.

> IOASID is also available to the in-kernel users which does not
> need /dev/sva AFAICT. For bare metal SVA, I don't see a need to create this
> 'floating' state of the PASID when created by /dev/sva. PASID allocation
> could happen behind the scene when users need to bind page tables to a
> device DMA stream.

My point is I would like to see one set of uAPI ioctls to bind page
tables. I don't want to have VFIO, VDPA, etc, etc uAPIs to do the exact
same things only slightly differently.

If user space wants to bind page tables, create the PASID with
/dev/sva, use ioctls there to setup the page table the way it wants,
then pass the now configured PASID to a driver that can use it. 

Driver does not do page table binding. Do not duplicate all the
control plane uAPI in every driver.

PASID managment and binding is seperated from the driver(s) that are
using the PASID.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 14:29                     ` Jason Gunthorpe
@ 2020-09-16  1:19                       ` Tian, Kevin
  2020-09-16  8:32                         ` Jean-Philippe Brucker
  2020-09-16 14:44                         ` Jason Gunthorpe
  0 siblings, 2 replies; 81+ messages in thread
From: Tian, Kevin @ 2020-09-16  1:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Raj, Ashok, Jean-Philippe Brucker, Jason Wang, Liu, Yi L,
	eric.auger, baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun,
	Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, September 15, 2020 10:29 PM
>
> > Do they need a device at all?  It's not clear to me why RID based
> > IOMMU management fits within vfio's scope, but PASID based does not.
> 
> In RID mode vfio-pci completely owns the PCI function, so it is more
> natural that VFIO, as the sole device owner, would own the DMA mapping
> machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> so there is not much reason to try and disaggregate the API.

It is also used by vDPA.

> 
> PASID on the other hand, is shared. vfio-mdev drivers will share the
> device with other kernel drivers. PASID and DMA will be concurrent
> with VFIO and other kernel drivers/etc.
> 

Looks you are equating PASID to host-side sharing, while ignoring 
another valid usage that a PASID-capable device is passed through
to the guest through vfio-pci and then PASID is used by the guest 
for guest-side sharing. In such case, it is an exclusive usage in host
side and then what is the problem for VFIO to manage PASID given
that vfio-pci completely owns the function?

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
       [not found]                               ` <20200915171319.00003f59@linux.intel.com>
@ 2020-09-16  1:46                                 ` Lu Baolu
  2020-09-16 15:07                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 81+ messages in thread
From: Lu Baolu @ 2020-09-16  1:46 UTC (permalink / raw)
  To: Jacob Pan (Jun), Jason Gunthorpe
  Cc: baolu.lu, Raj, Ashok, Alex Williamson, Jean-Philippe Brucker,
	Jason Wang, Liu Yi L, eric.auger, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, jacob.jun.pan

On 9/16/20 8:22 AM, Jacob Pan (Jun) wrote:
>> If user space wants to bind page tables, create the PASID with
>> /dev/sva, use ioctls there to setup the page table the way it wants,
>> then pass the now configured PASID to a driver that can use it.
>>
> Are we talking about bare metal SVA? If so, I don't see the need for
> userspace to know there is a PASID. All user space need is that my
> current mm is bound to a device by the driver. So it can be a one-step
> process for user instead of two.
> 
>> Driver does not do page table binding. Do not duplicate all the
>> control plane uAPI in every driver.
>>
>> PASID managment and binding is seperated from the driver(s) that are
>> using the PASID.
>>
> Why separate? Drivers need to be involved in PASID life cycle
> management. For example, when tearing down a PASID, the driver needs to
> stop DMA, IOMMU driver needs to unbind, etc. If driver is the control
> point, then things are just in order. I am referring to bare metal SVA.
> 
> For guest SVA, I agree that binding is separate from PASID allocation.
> Could you review this doc. in terms of life cycle?
> https://lkml.org/lkml/2020/8/22/13
> 
> My point is that /dev/sda has no value for bare metal SVA, we are just
> talking about if guest SVA UAPIs can be consolidated. Or am I missing
> something?
> 

Not only bare metal SVA, but also subdevice passthrough (Intel Scalable
IOV and ARM SubStream ID) also consumes PASID which has nothing to do
with user space, hence the /dev/sva is unsuited.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-14 13:31   ` Jean-Philippe Brucker
  2020-09-14 13:47     ` Jason Gunthorpe
@ 2020-09-16  2:29     ` Jason Wang
  1 sibling, 0 replies; 81+ messages in thread
From: Jason Wang @ 2020-09-16  2:29 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Liu Yi L, alex.williamson, eric.auger, baolu.lu, joro,
	kevin.tian, jacob.jun.pan, ashok.raj, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin


On 2020/9/14 下午9:31, Jean-Philippe Brucker wrote:
>> If it's possible, I would suggest a generic uAPI instead of a VFIO specific
>> one.
> A large part of this work is already generic uAPI, in
> include/uapi/linux/iommu.h.


This is not what I read from this series, all the following uAPI is VFIO 
specific:

struct vfio_iommu_type1_nesting_op;
struct vfio_iommu_type1_pasid_request;

And include/uapi/linux/iommu.h is not included in 
include/uapi/linux/vfio.h at all.



> This patchset connects that generic interface
> to the pre-existing VFIO uAPI that deals with IOMMU mappings of an
> assigned device. But the bulk of the work is done by the IOMMU subsystem,
> and is available to all device drivers.


So any reason not introducing the uAPI to IOMMU drivers directly?


>
>> Jason suggest something like /dev/sva. There will be a lot of other
>> subsystems that could benefit from this (e.g vDPA).
> Do you have a more precise idea of the interface /dev/sva would provide,
> how it would interact with VFIO and others?


Can we replace the container fd with sva fd like:

sva = open("/dev/sva", O_RDWR);
group = open("/dev/vfio/26", O_RDWR);
ioctl(group, VFIO_GROUP_SET_SVA, &sva);

Then we can do all SVA stuffs through sva fd, and for other subsystems 
(like vDPA) it only need to implement the function that is equivalent to 
VFIO_GROUP_SET_SVA.


>    vDPA could transport the
> generic iommu.h structures via its own uAPI, and call the IOMMU API
> directly without going through an intermediate /dev/sva handle.


Any value for those transporting? I think we have agreed that VFIO is 
not the only user for vSVA ...

It's not hard to forecast that there would be more subsystems that want 
to benefit from vSVA, we don't want to duplicate the similar uAPIs in 
all of those subsystems.

Thanks


>
> Thanks,
> Jean
>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-15 19:26                           ` Raj, Ashok
  2020-09-15 23:45                             ` Jason Gunthorpe
@ 2020-09-16  2:33                             ` Jason Wang
  1 sibling, 0 replies; 81+ messages in thread
From: Jason Wang @ 2020-09-16  2:33 UTC (permalink / raw)
  To: Raj, Ashok, Jason Gunthorpe
  Cc: Alex Williamson, Jean-Philippe Brucker, Liu Yi L, eric.auger,
	baolu.lu, joro, kevin.tian, jacob.jun.pan, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Michael S. Tsirkin,
	Jacon Jun Pan


On 2020/9/16 上午3:26, Raj, Ashok wrote:
>>> IIUC, you are asking that part of the interface to move to a API interface
>>> that potentially the new /dev/sva and VFIO could share? I think the API's
>>> for PASID management themselves are generic (Jean's patchset + Jacob's
>>> ioasid set management).
>> Yes, the in kernel APIs are pretty generic now, and can be used by
>> many types of drivers.
> Good, so there is no new requirements here I suppose.


The requirement is not for in-kernel APIs but a generic uAPIs.


>> As JasonW kicked this off, VDPA will need all this identical stuff
>> too. We already know this, and I think Intel VDPA HW will need it, so
>> it should concern you too:)
> This is one of those things that I would disagree and commit :-)..
>
>> A PASID vIOMMU solution sharable with VDPA and VFIO, based on a PASID
>> control char dev (eg /dev/sva, or maybe /dev/iommu) seems like a
>> reasonable starting point for discussion.
> Looks like now we are getting closer to what we need.:-)
>
> Given that PASID api's are general purpose today and any driver can use it
> to take advantage. VFIO fortunately or unfortunately has the IOMMU things
> abstracted. I suppose that support is also mostly built on top of the
> generic iommu* api abstractions in a vendor neutral way?
>
> I'm still lost on what is missing that vDPA can't build on top of what is
> available?


For sure it can, but we may end up duplicated (or similar) uAPIs which 
is bad.

Thanks


>
> Cheers,
> Ashok
>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16  1:19                       ` Tian, Kevin
@ 2020-09-16  8:32                         ` Jean-Philippe Brucker
  2020-09-16 14:51                           ` Jason Gunthorpe
  2020-09-16 14:44                         ` Jason Gunthorpe
  1 sibling, 1 reply; 81+ messages in thread
From: Jean-Philippe Brucker @ 2020-09-16  8:32 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, Alex Williamson, Raj, Ashok, Jason Wang, Liu,
	Yi L, eric.auger, baolu.lu, joro, jacob.jun.pan, Tian, Jun J,
	Sun, Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Wed, Sep 16, 2020 at 01:19:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, September 15, 2020 10:29 PM
> >
> > > Do they need a device at all?  It's not clear to me why RID based
> > > IOMMU management fits within vfio's scope, but PASID based does not.
> > 
> > In RID mode vfio-pci completely owns the PCI function, so it is more
> > natural that VFIO, as the sole device owner, would own the DMA mapping
> > machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> > so there is not much reason to try and disaggregate the API.
> 
> It is also used by vDPA.
> 
> > 
> > PASID on the other hand, is shared. vfio-mdev drivers will share the
> > device with other kernel drivers. PASID and DMA will be concurrent
> > with VFIO and other kernel drivers/etc.
> > 
> 
> Looks you are equating PASID to host-side sharing, while ignoring 
> another valid usage that a PASID-capable device is passed through
> to the guest through vfio-pci and then PASID is used by the guest 
> for guest-side sharing. In such case, it is an exclusive usage in host
> side and then what is the problem for VFIO to manage PASID given
> that vfio-pci completely owns the function?

And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe):
the PASID space of a PCI function cannot be shared between host and guest,
so we assign the whole PASID table along with the RID. Since we need the
BIND, INVALIDATE, and report APIs introduced here to support nested
translation, a /dev/sva interface would need to support this mode as well.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16  1:19                       ` Tian, Kevin
  2020-09-16  8:32                         ` Jean-Philippe Brucker
@ 2020-09-16 14:44                         ` Jason Gunthorpe
  2020-09-17  6:01                           ` Tian, Kevin
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 14:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, Jason Wang,
	Liu, Yi L, eric.auger, baolu.lu, joro, jacob.jun.pan, Tian,
	Jun J, Sun, Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Wed, Sep 16, 2020 at 01:19:18AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, September 15, 2020 10:29 PM
> >
> > > Do they need a device at all?  It's not clear to me why RID based
> > > IOMMU management fits within vfio's scope, but PASID based does not.
> > 
> > In RID mode vfio-pci completely owns the PCI function, so it is more
> > natural that VFIO, as the sole device owner, would own the DMA mapping
> > machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> > so there is not much reason to try and disaggregate the API.
> 
> It is also used by vDPA.

A driver in VDPA, not VDPA itself.

> > PASID on the other hand, is shared. vfio-mdev drivers will share the
> > device with other kernel drivers. PASID and DMA will be concurrent
> > with VFIO and other kernel drivers/etc.
> 
> Looks you are equating PASID to host-side sharing, while ignoring 
> another valid usage that a PASID-capable device is passed through
> to the guest through vfio-pci and then PASID is used by the guest
> for guest-side sharing. In such case, it is an exclusive usage in host
> side and then what is the problem for VFIO to manage PASID given
> that vfio-pci completely owns the function?

This is no different than vfio-pci being yet another client to
/dev/sva

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16  8:32                         ` Jean-Philippe Brucker
@ 2020-09-16 14:51                           ` Jason Gunthorpe
  2020-09-16 16:20                             ` Jean-Philippe Brucker
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 14:51 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jason Wang, Liu, Yi L,
	eric.auger, baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun,
	Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin

On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
> And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe):
> the PASID space of a PCI function cannot be shared between host and guest,
> so we assign the whole PASID table along with the RID. Since we need the
> BIND, INVALIDATE, and report APIs introduced here to support nested
> translation, a /dev/sva interface would need to support this mode as well.

Well, that means this HW cannot support PASID capable 'SIOV' style
devices in guests.

I admit whole function PASID delegation might be something vfio-pci
should handle - but only if it really doesn't fit in some /dev/sva
after we cover the other PASID cases.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
       [not found]                               ` <20200915171319.00003f59@linux.intel.com>
  2020-09-16  1:46                                 ` Lu Baolu
@ 2020-09-16 15:07                                 ` Jason Gunthorpe
  2020-09-16 16:33                                   ` Raj, Ashok
  1 sibling, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 15:07 UTC (permalink / raw)
  To: Jacob Pan (Jun)
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Jason Wang,
	Liu Yi L, eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, jacob.jun.pan

On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:
> > If user space wants to bind page tables, create the PASID with
> > /dev/sva, use ioctls there to setup the page table the way it wants,
> > then pass the now configured PASID to a driver that can use it. 
> 
> Are we talking about bare metal SVA? 

What a weird term.

> If so, I don't see the need for userspace to know there is a
> PASID. All user space need is that my current mm is bound to a
> device by the driver. So it can be a one-step process for user
> instead of two.

You've missed the entire point of the conversation, VDPA already needs
more than "my current mm is bound to a device"

> > PASID managment and binding is seperated from the driver(s) that are
> > using the PASID.
> 
> Why separate? Drivers need to be involved in PASID life cycle
> management. For example, when tearing down a PASID, the driver needs to
> stop DMA, IOMMU driver needs to unbind, etc. If driver is the control
> point, then things are just in order. I am referring to bare metal SVA.

Drivers can be involved and still have the uAPIs seperate. It isn't hard.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 14:51                           ` Jason Gunthorpe
@ 2020-09-16 16:20                             ` Jean-Philippe Brucker
  2020-09-16 16:32                               ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Jean-Philippe Brucker @ 2020-09-16 16:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jason Wang, Liu, Yi L,
	eric.auger, baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun,
	Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin

On Wed, Sep 16, 2020 at 11:51:48AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
> > And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe):
> > the PASID space of a PCI function cannot be shared between host and guest,
> > so we assign the whole PASID table along with the RID. Since we need the
> > BIND, INVALIDATE, and report APIs introduced here to support nested
> > translation, a /dev/sva interface would need to support this mode as well.
> 
> Well, that means this HW cannot support PASID capable 'SIOV' style
> devices in guests.

It does not yet support Intel SIOV, no. It does support the standards,
though: PCI SR-IOV to partition a device and PASIDs in a guest.

> I admit whole function PASID delegation might be something vfio-pci
> should handle - but only if it really doesn't fit in some /dev/sva
> after we cover the other PASID cases.

Wouldn't that be the duplication you're trying to avoid?  A second channel
for bind, invalidate, capability and fault reporting mechanisms?  If we
extract SVA parts of vfio_iommu_type1 into a separate chardev, PASID table
pass-through [1] will have to use that.

Thanks,
Jean

[1] https://lore.kernel.org/linux-iommu/20200320161911.27494-1-eric.auger@redhat.com/

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 16:20                             ` Jean-Philippe Brucker
@ 2020-09-16 16:32                               ` Jason Gunthorpe
  2020-09-16 16:50                                 ` Auger Eric
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 16:32 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jason Wang, Liu, Yi L,
	eric.auger, baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun,
	Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin

On Wed, Sep 16, 2020 at 06:20:52PM +0200, Jean-Philippe Brucker wrote:
> On Wed, Sep 16, 2020 at 11:51:48AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
> > > And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe):
> > > the PASID space of a PCI function cannot be shared between host and guest,
> > > so we assign the whole PASID table along with the RID. Since we need the
> > > BIND, INVALIDATE, and report APIs introduced here to support nested
> > > translation, a /dev/sva interface would need to support this mode as well.
> > 
> > Well, that means this HW cannot support PASID capable 'SIOV' style
> > devices in guests.
> 
> It does not yet support Intel SIOV, no. It does support the standards,
> though: PCI SR-IOV to partition a device and PASIDs in a guest.

SIOV is basically standards based, it is better thought of as a
cookbook on how to use PASID and IOMMU together.

> > I admit whole function PASID delegation might be something vfio-pci
> > should handle - but only if it really doesn't fit in some /dev/sva
> > after we cover the other PASID cases.
> 
> Wouldn't that be the duplication you're trying to avoid?  A second
> channel for bind, invalidate, capability and fault reporting
> mechanisms?

Yes, which is why it seems like it would be nicer to avoid it. Why I
said "might" :)

> If we extract SVA parts of vfio_iommu_type1 into a separate chardev,
> PASID table pass-through [1] will have to use that.

Yes, '/dev/sva' (which is a terrible name) would want to be the uAPI
entry point for controlling the vIOMMU related to PASID.

Does anything in the [1] series have tight coupling to VFIO other than
needing to know a bus/device/function? It looks like it is mostly
exposing iommu_* functions as uAPI?

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 15:07                                 ` Jason Gunthorpe
@ 2020-09-16 16:33                                   ` Raj, Ashok
  2020-09-16 17:01                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Raj, Ashok @ 2020-09-16 16:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jacob Pan (Jun),
	Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Michael S. Tsirkin,
	jacob.jun.pan, Ashok Raj

On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:
> > > If user space wants to bind page tables, create the PASID with
> > > /dev/sva, use ioctls there to setup the page table the way it wants,
> > > then pass the now configured PASID to a driver that can use it. 
> > 
> > Are we talking about bare metal SVA? 
> 
> What a weird term.

Glad you noticed it at v7 :-) 

Any suggestions on something less weird than 
Shared Virtual Addressing? There is a reason why we moved from SVM to SVA.
> 
> > If so, I don't see the need for userspace to know there is a
> > PASID. All user space need is that my current mm is bound to a
> > device by the driver. So it can be a one-step process for user
> > instead of two.
> 
> You've missed the entire point of the conversation, VDPA already needs
> more than "my current mm is bound to a device"

You mean current version of vDPA? or a potential future version of vDPA?

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 16:32                               ` Jason Gunthorpe
@ 2020-09-16 16:50                                 ` Auger Eric
  0 siblings, 0 replies; 81+ messages in thread
From: Auger Eric @ 2020-09-16 16:50 UTC (permalink / raw)
  To: Jason Gunthorpe, Jean-Philippe Brucker
  Cc: Tian, Kevin, Alex Williamson, Raj, Ashok, Jason Wang, Liu, Yi L,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y, peterx,
	Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin

Hi,
On 9/16/20 6:32 PM, Jason Gunthorpe wrote:
> On Wed, Sep 16, 2020 at 06:20:52PM +0200, Jean-Philippe Brucker wrote:
>> On Wed, Sep 16, 2020 at 11:51:48AM -0300, Jason Gunthorpe wrote:
>>> On Wed, Sep 16, 2020 at 10:32:17AM +0200, Jean-Philippe Brucker wrote:
>>>> And this is the only PASID model for Arm SMMU (and AMD IOMMU, I believe):
>>>> the PASID space of a PCI function cannot be shared between host and guest,
>>>> so we assign the whole PASID table along with the RID. Since we need the
>>>> BIND, INVALIDATE, and report APIs introduced here to support nested
>>>> translation, a /dev/sva interface would need to support this mode as well.
>>>
>>> Well, that means this HW cannot support PASID capable 'SIOV' style
>>> devices in guests.
>>
>> It does not yet support Intel SIOV, no. It does support the standards,
>> though: PCI SR-IOV to partition a device and PASIDs in a guest.
> 
> SIOV is basically standards based, it is better thought of as a
> cookbook on how to use PASID and IOMMU together.
> 
>>> I admit whole function PASID delegation might be something vfio-pci
>>> should handle - but only if it really doesn't fit in some /dev/sva
>>> after we cover the other PASID cases.
>>
>> Wouldn't that be the duplication you're trying to avoid?  A second
>> channel for bind, invalidate, capability and fault reporting
>> mechanisms?
> 
> Yes, which is why it seems like it would be nicer to avoid it. Why I
> said "might" :)
> 
>> If we extract SVA parts of vfio_iommu_type1 into a separate chardev,
>> PASID table pass-through [1] will have to use that.
> 
> Yes, '/dev/sva' (which is a terrible name) would want to be the uAPI
> entry point for controlling the vIOMMU related to PASID.
> 
> Does anything in the [1] series have tight coupling to VFIO other than
> needing to know a bus/device/function? It looks like it is mostly
> exposing iommu_* functions as uAPI?

this series does not use any PASID so it fits quite nicely into the VFIO
framework I think. Besides cache invalidation that takes the struct
device, other operations (MSI binding and PASID table passing operate on
the iommu domain). Also we use the VFIO memory region and
interrupt/eventfd registration mechanism to return faults.

Thanks

Eric
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 16:33                                   ` Raj, Ashok
@ 2020-09-16 17:01                                     ` Jason Gunthorpe
  2020-09-16 18:21                                       ` Jacob Pan (Jun)
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 17:01 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Jacob Pan (Jun),
	Alex Williamson, Jean-Philippe Brucker, Jason Wang, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Michael S. Tsirkin,
	jacob.jun.pan

On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:
> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:
> > > > If user space wants to bind page tables, create the PASID with
> > > > /dev/sva, use ioctls there to setup the page table the way it wants,
> > > > then pass the now configured PASID to a driver that can use it. 
> > > 
> > > Are we talking about bare metal SVA? 
> > 
> > What a weird term.
> 
> Glad you noticed it at v7 :-) 
> 
> Any suggestions on something less weird than 
> Shared Virtual Addressing? There is a reason why we moved from SVM
> to SVA.

SVA is fine, what is "bare metal" supposed to mean?

PASID is about constructing an arbitary DMA IOVA map for PCI-E
devices, being able to intercept device DMA faults, etc.

SVA is doing DMA IOVA 1:1 with the mm_struct CPU VA. DMA faults
trigger the same thing as CPU page faults. If is it not 1:1 then there
is no "shared". When SVA is done using PCI-E PASID it is "PASID for
SVA". Lots of existing devices already have SVA without PASID or
IOMMU, so lets not muddy the terminology.

vPASID/vIOMMU is allowing a guest to control the DMA IOVA map and
manipulate the PASIDs.

vSVA is when a guest uses a vPASID to provide SVA, not sure this is
an informative term.

This particular patch series seems to be about vPASID/vIOMMU for vfio-mdev
vs the other vPASID/vIOMMU patch which was about vPASID for vfio-pci.

> > > If so, I don't see the need for userspace to know there is a
> > > PASID. All user space need is that my current mm is bound to a
> > > device by the driver. So it can be a one-step process for user
> > > instead of two.
> > 
> > You've missed the entire point of the conversation, VDPA already needs
> > more than "my current mm is bound to a device"
> 
> You mean current version of vDPA? or a potential future version of vDPA?

Future VDPA drivers, it was made clear this was important to Intel
during the argument about VDPA as a mdev.

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 17:01                                     ` Jason Gunthorpe
@ 2020-09-16 18:21                                       ` Jacob Pan (Jun)
  2020-09-16 18:38                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 81+ messages in thread
From: Jacob Pan (Jun) @ 2020-09-16 18:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Jason Wang,
	Liu Yi L, eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, jacob.jun.pan, jacob.jun.pan

Hi Jason,
On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com>
wrote:

> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:
> > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote:  
> > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:  
> > > > > If user space wants to bind page tables, create the PASID with
> > > > > /dev/sva, use ioctls there to setup the page table the way it
> > > > > wants, then pass the now configured PASID to a driver that
> > > > > can use it.   
> > > > 
> > > > Are we talking about bare metal SVA?   
> > > 
> > > What a weird term.  
> > 
> > Glad you noticed it at v7 :-) 
> > 
> > Any suggestions on something less weird than 
> > Shared Virtual Addressing? There is a reason why we moved from SVM
> > to SVA.  
> 
> SVA is fine, what is "bare metal" supposed to mean?
> 
What I meant here is sharing virtual address between DMA and host
process. This requires devices perform DMA request with PASID and use
IOMMU first level/stage 1 page tables.
This can be further divided into 1) user SVA 2) supervisor SVA (sharing
init_mm)

My point is that /dev/sva is not useful here since the driver can
perform PASID allocation while doing SVA bind.

> PASID is about constructing an arbitary DMA IOVA map for PCI-E
> devices, being able to intercept device DMA faults, etc.
> 
An arbitrary IOVA map does not need PASID. In IOVA, you do map/unmap
explicitly, why you need to handle IO page fault?

To me, PASID identifies an address space that is associated with a
mm_struct.

> SVA is doing DMA IOVA 1:1 with the mm_struct CPU VA. DMA faults
> trigger the same thing as CPU page faults. If is it not 1:1 then there
> is no "shared". When SVA is done using PCI-E PASID it is "PASID for
> SVA". Lots of existing devices already have SVA without PASID or
> IOMMU, so lets not muddy the terminology.
> 
I agree. This conversation is about "PASID for SVA" not "SVA without
PASID"


> vPASID/vIOMMU is allowing a guest to control the DMA IOVA map and
> manipulate the PASIDs.
> 
> vSVA is when a guest uses a vPASID to provide SVA, not sure this is
> an informative term.
> 
I agree.

> This particular patch series seems to be about vPASID/vIOMMU for
> vfio-mdev vs the other vPASID/vIOMMU patch which was about vPASID for
> vfio-pci.
> 
Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will be
introduced later.

> > > > If so, I don't see the need for userspace to know there is a
> > > > PASID. All user space need is that my current mm is bound to a
> > > > device by the driver. So it can be a one-step process for user
> > > > instead of two.  
> > > 
> > > You've missed the entire point of the conversation, VDPA already
> > > needs more than "my current mm is bound to a device"  
> > 
> > You mean current version of vDPA? or a potential future version of
> > vDPA?  
> 
> Future VDPA drivers, it was made clear this was important to Intel
> during the argument about VDPA as a mdev.
> 
> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 18:21                                       ` Jacob Pan (Jun)
@ 2020-09-16 18:38                                         ` Jason Gunthorpe
  2020-09-16 23:09                                           ` Jacob Pan (Jun)
  0 siblings, 1 reply; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 18:38 UTC (permalink / raw)
  To: Jacob Pan (Jun)
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Jason Wang,
	Liu Yi L, eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, jacob.jun.pan

On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:
> Hi Jason,
> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com>
> wrote:
> 
> > On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:
> > > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe wrote:  
> > > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun) wrote:  
> > > > > > If user space wants to bind page tables, create the PASID with
> > > > > > /dev/sva, use ioctls there to setup the page table the way it
> > > > > > wants, then pass the now configured PASID to a driver that
> > > > > > can use it.   
> > > > > 
> > > > > Are we talking about bare metal SVA?   
> > > > 
> > > > What a weird term.  
> > > 
> > > Glad you noticed it at v7 :-) 
> > > 
> > > Any suggestions on something less weird than 
> > > Shared Virtual Addressing? There is a reason why we moved from SVM
> > > to SVA.  
> > 
> > SVA is fine, what is "bare metal" supposed to mean?
> > 
> What I meant here is sharing virtual address between DMA and host
> process. This requires devices perform DMA request with PASID and use
> IOMMU first level/stage 1 page tables.
> This can be further divided into 1) user SVA 2) supervisor SVA (sharing
> init_mm)
> 
> My point is that /dev/sva is not useful here since the driver can
> perform PASID allocation while doing SVA bind.

No, you are thinking too small.

Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA.

When VDPA is used by DPDK it makes sense that the PASID will be SVA and
1:1 with the mm_struct.

When VDPA is used by qemu it makes sense that the PASID will be an
arbitary IOVA map constructed to be 1:1 with the guest vCPU physical
map. /dev/sva allows a single uAPI to do this kind of setup, and qemu
can support it while supporting a range of SVA kernel drivers. VDPA
and vfio-mdev are obvious initial targets.

*BOTH* are needed.

In general any uAPI for PASID should have the option to use either the
mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually
nothing to implement this in the driver as PASID is just a number, and
gives so much more flexability.

> Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will be
> introduced later.

Last patch is:

  vfio/type1: Add vSVA support for IOMMU-backed mdevs

So pretty hard to see how this is not about vfio-mdev, at least a
little..

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 18:38                                         ` Jason Gunthorpe
@ 2020-09-16 23:09                                           ` Jacob Pan (Jun)
  2020-09-17  3:53                                             ` Jason Wang
  0 siblings, 1 reply; 81+ messages in thread
From: Jacob Pan (Jun) @ 2020-09-16 23:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Jason Wang,
	Liu Yi L, eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian,
	yi.y.sun, peterx, hao.wu, stefanha, iommu, kvm,
	Michael S. Tsirkin, jacob.jun.pan, jacob.jun.pan

Hi Jason,
On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com>
wrote:

> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:
> > Hi Jason,
> > On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote:
> >   
> > > On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:  
> > > > On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe
> > > > wrote:    
> > > > > On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun)
> > > > > wrote:    
> > > > > > > If user space wants to bind page tables, create the PASID
> > > > > > > with /dev/sva, use ioctls there to setup the page table
> > > > > > > the way it wants, then pass the now configured PASID to a
> > > > > > > driver that can use it.     
> > > > > > 
> > > > > > Are we talking about bare metal SVA?     
> > > > > 
> > > > > What a weird term.    
> > > > 
> > > > Glad you noticed it at v7 :-) 
> > > > 
> > > > Any suggestions on something less weird than 
> > > > Shared Virtual Addressing? There is a reason why we moved from
> > > > SVM to SVA.    
> > > 
> > > SVA is fine, what is "bare metal" supposed to mean?
> > >   
> > What I meant here is sharing virtual address between DMA and host
> > process. This requires devices perform DMA request with PASID and
> > use IOMMU first level/stage 1 page tables.
> > This can be further divided into 1) user SVA 2) supervisor SVA
> > (sharing init_mm)
> > 
> > My point is that /dev/sva is not useful here since the driver can
> > perform PASID allocation while doing SVA bind.  
> 
> No, you are thinking too small.
> 
> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA.
> 
Could you point to me the SVA UAPI? I couldn't find it in the mainline.
Seems VDPA uses VHOST interface?

> When VDPA is used by DPDK it makes sense that the PASID will be SVA
> and 1:1 with the mm_struct.
> 
I still don't see why bare metal DPDK needs to get a handle of the
PASID. Perhaps the SVA patch would explain. Or are you talking about
vDPA DPDK process that is used to support virtio-net-pmd in the guest?

> When VDPA is used by qemu it makes sense that the PASID will be an
> arbitary IOVA map constructed to be 1:1 with the guest vCPU physical
> map. /dev/sva allows a single uAPI to do this kind of setup, and qemu
> can support it while supporting a range of SVA kernel drivers. VDPA
> and vfio-mdev are obvious initial targets.
> 
> *BOTH* are needed.
> 
> In general any uAPI for PASID should have the option to use either the
> mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually
> nothing to implement this in the driver as PASID is just a number, and
> gives so much more flexability.
> 
Not really nothing in terms of PASID life cycles. For example, if user
uses uacce interface to open an accelerator, it gets an FD_acc. Then it
opens /dev/sva to allocate PASID then get another FD_pasid. Then we
pass FD_pasid to the driver to bind page tables, perhaps multiple
drivers. Now we have to worry about If FD_pasid gets closed before
FD_acc(s) closed and all these race conditions.

If we do not expose FD_pasid to the user, the teardown is much simpler
and streamlined. Following each FD_acc close, PASID unbind is performed.

> > Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will
> > be introduced later.  
> 
> Last patch is:
> 
>   vfio/type1: Add vSVA support for IOMMU-backed mdevs
> 
> So pretty hard to see how this is not about vfio-mdev, at least a
> little..
> 
> Jason


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 23:09                                           ` Jacob Pan (Jun)
@ 2020-09-17  3:53                                             ` Jason Wang
  2020-09-17 17:31                                               ` Jason Gunthorpe
  2020-09-17 18:17                                               ` Jacob Pan (Jun)
  0 siblings, 2 replies; 81+ messages in thread
From: Jason Wang @ 2020-09-17  3:53 UTC (permalink / raw)
  To: Jacob Pan (Jun), Jason Gunthorpe
  Cc: Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Michael S. Tsirkin,
	jacob.jun.pan


On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote:
> Hi Jason,
> On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com>
> wrote:
>
>> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:
>>> Hi Jason,
>>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe <jgg@nvidia.com>
>>> wrote:
>>>    
>>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:
>>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe
>>>>> wrote:
>>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun)
>>>>>> wrote:
>>>>>>>> If user space wants to bind page tables, create the PASID
>>>>>>>> with /dev/sva, use ioctls there to setup the page table
>>>>>>>> the way it wants, then pass the now configured PASID to a
>>>>>>>> driver that can use it.
>>>>>>> Are we talking about bare metal SVA?
>>>>>> What a weird term.
>>>>> Glad you noticed it at v7 :-)
>>>>>
>>>>> Any suggestions on something less weird than
>>>>> Shared Virtual Addressing? There is a reason why we moved from
>>>>> SVM to SVA.
>>>> SVA is fine, what is "bare metal" supposed to mean?
>>>>    
>>> What I meant here is sharing virtual address between DMA and host
>>> process. This requires devices perform DMA request with PASID and
>>> use IOMMU first level/stage 1 page tables.
>>> This can be further divided into 1) user SVA 2) supervisor SVA
>>> (sharing init_mm)
>>>
>>> My point is that /dev/sva is not useful here since the driver can
>>> perform PASID allocation while doing SVA bind.
>> No, you are thinking too small.
>>
>> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the SVA.
>>
> Could you point to me the SVA UAPI? I couldn't find it in the mainline.
> Seems VDPA uses VHOST interface?


It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h.


>
>> When VDPA is used by DPDK it makes sense that the PASID will be SVA
>> and 1:1 with the mm_struct.
>>
> I still don't see why bare metal DPDK needs to get a handle of the
> PASID.


My understanding is that it may:

- have a unified uAPI with vSVA: alloc, bind, unbind, free
- leave the binding policy to userspace instead of the using a implied 
one in the kenrel


> Perhaps the SVA patch would explain. Or are you talking about
> vDPA DPDK process that is used to support virtio-net-pmd in the guest?
>
>> When VDPA is used by qemu it makes sense that the PASID will be an
>> arbitary IOVA map constructed to be 1:1 with the guest vCPU physical
>> map. /dev/sva allows a single uAPI to do this kind of setup, and qemu
>> can support it while supporting a range of SVA kernel drivers. VDPA
>> and vfio-mdev are obvious initial targets.
>>
>> *BOTH* are needed.
>>
>> In general any uAPI for PASID should have the option to use either the
>> mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually
>> nothing to implement this in the driver as PASID is just a number, and
>> gives so much more flexability.
>>
> Not really nothing in terms of PASID life cycles. For example, if user
> uses uacce interface to open an accelerator, it gets an FD_acc. Then it
> opens /dev/sva to allocate PASID then get another FD_pasid. Then we
> pass FD_pasid to the driver to bind page tables, perhaps multiple
> drivers. Now we have to worry about If FD_pasid gets closed before
> FD_acc(s) closed and all these race conditions.


I'm not sure I understand this. But this demonstrates the flexibility of 
an unified uAPI. E.g it allows vDPA and VFIO device to use the same 
PAISD which can be shared with a process in the guest.

For the race condition, it could be probably solved with refcnt.

Thanks


>
> If we do not expose FD_pasid to the user, the teardown is much simpler
> and streamlined. Following each FD_acc close, PASID unbind is performed.
>
>>> Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev will
>>> be introduced later.
>> Last patch is:
>>
>>    vfio/type1: Add vSVA support for IOMMU-backed mdevs
>>
>> So pretty hard to see how this is not about vfio-mdev, at least a
>> little..
>>
>> Jason
>
> Thanks,
>
> Jacob
>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-16 14:44                         ` Jason Gunthorpe
@ 2020-09-17  6:01                           ` Tian, Kevin
  0 siblings, 0 replies; 81+ messages in thread
From: Tian, Kevin @ 2020-09-17  6:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Raj, Ashok, Jean-Philippe Brucker, Jason Wang,
	Liu, Yi L, eric.auger, baolu.lu, joro, jacob.jun.pan, Tian,
	Jun J, Sun, Yi Y, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 16, 2020 10:45 PM
> 
> On Wed, Sep 16, 2020 at 01:19:18AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, September 15, 2020 10:29 PM
> > >
> > > > Do they need a device at all?  It's not clear to me why RID based
> > > > IOMMU management fits within vfio's scope, but PASID based does not.
> > >
> > > In RID mode vfio-pci completely owns the PCI function, so it is more
> > > natural that VFIO, as the sole device owner, would own the DMA
> mapping
> > > machinery. Further, the RID IOMMU mode is rarely used outside of VFIO
> > > so there is not much reason to try and disaggregate the API.
> >
> > It is also used by vDPA.
> 
> A driver in VDPA, not VDPA itself.

what is the difference? It is still the example of using RID IOMMU mode
outside of VFIO (and just implies that vDPA even doesn't do a good
abstraction internally).

> 
> > > PASID on the other hand, is shared. vfio-mdev drivers will share the
> > > device with other kernel drivers. PASID and DMA will be concurrent
> > > with VFIO and other kernel drivers/etc.
> >
> > Looks you are equating PASID to host-side sharing, while ignoring
> > another valid usage that a PASID-capable device is passed through
> > to the guest through vfio-pci and then PASID is used by the guest
> > for guest-side sharing. In such case, it is an exclusive usage in host
> > side and then what is the problem for VFIO to manage PASID given
> > that vfio-pci completely owns the function?
> 
> This is no different than vfio-pci being yet another client to
> /dev/sva
> 

My comment was to echo Alex's question about "why RID based
IOMMU management fits within vfio's scope, but PASID based 
does not". and when talking about generalization we should look
bigger beyond sva. What really matters here is the iommu_domain
which is about everything related to DMA mapping. The domain
associated with a passthru device is marked as "unmanaged" in 
kernel and allows userspace to manage DMA mapping of this 
device through a set of iommu_ops:

- alloc/free domain;
- attach/detach device/subdevice;
- map/unmap a memory region;
- bind/unbind page table and invalidate iommu cache;
- ... (and lots of other callbacks)

map/unmap or bind/unbind are just different ways of managing
DMAs in an iommu domain. The passthrough framework (VFIO 
or VDPA) has been providing its uAPI to manage every aspect of 
iommu_domain so far, and sva is just a natural extension following 
this design. If we really want to generalize something, it needs to 
be /dev/iommu as an unified interface for managing every aspect 
of iommu_domain. Asking SVA abstraction alone just causes 
unnecessary mess to both kernel (sync domain/device association 
between /dev/vfio and /dev/sva) and userspace (talk to two 
interfaces even for same vfio-pci device). Then it sounds like more 
like a bandaid for saving development effort in VDPA (which 
instead should go proposing /dev/iommu when it was invented 
instead of reinventing its own bits until such effort is unaffordable 
and then ask for partial abstraction to fix its gap).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-17  3:53                                             ` Jason Wang
@ 2020-09-17 17:31                                               ` Jason Gunthorpe
  2020-09-17 18:17                                               ` Jacob Pan (Jun)
  1 sibling, 0 replies; 81+ messages in thread
From: Jason Gunthorpe @ 2020-09-17 17:31 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jacob Pan (Jun),
	Raj, Ashok, Alex Williamson, Jean-Philippe Brucker, Liu Yi L,
	eric.auger, baolu.lu, joro, kevin.tian, jun.j.tian, yi.y.sun,
	peterx, hao.wu, stefanha, iommu, kvm, Michael S. Tsirkin,
	jacob.jun.pan

On Thu, Sep 17, 2020 at 11:53:49AM +0800, Jason Wang wrote:
> > > When VDPA is used by qemu it makes sense that the PASID will be an
> > > arbitary IOVA map constructed to be 1:1 with the guest vCPU physical
> > > map. /dev/sva allows a single uAPI to do this kind of setup, and qemu
> > > can support it while supporting a range of SVA kernel drivers. VDPA
> > > and vfio-mdev are obvious initial targets.
> > > 
> > > *BOTH* are needed.
> > > 
> > > In general any uAPI for PASID should have the option to use either the
> > > mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs virtually
> > > nothing to implement this in the driver as PASID is just a number, and
> > > gives so much more flexability.
> > > 
> > Not really nothing in terms of PASID life cycles. For example, if user
> > uses uacce interface to open an accelerator, it gets an FD_acc. Then it
> > opens /dev/sva to allocate PASID then get another FD_pasid. Then we
> > pass FD_pasid to the driver to bind page tables, perhaps multiple
> > drivers. Now we have to worry about If FD_pasid gets closed before
> > FD_acc(s) closed and all these race conditions.
> 
> 
> I'm not sure I understand this. But this demonstrates the flexibility of an
> unified uAPI. E.g it allows vDPA and VFIO device to use the same PAISD which
> can be shared with a process in the guest.
> 
> For the race condition, it could be probably solved with refcnt.

Yep

Jason

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-17  3:53                                             ` Jason Wang
  2020-09-17 17:31                                               ` Jason Gunthorpe
@ 2020-09-17 18:17                                               ` Jacob Pan (Jun)
  2020-09-18  3:58                                                 ` Jason Wang
  1 sibling, 1 reply; 81+ messages in thread
From: Jacob Pan (Jun) @ 2020-09-17 18:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jason Gunthorpe, Raj, Ashok, Alex Williamson,
	Jean-Philippe Brucker, Liu Yi L, eric.auger, baolu.lu, joro,
	kevin.tian, jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha,
	iommu, kvm, Michael S. Tsirkin, jacob.jun.pan, jacob.jun.pan

Hi Jason,
On Thu, 17 Sep 2020 11:53:49 +0800, Jason Wang <jasowang@redhat.com>
wrote:

> On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote:
> > Hi Jason,
> > On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com>
> > wrote:
> >  
> >> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:  
> >>> Hi Jason,
> >>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe
> >>> <jgg@nvidia.com> wrote:
> >>>      
> >>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:  
> >>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe
> >>>>> wrote:  
> >>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun)
> >>>>>> wrote:  
> >>>>>>>> If user space wants to bind page tables, create the PASID
> >>>>>>>> with /dev/sva, use ioctls there to setup the page table
> >>>>>>>> the way it wants, then pass the now configured PASID to a
> >>>>>>>> driver that can use it.  
> >>>>>>> Are we talking about bare metal SVA?  
> >>>>>> What a weird term.  
> >>>>> Glad you noticed it at v7 :-)
> >>>>>
> >>>>> Any suggestions on something less weird than
> >>>>> Shared Virtual Addressing? There is a reason why we moved from
> >>>>> SVM to SVA.  
> >>>> SVA is fine, what is "bare metal" supposed to mean?
> >>>>      
> >>> What I meant here is sharing virtual address between DMA and host
> >>> process. This requires devices perform DMA request with PASID and
> >>> use IOMMU first level/stage 1 page tables.
> >>> This can be further divided into 1) user SVA 2) supervisor SVA
> >>> (sharing init_mm)
> >>>
> >>> My point is that /dev/sva is not useful here since the driver can
> >>> perform PASID allocation while doing SVA bind.  
> >> No, you are thinking too small.
> >>
> >> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the
> >> SVA. 
> > Could you point to me the SVA UAPI? I couldn't find it in the
> > mainline. Seems VDPA uses VHOST interface?  
> 
> 
> It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h.
> 
Thanks for the pointer, for complete vSVA functionality we would need
1 TLB flush (IOTLB and PASID cache etc.)
2 PASID alloc/free
3 bind/unbind page tables or PASID tables
4 Page request service

Seems vhost_iotlb_msg can be used for #1 partially. And the
proposal is to pluck out the rest into /dev/sda? Seems awkward as Alex
pointed out earlier for similar situation in VFIO.

> 
> >  
> >> When VDPA is used by DPDK it makes sense that the PASID will be SVA
> >> and 1:1 with the mm_struct.
> >>  
> > I still don't see why bare metal DPDK needs to get a handle of the
> > PASID.  
> 
> 
> My understanding is that it may:
> 
> - have a unified uAPI with vSVA: alloc, bind, unbind, free
Got your point, but vSVA needs more than these

> - leave the binding policy to userspace instead of the using a
> implied one in the kenrel
> 
Only if necessary.

> 
> > Perhaps the SVA patch would explain. Or are you talking about
> > vDPA DPDK process that is used to support virtio-net-pmd in the
> > guest? 
> >> When VDPA is used by qemu it makes sense that the PASID will be an
> >> arbitary IOVA map constructed to be 1:1 with the guest vCPU
> >> physical map. /dev/sva allows a single uAPI to do this kind of
> >> setup, and qemu can support it while supporting a range of SVA
> >> kernel drivers. VDPA and vfio-mdev are obvious initial targets.
> >>
> >> *BOTH* are needed.
> >>
> >> In general any uAPI for PASID should have the option to use either
> >> the mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs
> >> virtually nothing to implement this in the driver as PASID is just
> >> a number, and gives so much more flexability.
> >>  
> > Not really nothing in terms of PASID life cycles. For example, if
> > user uses uacce interface to open an accelerator, it gets an
> > FD_acc. Then it opens /dev/sva to allocate PASID then get another
> > FD_pasid. Then we pass FD_pasid to the driver to bind page tables,
> > perhaps multiple drivers. Now we have to worry about If FD_pasid
> > gets closed before FD_acc(s) closed and all these race conditions.  
> 
> 
> I'm not sure I understand this. But this demonstrates the flexibility
> of an unified uAPI. E.g it allows vDPA and VFIO device to use the
> same PAISD which can be shared with a process in the guest.
> 
This is for user DMA not for vSVA. I was contending that /dev/sva
creates unnecessary steps for such usage.

For vSVA, I think vDPA and VFIO can potentially share but I am not
seeing convincing benefits.

If a guest process wants to do SVA with a VFIO assigned device and a
vDPA-backed virtio-net at the same time, it might be a limitation if
PASID is not managed via a common interface. But I am not sure how vDPA
SVA support will look like, does it support gIOVA? need virtio IOMMU?

> For the race condition, it could be probably solved with refcnt.
> 
Agreed but the best solution might be not to have the problem in the
first place :)

> Thanks
> 
> 
> >
> > If we do not expose FD_pasid to the user, the teardown is much
> > simpler and streamlined. Following each FD_acc close, PASID unbind
> > is performed. 
> >>> Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev
> >>> will be introduced later.  
> >> Last patch is:
> >>
> >>    vfio/type1: Add vSVA support for IOMMU-backed mdevs
> >>
> >> So pretty hard to see how this is not about vfio-mdev, at least a
> >> little..
> >>
> >> Jason  
> >
> > Thanks,
> >
> > Jacob
> >  
> 


Thanks,

Jacob

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-09-17 18:17                                               ` Jacob Pan (Jun)
@ 2020-09-18  3:58                                                 ` Jason Wang
  0 siblings, 0 replies; 81+ messages in thread
From: Jason Wang @ 2020-09-18  3:58 UTC (permalink / raw)
  To: Jacob Pan (Jun)
  Cc: Jason Gunthorpe, Raj, Ashok, Alex Williamson,
	Jean-Philippe Brucker, Liu Yi L, eric.auger, baolu.lu, joro,
	kevin.tian, jun.j.tian, yi.y.sun, peterx, hao.wu, stefanha,
	iommu, kvm, Michael S. Tsirkin, jacob.jun.pan


On 2020/9/18 上午2:17, Jacob Pan (Jun) wrote:
> Hi Jason,
> On Thu, 17 Sep 2020 11:53:49 +0800, Jason Wang <jasowang@redhat.com>
> wrote:
>
>> On 2020/9/17 上午7:09, Jacob Pan (Jun) wrote:
>>> Hi Jason,
>>> On Wed, 16 Sep 2020 15:38:41 -0300, Jason Gunthorpe <jgg@nvidia.com>
>>> wrote:
>>>   
>>>> On Wed, Sep 16, 2020 at 11:21:10AM -0700, Jacob Pan (Jun) wrote:
>>>>> Hi Jason,
>>>>> On Wed, 16 Sep 2020 14:01:13 -0300, Jason Gunthorpe
>>>>> <jgg@nvidia.com> wrote:
>>>>>       
>>>>>> On Wed, Sep 16, 2020 at 09:33:43AM -0700, Raj, Ashok wrote:
>>>>>>> On Wed, Sep 16, 2020 at 12:07:54PM -0300, Jason Gunthorpe
>>>>>>> wrote:
>>>>>>>> On Tue, Sep 15, 2020 at 05:22:26PM -0700, Jacob Pan (Jun)
>>>>>>>> wrote:
>>>>>>>>>> If user space wants to bind page tables, create the PASID
>>>>>>>>>> with /dev/sva, use ioctls there to setup the page table
>>>>>>>>>> the way it wants, then pass the now configured PASID to a
>>>>>>>>>> driver that can use it.
>>>>>>>>> Are we talking about bare metal SVA?
>>>>>>>> What a weird term.
>>>>>>> Glad you noticed it at v7 :-)
>>>>>>>
>>>>>>> Any suggestions on something less weird than
>>>>>>> Shared Virtual Addressing? There is a reason why we moved from
>>>>>>> SVM to SVA.
>>>>>> SVA is fine, what is "bare metal" supposed to mean?
>>>>>>       
>>>>> What I meant here is sharing virtual address between DMA and host
>>>>> process. This requires devices perform DMA request with PASID and
>>>>> use IOMMU first level/stage 1 page tables.
>>>>> This can be further divided into 1) user SVA 2) supervisor SVA
>>>>> (sharing init_mm)
>>>>>
>>>>> My point is that /dev/sva is not useful here since the driver can
>>>>> perform PASID allocation while doing SVA bind.
>>>> No, you are thinking too small.
>>>>
>>>> Look at VDPA, it has a SVA uAPI. Some HW might use PASID for the
>>>> SVA.
>>> Could you point to me the SVA UAPI? I couldn't find it in the
>>> mainline. Seems VDPA uses VHOST interface?
>>
>> It's the vhost_iotlb_msg defined in uapi/linux/vhost_types.h.
>>
> Thanks for the pointer, for complete vSVA functionality we would need
> 1 TLB flush (IOTLB and PASID cache etc.)
> 2 PASID alloc/free
> 3 bind/unbind page tables or PASID tables
> 4 Page request service
>
> Seems vhost_iotlb_msg can be used for #1 partially. And the
> proposal is to pluck out the rest into /dev/sda? Seems awkward as Alex
> pointed out earlier for similar situation in VFIO.


Consider it doesn't have any PASID support yet, my understanding is that 
if we go with /dev/sva:

- vhost uAPI will still keep the uAPI for associating an ASID to a 
specific virtqueue
- except for this, we can use /dev/sva for all the rest (P)ASID operations


>
>>>   
>>>> When VDPA is used by DPDK it makes sense that the PASID will be SVA
>>>> and 1:1 with the mm_struct.
>>>>   
>>> I still don't see why bare metal DPDK needs to get a handle of the
>>> PASID.
>>
>> My understanding is that it may:
>>
>> - have a unified uAPI with vSVA: alloc, bind, unbind, free
> Got your point, but vSVA needs more than these


Yes it's just a subset of what vSVA required.


>
>> - leave the binding policy to userspace instead of the using a
>> implied one in the kenrel
>>
> Only if necessary.


Yes, I think it's all about visibility(flexibility) and**manageability.

Consider device has queue A, B, C. We will only dedicated queue A, B for 
one PASID(for vSVA) and C with another PASID(for SVA). It looks to me 
the current sva_bind() API doesn't support this. We still need an API 
for allocating a PASID for SVA and assign it to the (mediated) device.  
This case is pretty common for implementing a shadow queue for a guest.


>
>>> Perhaps the SVA patch would explain. Or are you talking about
>>> vDPA DPDK process that is used to support virtio-net-pmd in the
>>> guest?
>>>> When VDPA is used by qemu it makes sense that the PASID will be an
>>>> arbitary IOVA map constructed to be 1:1 with the guest vCPU
>>>> physical map. /dev/sva allows a single uAPI to do this kind of
>>>> setup, and qemu can support it while supporting a range of SVA
>>>> kernel drivers. VDPA and vfio-mdev are obvious initial targets.
>>>>
>>>> *BOTH* are needed.
>>>>
>>>> In general any uAPI for PASID should have the option to use either
>>>> the mm_struct SVA PASID *OR* a PASID from /dev/sva. It costs
>>>> virtually nothing to implement this in the driver as PASID is just
>>>> a number, and gives so much more flexability.
>>>>   
>>> Not really nothing in terms of PASID life cycles. For example, if
>>> user uses uacce interface to open an accelerator, it gets an
>>> FD_acc. Then it opens /dev/sva to allocate PASID then get another
>>> FD_pasid. Then we pass FD_pasid to the driver to bind page tables,
>>> perhaps multiple drivers. Now we have to worry about If FD_pasid
>>> gets closed before FD_acc(s) closed and all these race conditions.
>>
>> I'm not sure I understand this. But this demonstrates the flexibility
>> of an unified uAPI. E.g it allows vDPA and VFIO device to use the
>> same PAISD which can be shared with a process in the guest.
>>
> This is for user DMA not for vSVA. I was contending that /dev/sva
> creates unnecessary steps for such usage.


A question here is where the PASID management is expected to be done. 
I'm not quite sure the silent 1:1 binding done in intel_svm_bind_mm() 
can satisfy the requirement for management layer.


>
> For vSVA, I think vDPA and VFIO can potentially share but I am not
> seeing convincing benefits.
>
> If a guest process wants to do SVA with a VFIO assigned device and a
> vDPA-backed virtio-net at the same time, it might be a limitation if
> PASID is not managed via a common interface.


Yes.


> But I am not sure how vDPA
> SVA support will look like, does it support gIOVA? need virtio IOMMU?


Yes, it supports gIOVA and it should work with any type of vIOMMU. I 
think vDPA will start from Intel vIOMMU support in Qemu.

For virtio IOMMU, we will probably support it in the future consider it 
doesn't have any SVA capability, and it doesn't use a page table that 
can be nested via a hardware IOMMU.


>> For the race condition, it could be probably solved with refcnt.
>>
> Agreed but the best solution might be not to have the problem in the
> first place :)


I agree, it's only worth to bother if it has real benefits.

Thanks


>
>> Thanks
>>
>>
>>> If we do not expose FD_pasid to the user, the teardown is much
>>> simpler and streamlined. Following each FD_acc close, PASID unbind
>>> is performed.
>>>>> Yi can correct me but this set is is about VFIO-PCI, VFIO-mdev
>>>>> will be introduced later.
>>>> Last patch is:
>>>>
>>>>     vfio/type1: Add vSVA support for IOMMU-backed mdevs
>>>>
>>>> So pretty hard to see how this is not about vfio-mdev, at least a
>>>> little..
>>>>
>>>> Jason
>>> Thanks,
>>>
>>> Jacob
>>>   
>
> Thanks,
>
> Jacob
>


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2020-09-10 10:45 ` [PATCH v7 02/16] iommu/smmu: Report empty " Liu Yi L
@ 2021-01-12  6:50   ` Vivek Gautam
  2021-01-12  9:21     ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Vivek Gautam @ 2021-01-12  6:50 UTC (permalink / raw)
  To: Liu Yi L
  Cc: alex.williamson, eric.auger, baolu.lu, Joerg Roedel,
	Jean-Philippe Brucker, kevin.tian, ashok.raj, kvm, stefanha,
	jun.j.tian,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	yi.y.sun, Robin Murphy, Will Deacon, jasowang, hao.wu

Hi Yi,


On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote:
>
> This patch is added as instead of returning a boolean for DOMAIN_ATTR_NESTING,
> iommu_domain_get_attr() should return an iommu_nesting_info handle. For
> now, return an empty nesting info struct for now as true nesting is not
> yet supported by the SMMUs.
>
> Cc: Will Deacon <will@kernel.org>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
> v5 -> v6:
> *) add review-by from Eric Auger.
>
> v4 -> v5:
> *) address comments from Eric Auger.
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 +++++++++++++++++++++++++++--
>  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 29 +++++++++++++++++++++++++++--
>  2 files changed, 54 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 7196207..016e2e5 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3019,6 +3019,32 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>         return group;
>  }
>
> +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain *smmu_domain,
> +                                       void *data)
> +{
> +       struct iommu_nesting_info *info = (struct iommu_nesting_info *)data;
> +       unsigned int size;
> +
> +       if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> +               return -ENODEV;
> +
> +       size = sizeof(struct iommu_nesting_info);
> +
> +       /*
> +        * if provided buffer size is smaller than expected, should
> +        * return 0 and also the expected buffer size to caller.
> +        */
> +       if (info->argsz < size) {
> +               info->argsz = size;
> +               return 0;
> +       }
> +
> +       /* report an empty iommu_nesting_info for now */
> +       memset(info, 0x0, size);
> +       info->argsz = size;
> +       return 0;
> +}
> +
>  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>                                     enum iommu_attr attr, void *data)
>  {
> @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>         case IOMMU_DOMAIN_UNMANAGED:
>                 switch (attr) {
>                 case DOMAIN_ATTR_NESTING:
> -                       *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
> -                       return 0;
> +                       return arm_smmu_domain_nesting_info(smmu_domain, data);

Thanks for the patch.
This would unnecessarily overflow 'data' for any caller that's expecting only
an int data. Dump from one such issue that I was seeing when testing
this change along with local kvmtool changes is pasted below [1].

I could get around with the issue by adding another (iommu_attr) -
DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).

Thanks & regards
Vivek

[1]--------------
[  811.756516] vfio-pci 0000:08:00.1: vfio_ecap_init: hiding ecap 0x1b@0x108
[  811.756516] Kernel panic - not syncing: stack-protector: Kernel
stack is corrupted in: vfio_pci_open+0x644/0x648
[  811.756516] CPU: 0 PID: 175 Comm: lkvm-cleanup-ne Not tainted
5.10.0-rc5-00096-gf015061e14cf #43
[  811.756516] Call trace:
[  811.756516]  dump_backtrace+0x0/0x1b0
[  811.756516]  show_stack+0x18/0x68
[  811.756516]  dump_stack+0xd8/0x134
[  811.756516]  panic+0x174/0x33c
[  811.756516]  __stack_chk_fail+0x3c/0x40
[  811.756516]  vfio_pci_open+0x644/0x648
[  811.756516]  vfio_group_fops_unl_ioctl+0x4bc/0x648
[  811.756516]  0x0
[  811.756516] SMP: stopping secondary CPUs
[  811.756597] Kernel Offset: disabled
[  811.756597] CPU features: 0x0040006,6a00aa38
[  811.756602] Memory Limit: none
[  811.768497] ---[ end Kernel panic - not syncing: stack-protector:
Kernel stack is corrupted in: vfio_pci_open+0x644/0x648 ]
-------------

^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-01-12  6:50   ` Vivek Gautam
@ 2021-01-12  9:21     ` Liu, Yi L
  2021-01-12 11:05       ` Vivek Gautam
  0 siblings, 1 reply; 81+ messages in thread
From: Liu, Yi L @ 2021-01-12  9:21 UTC (permalink / raw)
  To: Vivek Gautam
  Cc: alex.williamson, eric.auger, baolu.lu, Joerg Roedel,
	Jean-Philippe Brucker, Tian, Kevin, Raj, Ashok, kvm, stefanha,
	Tian, Jun J,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	Sun, Yi Y, Robin Murphy, Will Deacon, jasowang, Wu, Hao

Hi Vivek,

> From: Vivek Gautam <vivek.gautam@arm.com>
> Sent: Tuesday, January 12, 2021 2:50 PM
> 
> Hi Yi,
> 
> 
> On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote:
> >
> > This patch is added as instead of returning a boolean for
> DOMAIN_ATTR_NESTING,
> > iommu_domain_get_attr() should return an iommu_nesting_info handle.
> For
> > now, return an empty nesting info struct for now as true nesting is not
> > yet supported by the SMMUs.
> >
> > Cc: Will Deacon <will@kernel.org>
> > Cc: Robin Murphy <robin.murphy@arm.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > ---
> > v5 -> v6:
> > *) add review-by from Eric Auger.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29
> +++++++++++++++++++++++++++--
> >  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 29
> +++++++++++++++++++++++++++--
> >  2 files changed, 54 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 7196207..016e2e5 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3019,6 +3019,32 @@ static struct iommu_group
> *arm_smmu_device_group(struct device *dev)
> >         return group;
> >  }
> >
> > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain
> *smmu_domain,
> > +                                       void *data)
> > +{
> > +       struct iommu_nesting_info *info = (struct iommu_nesting_info
> *)data;
> > +       unsigned int size;
> > +
> > +       if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > +               return -ENODEV;
> > +
> > +       size = sizeof(struct iommu_nesting_info);
> > +
> > +       /*
> > +        * if provided buffer size is smaller than expected, should
> > +        * return 0 and also the expected buffer size to caller.
> > +        */
> > +       if (info->argsz < size) {
> > +               info->argsz = size;
> > +               return 0;
> > +       }
> > +
> > +       /* report an empty iommu_nesting_info for now */
> > +       memset(info, 0x0, size);
> > +       info->argsz = size;
> > +       return 0;
> > +}
> > +
> >  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> >                                     enum iommu_attr attr, void *data)
> >  {
> > @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct
> iommu_domain *domain,
> >         case IOMMU_DOMAIN_UNMANAGED:
> >                 switch (attr) {
> >                 case DOMAIN_ATTR_NESTING:
> > -                       *(int *)data = (smmu_domain->stage ==
> ARM_SMMU_DOMAIN_NESTED);
> > -                       return 0;
> > +                       return arm_smmu_domain_nesting_info(smmu_domain,
> data);
> 
> Thanks for the patch.
> This would unnecessarily overflow 'data' for any caller that's expecting only
> an int data. Dump from one such issue that I was seeing when testing
> this change along with local kvmtool changes is pasted below [1].
> 
> I could get around with the issue by adding another (iommu_attr) -
> DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).

nice to hear from you. At first, we planned to have a separate iommu_attr
for getting nesting_info. However, we considered there is no existing user
which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu nesting
info. Could you share me the code base you are using? If the error you
encountered is due to this change, so there should be a place which gets
DOMAIN_ATTR_NESTING.

Regards,
Yi Liu

> Thanks & regards
> Vivek
> 
> [1]--------------
> [  811.756516] vfio-pci 0000:08:00.1: vfio_ecap_init: hiding ecap
> 0x1b@0x108
> [  811.756516] Kernel panic - not syncing: stack-protector: Kernel
> stack is corrupted in: vfio_pci_open+0x644/0x648
> [  811.756516] CPU: 0 PID: 175 Comm: lkvm-cleanup-ne Not tainted
> 5.10.0-rc5-00096-gf015061e14cf #43
> [  811.756516] Call trace:
> [  811.756516]  dump_backtrace+0x0/0x1b0
> [  811.756516]  show_stack+0x18/0x68
> [  811.756516]  dump_stack+0xd8/0x134
> [  811.756516]  panic+0x174/0x33c
> [  811.756516]  __stack_chk_fail+0x3c/0x40
> [  811.756516]  vfio_pci_open+0x644/0x648
> [  811.756516]  vfio_group_fops_unl_ioctl+0x4bc/0x648
> [  811.756516]  0x0
> [  811.756516] SMP: stopping secondary CPUs
> [  811.756597] Kernel Offset: disabled
> [  811.756597] CPU features: 0x0040006,6a00aa38
> [  811.756602] Memory Limit: none
> [  811.768497] ---[ end Kernel panic - not syncing: stack-protector:
> Kernel stack is corrupted in: vfio_pci_open+0x644/0x648 ]
> -------------

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-01-12  9:21     ` Liu, Yi L
@ 2021-01-12 11:05       ` Vivek Gautam
  2021-01-13  5:56         ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Vivek Gautam @ 2021-01-12 11:05 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Sun, Yi Y, Jean-Philippe Brucker, Tian, Kevin, Raj, Ashok, kvm,
	jasowang, stefanha, Will Deacon,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	alex.williamson, Wu, Hao, Robin Murphy, Tian, Jun J

Hi Yi,


On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>
> Hi Vivek,
>
> > From: Vivek Gautam <vivek.gautam@arm.com>
> > Sent: Tuesday, January 12, 2021 2:50 PM
> >
> > Hi Yi,
> >
> >
> > On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > This patch is added as instead of returning a boolean for
> > DOMAIN_ATTR_NESTING,
> > > iommu_domain_get_attr() should return an iommu_nesting_info handle.
> > For
> > > now, return an empty nesting info struct for now as true nesting is not
> > > yet supported by the SMMUs.
> > >
> > > Cc: Will Deacon <will@kernel.org>
> > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > > ---
> > > v5 -> v6:
> > > *) add review-by from Eric Auger.
> > >
> > > v4 -> v5:
> > > *) address comments from Eric Auger.
> > > ---
> > >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29
> > +++++++++++++++++++++++++++--
> > >  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 29
> > +++++++++++++++++++++++++++--
> > >  2 files changed, 54 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index 7196207..016e2e5 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -3019,6 +3019,32 @@ static struct iommu_group
> > *arm_smmu_device_group(struct device *dev)
> > >         return group;
> > >  }
> > >
> > > +static int arm_smmu_domain_nesting_info(struct arm_smmu_domain
> > *smmu_domain,
> > > +                                       void *data)
> > > +{
> > > +       struct iommu_nesting_info *info = (struct iommu_nesting_info
> > *)data;
> > > +       unsigned int size;
> > > +
> > > +       if (!info || smmu_domain->stage != ARM_SMMU_DOMAIN_NESTED)
> > > +               return -ENODEV;
> > > +
> > > +       size = sizeof(struct iommu_nesting_info);
> > > +
> > > +       /*
> > > +        * if provided buffer size is smaller than expected, should
> > > +        * return 0 and also the expected buffer size to caller.
> > > +        */
> > > +       if (info->argsz < size) {
> > > +               info->argsz = size;
> > > +               return 0;
> > > +       }
> > > +
> > > +       /* report an empty iommu_nesting_info for now */
> > > +       memset(info, 0x0, size);
> > > +       info->argsz = size;
> > > +       return 0;
> > > +}
> > > +
> > >  static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> > >                                     enum iommu_attr attr, void *data)
> > >  {
> > > @@ -3028,8 +3054,7 @@ static int arm_smmu_domain_get_attr(struct
> > iommu_domain *domain,
> > >         case IOMMU_DOMAIN_UNMANAGED:
> > >                 switch (attr) {
> > >                 case DOMAIN_ATTR_NESTING:
> > > -                       *(int *)data = (smmu_domain->stage ==
> > ARM_SMMU_DOMAIN_NESTED);
> > > -                       return 0;
> > > +                       return arm_smmu_domain_nesting_info(smmu_domain,
> > data);
> >
> > Thanks for the patch.
> > This would unnecessarily overflow 'data' for any caller that's expecting only
> > an int data. Dump from one such issue that I was seeing when testing
> > this change along with local kvmtool changes is pasted below [1].
> >
> > I could get around with the issue by adding another (iommu_attr) -
> > DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).
>
> nice to hear from you. At first, we planned to have a separate iommu_attr
> for getting nesting_info. However, we considered there is no existing user
> which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu nesting
> info. Could you share me the code base you are using? If the error you
> encountered is due to this change, so there should be a place which gets
> DOMAIN_ATTR_NESTING.

I am currently working on top of Eric's tree for nested stage support [1].
My best guess was that the vfio_pci_dma_fault_init() method [2] that is
requesting DOMAIN_ATTR_NESTING causes stack overflow, and corruption.
That's when I added a new attribute.

I will soon publish my patches to the list for review. Let me know
your thoughts.

[1] https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13
[2] https://github.com/eauger/linux/blob/5.10-rc4-2stage-v13/drivers/vfio/pci/vfio_pci.c#L494

Thanks
Vivek

>
> Regards,
> Yi Liu

[snip]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-01-12 11:05       ` Vivek Gautam
@ 2021-01-13  5:56         ` Liu, Yi L
  2021-01-19 10:03           ` Auger Eric
  0 siblings, 1 reply; 81+ messages in thread
From: Liu, Yi L @ 2021-01-13  5:56 UTC (permalink / raw)
  To: Vivek Gautam, eric.auger
  Cc: Sun, Yi Y, Jean-Philippe Brucker, Tian, Kevin, Raj, Ashok, kvm,
	jasowang, stefanha, Will Deacon,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	alex.williamson, Wu, Hao, Robin Murphy, Tian, Jun J

Hi Vivek,

> From: Vivek Gautam <vivek.gautam@arm.com>
> Sent: Tuesday, January 12, 2021 7:06 PM
> 
> Hi Yi,
> 
> 
> On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
> >
> > Hi Vivek,
> >
> > > From: Vivek Gautam <vivek.gautam@arm.com>
> > > Sent: Tuesday, January 12, 2021 2:50 PM
> > >
> > > Hi Yi,
> > >
> > >
> > > On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote:
> > > >
> > > > This patch is added as instead of returning a boolean for
> > > DOMAIN_ATTR_NESTING,
> > > > iommu_domain_get_attr() should return an iommu_nesting_info
> handle.
> > > For
> > > > now, return an empty nesting info struct for now as true nesting is not
> > > > yet supported by the SMMUs.
> > > >
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Cc: Robin Murphy <robin.murphy@arm.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> > > > ---
> > > > v5 -> v6:
> > > > *) add review-by from Eric Auger.
> > > >
> > > > v4 -> v5:
> > > > *) address comments from Eric Auger.
> > > > ---
> > > >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29
> > > +++++++++++++++++++++++++++--
> > > >  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 29
> > > +++++++++++++++++++++++++++--
> > > >  2 files changed, 54 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > index 7196207..016e2e5 100644
> > > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > > @@ -3019,6 +3019,32 @@ static struct iommu_group
> > > *arm_smmu_device_group(struct device *dev)
> > > >         return group;
> > > >  }
> > > >
> > > > +static int arm_smmu_domain_nesting_info(struct
> arm_smmu_domain
> > > *smmu_domain,
> > > > +                                       void *data)
> > > > +{
> > > > +       struct iommu_nesting_info *info = (struct iommu_nesting_info
> > > *)data;
> > > > +       unsigned int size;
> > > > +
> > > > +       if (!info || smmu_domain->stage !=
> ARM_SMMU_DOMAIN_NESTED)
> > > > +               return -ENODEV;
> > > > +
> > > > +       size = sizeof(struct iommu_nesting_info);
> > > > +
> > > > +       /*
> > > > +        * if provided buffer size is smaller than expected, should
> > > > +        * return 0 and also the expected buffer size to caller.
> > > > +        */
> > > > +       if (info->argsz < size) {
> > > > +               info->argsz = size;
> > > > +               return 0;
> > > > +       }
> > > > +
> > > > +       /* report an empty iommu_nesting_info for now */
> > > > +       memset(info, 0x0, size);
> > > > +       info->argsz = size;
> > > > +       return 0;
> > > > +}
> > > > +
> > > >  static int arm_smmu_domain_get_attr(struct iommu_domain
> *domain,
> > > >                                     enum iommu_attr attr, void *data)
> > > >  {
> > > > @@ -3028,8 +3054,7 @@ static int
> arm_smmu_domain_get_attr(struct
> > > iommu_domain *domain,
> > > >         case IOMMU_DOMAIN_UNMANAGED:
> > > >                 switch (attr) {
> > > >                 case DOMAIN_ATTR_NESTING:
> > > > -                       *(int *)data = (smmu_domain->stage ==
> > > ARM_SMMU_DOMAIN_NESTED);
> > > > -                       return 0;
> > > > +                       return
> arm_smmu_domain_nesting_info(smmu_domain,
> > > data);
> > >
> > > Thanks for the patch.
> > > This would unnecessarily overflow 'data' for any caller that's expecting
> only
> > > an int data. Dump from one such issue that I was seeing when testing
> > > this change along with local kvmtool changes is pasted below [1].
> > >
> > > I could get around with the issue by adding another (iommu_attr) -
> > > DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).
> >
> > nice to hear from you. At first, we planned to have a separate iommu_attr
> > for getting nesting_info. However, we considered there is no existing user
> > which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu
> nesting
> > info. Could you share me the code base you are using? If the error you
> > encountered is due to this change, so there should be a place which gets
> > DOMAIN_ATTR_NESTING.
> 
> I am currently working on top of Eric's tree for nested stage support [1].
> My best guess was that the vfio_pci_dma_fault_init() method [2] that is
> requesting DOMAIN_ATTR_NESTING causes stack overflow, and corruption.
> That's when I added a new attribute.

I see. I think there needs a change in the code there. Should also expect
a nesting_info returned instead of an int anymore. @Eric, how about your
opinion?

	domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
	ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info);
	if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
		/*
		 * No need go futher as no page request service support.
		 */
		return 0;
	}

https://github.com/luxis1999/linux-vsva/blob/vsva-linux-5.9-rc6-v8%2BPRQ/drivers/vfio/pci/vfio_pci.c

Regards,
Yi Liu

> I will soon publish my patches to the list for review. Let me know
> your thoughts.
> 
> [1] https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13
> [2] https://github.com/eauger/linux/blob/5.10-rc4-2stage-
> v13/drivers/vfio/pci/vfio_pci.c#L494
> 
> Thanks
> Vivek
> 
> >
> > Regards,
> > Yi Liu
> 
> [snip]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-01-13  5:56         ` Liu, Yi L
@ 2021-01-19 10:03           ` Auger Eric
  2021-01-23  8:59             ` Liu, Yi L
  0 siblings, 1 reply; 81+ messages in thread
From: Auger Eric @ 2021-01-19 10:03 UTC (permalink / raw)
  To: Liu, Yi L, Vivek Gautam
  Cc: Sun, Yi Y, Jean-Philippe Brucker, Tian, Kevin, Raj, Ashok, kvm,
	jasowang, stefanha, Will Deacon, list@263.net:IOMMU DRIVERS,
	Joerg Roedel, iommu, alex.williamson, Wu, Hao, Robin Murphy,
	Tian, Jun J

Hi Yi, Vivek,

On 1/13/21 6:56 AM, Liu, Yi L wrote:
> Hi Vivek,
> 
>> From: Vivek Gautam <vivek.gautam@arm.com>
>> Sent: Tuesday, January 12, 2021 7:06 PM
>>
>> Hi Yi,
>>
>>
>> On Tue, Jan 12, 2021 at 2:51 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>>>
>>> Hi Vivek,
>>>
>>>> From: Vivek Gautam <vivek.gautam@arm.com>
>>>> Sent: Tuesday, January 12, 2021 2:50 PM
>>>>
>>>> Hi Yi,
>>>>
>>>>
>>>> On Thu, Sep 10, 2020 at 4:13 PM Liu Yi L <yi.l.liu@intel.com> wrote:
>>>>>
>>>>> This patch is added as instead of returning a boolean for
>>>> DOMAIN_ATTR_NESTING,
>>>>> iommu_domain_get_attr() should return an iommu_nesting_info
>> handle.
>>>> For
>>>>> now, return an empty nesting info struct for now as true nesting is not
>>>>> yet supported by the SMMUs.
>>>>>
>>>>> Cc: Will Deacon <will@kernel.org>
>>>>> Cc: Robin Murphy <robin.murphy@arm.com>
>>>>> Cc: Eric Auger <eric.auger@redhat.com>
>>>>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>>> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>>>> ---
>>>>> v5 -> v6:
>>>>> *) add review-by from Eric Auger.
>>>>>
>>>>> v4 -> v5:
>>>>> *) address comments from Eric Auger.
>>>>> ---
>>>>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29
>>>> +++++++++++++++++++++++++++--
>>>>>  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 29
>>>> +++++++++++++++++++++++++++--
>>>>>  2 files changed, 54 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> index 7196207..016e2e5 100644
>>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>>> @@ -3019,6 +3019,32 @@ static struct iommu_group
>>>> *arm_smmu_device_group(struct device *dev)
>>>>>         return group;
>>>>>  }
>>>>>
>>>>> +static int arm_smmu_domain_nesting_info(struct
>> arm_smmu_domain
>>>> *smmu_domain,
>>>>> +                                       void *data)
>>>>> +{
>>>>> +       struct iommu_nesting_info *info = (struct iommu_nesting_info
>>>> *)data;
>>>>> +       unsigned int size;
>>>>> +
>>>>> +       if (!info || smmu_domain->stage !=
>> ARM_SMMU_DOMAIN_NESTED)
>>>>> +               return -ENODEV;
>>>>> +
>>>>> +       size = sizeof(struct iommu_nesting_info);
>>>>> +
>>>>> +       /*
>>>>> +        * if provided buffer size is smaller than expected, should
>>>>> +        * return 0 and also the expected buffer size to caller.
>>>>> +        */
>>>>> +       if (info->argsz < size) {
>>>>> +               info->argsz = size;
>>>>> +               return 0;
>>>>> +       }
>>>>> +
>>>>> +       /* report an empty iommu_nesting_info for now */
>>>>> +       memset(info, 0x0, size);
>>>>> +       info->argsz = size;
>>>>> +       return 0;
>>>>> +}
>>>>> +
>>>>>  static int arm_smmu_domain_get_attr(struct iommu_domain
>> *domain,
>>>>>                                     enum iommu_attr attr, void *data)
>>>>>  {
>>>>> @@ -3028,8 +3054,7 @@ static int
>> arm_smmu_domain_get_attr(struct
>>>> iommu_domain *domain,
>>>>>         case IOMMU_DOMAIN_UNMANAGED:
>>>>>                 switch (attr) {
>>>>>                 case DOMAIN_ATTR_NESTING:
>>>>> -                       *(int *)data = (smmu_domain->stage ==
>>>> ARM_SMMU_DOMAIN_NESTED);
>>>>> -                       return 0;
>>>>> +                       return
>> arm_smmu_domain_nesting_info(smmu_domain,
>>>> data);
>>>>
>>>> Thanks for the patch.
>>>> This would unnecessarily overflow 'data' for any caller that's expecting
>> only
>>>> an int data. Dump from one such issue that I was seeing when testing
>>>> this change along with local kvmtool changes is pasted below [1].
>>>>
>>>> I could get around with the issue by adding another (iommu_attr) -
>>>> DOMAIN_ATTR_NESTING_INFO that returns (iommu_nesting_info).
>>>
>>> nice to hear from you. At first, we planned to have a separate iommu_attr
>>> for getting nesting_info. However, we considered there is no existing user
>>> which gets DOMAIN_ATTR_NESTING, so we decided to reuse it for iommu
>> nesting
>>> info. Could you share me the code base you are using? If the error you
>>> encountered is due to this change, so there should be a place which gets
>>> DOMAIN_ATTR_NESTING.
>>
>> I am currently working on top of Eric's tree for nested stage support [1].
>> My best guess was that the vfio_pci_dma_fault_init() method [2] that is
>> requesting DOMAIN_ATTR_NESTING causes stack overflow, and corruption.
>> That's when I added a new attribute.
> 
> I see. I think there needs a change in the code there. Should also expect
> a nesting_info returned instead of an int anymore. @Eric, how about your
> opinion?
> 
> 	domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> 	ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info);
> 	if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> 		/*
> 		 * No need go futher as no page request service support.
> 		 */
> 		return 0;
> 	}
Sure I think it is "just" a matter of synchro between the 2 series. Yi,
do you have plans to respin part of
[PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
or would you allow me to embed this patch in my series.

Thanks

Eric
> 
> https://github.com/luxis1999/linux-vsva/blob/vsva-linux-5.9-rc6-v8%2BPRQ/drivers/vfio/pci/vfio_pci.c
> 
> Regards,
> Yi Liu
> 
>> I will soon publish my patches to the list for review. Let me know
>> your thoughts.
>>
>> [1] https://github.com/eauger/linux/tree/5.10-rc4-2stage-v13
>> [2] https://github.com/eauger/linux/blob/5.10-rc4-2stage-
>> v13/drivers/vfio/pci/vfio_pci.c#L494
>>
>> Thanks
>> Vivek
>>
>>>
>>> Regards,
>>> Yi Liu
>>
>> [snip]


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-01-19 10:03           ` Auger Eric
@ 2021-01-23  8:59             ` Liu, Yi L
  2021-02-12  7:14               ` Vivek Gautam
  0 siblings, 1 reply; 81+ messages in thread
From: Liu, Yi L @ 2021-01-23  8:59 UTC (permalink / raw)
  To: Auger Eric, Vivek Gautam
  Cc: Sun, Yi Y, Jean-Philippe Brucker, Tian, Kevin, Raj, Ashok, kvm,
	jasowang, stefanha, Will Deacon, list@263.net:IOMMU DRIVERS,
	Joerg Roedel, iommu, alex.williamson, Wu, Hao, Robin Murphy,
	Tian, Jun J

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Tuesday, January 19, 2021 6:03 PM
> 
> Hi Yi, Vivek,
> 
[...]
> > I see. I think there needs a change in the code there. Should also expect
> > a nesting_info returned instead of an int anymore. @Eric, how about your
> > opinion?
> >
> > 	domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> > 	ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
> &info);
> > 	if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> > 		/*
> > 		 * No need go futher as no page request service support.
> > 		 */
> > 		return 0;
> > 	}
> Sure I think it is "just" a matter of synchro between the 2 series. Yi,

exactly.

> do you have plans to respin part of
> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
> or would you allow me to embed this patch in my series.

My v7 hasn’t touch the prq change yet. So I think it's better for you to
embed it to  your series. ^_^

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-01-23  8:59             ` Liu, Yi L
@ 2021-02-12  7:14               ` Vivek Gautam
  2021-02-12  9:57                 ` Auger Eric
  0 siblings, 1 reply; 81+ messages in thread
From: Vivek Gautam @ 2021-02-12  7:14 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Auger Eric, Jean-Philippe Brucker, Tian, Kevin, alex.williamson,
	Raj, Ashok, kvm, stefanha, jasowang, iommu, Tian, Jun J, Sun,
	Yi Y, Robin Murphy, Will Deacon, Wu, Hao

Hi Yi,


On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>
> Hi Eric,
>
> > From: Auger Eric <eric.auger@redhat.com>
> > Sent: Tuesday, January 19, 2021 6:03 PM
> >
> > Hi Yi, Vivek,
> >
> [...]
> > > I see. I think there needs a change in the code there. Should also expect
> > > a nesting_info returned instead of an int anymore. @Eric, how about your
> > > opinion?
> > >
> > >     domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> > >     ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
> > &info);
> > >     if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> > >             /*
> > >              * No need go futher as no page request service support.
> > >              */
> > >             return 0;
> > >     }
> > Sure I think it is "just" a matter of synchro between the 2 series. Yi,
>
> exactly.
>
> > do you have plans to respin part of
> > [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
> > or would you allow me to embed this patch in my series.
>
> My v7 hasn’t touch the prq change yet. So I think it's better for you to
> embed it to  your series. ^_^
>

Can you please let me know if you have an updated series of these
patches? It will help me to work with virtio-iommu/arm side changes.

Thanks & regards
Vivek

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-02-12  7:14               ` Vivek Gautam
@ 2021-02-12  9:57                 ` Auger Eric
  2021-02-12 10:18                   ` Vivek Kumar Gautam
  2021-03-03  9:44                   ` Liu, Yi L
  0 siblings, 2 replies; 81+ messages in thread
From: Auger Eric @ 2021-02-12  9:57 UTC (permalink / raw)
  To: Vivek Gautam, Liu, Yi L
  Cc: Jean-Philippe Brucker, Tian, Kevin, alex.williamson, Raj, Ashok,
	kvm, stefanha, jasowang, iommu, Tian, Jun J, Sun, Yi Y,
	Robin Murphy, Will Deacon, Wu, Hao

Hi Vivek, Yi,

On 2/12/21 8:14 AM, Vivek Gautam wrote:
> Hi Yi,
> 
> 
> On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>>
>> Hi Eric,
>>
>>> From: Auger Eric <eric.auger@redhat.com>
>>> Sent: Tuesday, January 19, 2021 6:03 PM
>>>
>>> Hi Yi, Vivek,
>>>
>> [...]
>>>> I see. I think there needs a change in the code there. Should also expect
>>>> a nesting_info returned instead of an int anymore. @Eric, how about your
>>>> opinion?
>>>>
>>>>     domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
>>>>     ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
>>> &info);
>>>>     if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
>>>>             /*
>>>>              * No need go futher as no page request service support.
>>>>              */
>>>>             return 0;
>>>>     }
>>> Sure I think it is "just" a matter of synchro between the 2 series. Yi,
>>
>> exactly.
>>
>>> do you have plans to respin part of
>>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
>>> or would you allow me to embed this patch in my series.
>>
>> My v7 hasn’t touch the prq change yet. So I think it's better for you to
>> embed it to  your series. ^_^>>
> 
> Can you please let me know if you have an updated series of these
> patches? It will help me to work with virtio-iommu/arm side changes.

As per the previous discussion, I plan to take those 2 patches in my
SMMUv3 nested stage series:

[PATCH v7 01/16] iommu: Report domain nesting info
[PATCH v7 02/16] iommu/smmu: Report empty domain nesting info

we need to upgrade both since we do not want to report an empty nesting
info anymore, for arm.

Thanks

Eric
> 
> Thanks & regards
> Vivek
> 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-02-12  9:57                 ` Auger Eric
@ 2021-02-12 10:18                   ` Vivek Kumar Gautam
  2021-02-12 11:01                     ` Vivek Kumar Gautam
  2021-03-03  9:44                   ` Liu, Yi L
  1 sibling, 1 reply; 81+ messages in thread
From: Vivek Kumar Gautam @ 2021-02-12 10:18 UTC (permalink / raw)
  To: Auger Eric, Liu, Yi L
  Cc: Jean-Philippe Brucker, Tian, Kevin, alex.williamson, Raj, Ashok,
	kvm, stefanha, jasowang, iommu, Tian, Jun J, Sun, Yi Y,
	Robin Murphy, Will Deacon, Wu, Hao

Hi Eric,


On 2/12/21 3:27 PM, Auger Eric wrote:
> Hi Vivek, Yi,
>
> On 2/12/21 8:14 AM, Vivek Gautam wrote:
>> Hi Yi,
>>
>>
>> On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>>>
>>> Hi Eric,
>>>
>>>> From: Auger Eric <eric.auger@redhat.com>
>>>> Sent: Tuesday, January 19, 2021 6:03 PM
>>>>
>>>> Hi Yi, Vivek,
>>>>
>>> [...]
>>>>> I see. I think there needs a change in the code there. Should also expect
>>>>> a nesting_info returned instead of an int anymore. @Eric, how about your
>>>>> opinion?
>>>>>
>>>>>      domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
>>>>>      ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
>>>> &info);
>>>>>      if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
>>>>>              /*
>>>>>               * No need go futher as no page request service support.
>>>>>               */
>>>>>              return 0;
>>>>>      }
>>>> Sure I think it is "just" a matter of synchro between the 2 series. Yi,
>>>
>>> exactly.
>>>
>>>> do you have plans to respin part of
>>>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
>>>> or would you allow me to embed this patch in my series.
>>>
>>> My v7 hasn’t touch the prq change yet. So I think it's better for you to
>>> embed it to  your series. ^_^>>
>>
>> Can you please let me know if you have an updated series of these
>> patches? It will help me to work with virtio-iommu/arm side changes.
>
> As per the previous discussion, I plan to take those 2 patches in my
> SMMUv3 nested stage series:
>
> [PATCH v7 01/16] iommu: Report domain nesting info
> [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
>
> we need to upgrade both since we do not want to report an empty nesting
> info anymore, for arm.

Absolutely. Let me send the couple of patches that I have been using,
that add arm configuration.

Best regards
Vivek

>
> Thanks
>
> Eric
>>
>> Thanks & regards
>> Vivek
>>
>
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-02-12 10:18                   ` Vivek Kumar Gautam
@ 2021-02-12 11:01                     ` Vivek Kumar Gautam
  0 siblings, 0 replies; 81+ messages in thread
From: Vivek Kumar Gautam @ 2021-02-12 11:01 UTC (permalink / raw)
  To: Auger Eric, Liu, Yi L
  Cc: Jean-Philippe Brucker, Tian, Kevin, alex.williamson, Raj, Ashok,
	kvm, stefanha, jasowang, iommu, Tian, Jun J, Sun, Yi Y,
	Robin Murphy, Will Deacon, Wu, Hao

Hi Eric,


On 2/12/21 3:48 PM, Vivek Kumar Gautam wrote:
> Hi Eric,
>
>
> On 2/12/21 3:27 PM, Auger Eric wrote:
>> Hi Vivek, Yi,
>>
>> On 2/12/21 8:14 AM, Vivek Gautam wrote:
>>> Hi Yi,
>>>
>>>
>>> On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>>> From: Auger Eric <eric.auger@redhat.com>
>>>>> Sent: Tuesday, January 19, 2021 6:03 PM
>>>>>
>>>>> Hi Yi, Vivek,
>>>>>
>>>> [...]
>>>>>> I see. I think there needs a change in the code there. Should also
>>>>>> expect
>>>>>> a nesting_info returned instead of an int anymore. @Eric, how
>>>>>> about your
>>>>>> opinion?
>>>>>>
>>>>>>      domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
>>>>>>      ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
>>>>> &info);
>>>>>>      if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
>>>>>>              /*
>>>>>>               * No need go futher as no page request service support.
>>>>>>               */
>>>>>>              return 0;
>>>>>>      }
>>>>> Sure I think it is "just" a matter of synchro between the 2 series.
>>>>> Yi,
>>>>
>>>> exactly.
>>>>
>>>>> do you have plans to respin part of
>>>>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
>>>>> or would you allow me to embed this patch in my series.
>>>>
>>>> My v7 hasn’t touch the prq change yet. So I think it's better for
>>>> you to
>>>> embed it to  your series. ^_^>>
>>>
>>> Can you please let me know if you have an updated series of these
>>> patches? It will help me to work with virtio-iommu/arm side changes.
>>
>> As per the previous discussion, I plan to take those 2 patches in my
>> SMMUv3 nested stage series:
>>
>> [PATCH v7 01/16] iommu: Report domain nesting info
>> [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
>>
>> we need to upgrade both since we do not want to report an empty nesting
>> info anymore, for arm.
>
> Absolutely. Let me send the couple of patches that I have been using,
> that add arm configuration.

Posted the couple of patches that I have been using -

https://lore.kernel.org/linux-iommu/20210212105859.8445-1-vivek.gautam@arm.com/T/#t


Thanks & regards
Vivek

>
> Best regards
> Vivek
>
>>
>> Thanks
>>
>> Eric
>>>
>>> Thanks & regards
>>> Vivek
>>>
>>
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
  2021-02-12  9:57                 ` Auger Eric
  2021-02-12 10:18                   ` Vivek Kumar Gautam
@ 2021-03-03  9:44                   ` Liu, Yi L
  1 sibling, 0 replies; 81+ messages in thread
From: Liu, Yi L @ 2021-03-03  9:44 UTC (permalink / raw)
  To: Auger Eric, Vivek Gautam
  Cc: Jean-Philippe Brucker, Tian, Kevin, alex.williamson, Raj, Ashok,
	kvm, stefanha, jasowang, iommu, Tian, Jun J, Sun, Yi Y,
	Robin Murphy, Will Deacon, Wu, Hao

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Friday, February 12, 2021 5:58 PM
> 
> Hi Vivek, Yi,
> 
> On 2/12/21 8:14 AM, Vivek Gautam wrote:
> > Hi Yi,
> >
> >
> > On Sat, Jan 23, 2021 at 2:29 PM Liu, Yi L <yi.l.liu@intel.com> wrote:
> >>
> >> Hi Eric,
> >>
> >>> From: Auger Eric <eric.auger@redhat.com>
> >>> Sent: Tuesday, January 19, 2021 6:03 PM
> >>>
> >>> Hi Yi, Vivek,
> >>>
> >> [...]
> >>>> I see. I think there needs a change in the code there. Should also expect
> >>>> a nesting_info returned instead of an int anymore. @Eric, how about your
> >>>> opinion?
> >>>>
> >>>>     domain = iommu_get_domain_for_dev(&vdev->pdev->dev);
> >>>>     ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING,
> >>> &info);
> >>>>     if (ret || !(info.features & IOMMU_NESTING_FEAT_PAGE_RESP)) {
> >>>>             /*
> >>>>              * No need go futher as no page request service support.
> >>>>              */
> >>>>             return 0;
> >>>>     }
> >>> Sure I think it is "just" a matter of synchro between the 2 series. Yi,
> >>
> >> exactly.
> >>
> >>> do you have plans to respin part of
> >>> [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
> >>> or would you allow me to embed this patch in my series.
> >>
> >> My v7 hasn’t touch the prq change yet. So I think it's better for you to
> >> embed it to  your series. ^_^>>
> >
> > Can you please let me know if you have an updated series of these
> > patches? It will help me to work with virtio-iommu/arm side changes.
> 
> As per the previous discussion, I plan to take those 2 patches in my
> SMMUv3 nested stage series:
> 
> [PATCH v7 01/16] iommu: Report domain nesting info
> [PATCH v7 02/16] iommu/smmu: Report empty domain nesting info
> 
> we need to upgrade both since we do not want to report an empty nesting
> info anymore, for arm.

sorry for the late response. I've sent out the updated version. Also,
yeah, please feel free to take the patch in your series.

https://lore.kernel.org/linux-iommu/20210302203545.436623-2-yi.l.liu@intel.com/

Regards,
Yi Liu

> Thanks
> 
> Eric
> >
> > Thanks & regards
> > Vivek
> >


^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2021-03-04  0:27 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 10:45 [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Liu Yi L
2020-09-10 10:45 ` [PATCH v7 01/16] iommu: Report domain nesting info Liu Yi L
2020-09-11 19:38   ` Alex Williamson
2020-09-10 10:45 ` [PATCH v7 02/16] iommu/smmu: Report empty " Liu Yi L
2021-01-12  6:50   ` Vivek Gautam
2021-01-12  9:21     ` Liu, Yi L
2021-01-12 11:05       ` Vivek Gautam
2021-01-13  5:56         ` Liu, Yi L
2021-01-19 10:03           ` Auger Eric
2021-01-23  8:59             ` Liu, Yi L
2021-02-12  7:14               ` Vivek Gautam
2021-02-12  9:57                 ` Auger Eric
2021-02-12 10:18                   ` Vivek Kumar Gautam
2021-02-12 11:01                     ` Vivek Kumar Gautam
2021-03-03  9:44                   ` Liu, Yi L
2020-09-10 10:45 ` [PATCH v7 03/16] vfio/type1: Report iommu nesting info to userspace Liu Yi L
2020-09-11 20:16   ` Alex Williamson
2020-09-12  8:24     ` Liu, Yi L
2020-09-10 10:45 ` [PATCH v7 04/16] vfio: Add PASID allocation/free support Liu Yi L
2020-09-11 20:54   ` Alex Williamson
2020-09-15  4:03     ` Liu, Yi L
2020-09-10 10:45 ` [PATCH v7 05/16] iommu/vt-d: Support setting ioasid set to domain Liu Yi L
2020-09-10 10:45 ` [PATCH v7 06/16] iommu/vt-d: Remove get_task_mm() in bind_gpasid() Liu Yi L
2020-09-10 10:45 ` [PATCH v7 07/16] vfio/type1: Add VFIO_IOMMU_PASID_REQUEST (alloc/free) Liu Yi L
2020-09-11 21:38   ` Alex Williamson
2020-09-10 10:45 ` [PATCH v7 08/16] iommu: Pass domain to sva_unbind_gpasid() Liu Yi L
2020-09-10 10:45 ` [PATCH v7 09/16] iommu/vt-d: Check ownership for PASIDs from user-space Liu Yi L
2020-09-10 10:45 ` [PATCH v7 10/16] vfio/type1: Support binding guest page tables to PASID Liu Yi L
2020-09-11 22:03   ` Alex Williamson
2020-09-12  6:02     ` Liu, Yi L
2020-09-10 10:45 ` [PATCH v7 11/16] vfio/type1: Allow invalidating first-level/stage IOMMU cache Liu Yi L
2020-09-10 10:45 ` [PATCH v7 12/16] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu Yi L
2020-09-10 10:45 ` [PATCH v7 13/16] vfio/pci: Expose PCIe PASID capability to guest Liu Yi L
2020-09-11 22:13   ` Alex Williamson
2020-09-12  7:17     ` Liu, Yi L
2020-09-10 10:45 ` [PATCH v7 14/16] vfio: Document dual stage control Liu Yi L
2020-09-10 10:45 ` [PATCH v7 15/16] iommu/vt-d: Only support nesting when nesting caps are consistent across iommu units Liu Yi L
2020-09-10 10:45 ` [PATCH v7 16/16] iommu/vt-d: Support reporting nesting capability info Liu Yi L
2020-09-14  4:20 ` [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Jason Wang
2020-09-14  8:01   ` Tian, Kevin
2020-09-14  8:57     ` Jason Wang
2020-09-14 10:38       ` Tian, Kevin
2020-09-14 11:38         ` Jason Gunthorpe
2020-09-14 13:31   ` Jean-Philippe Brucker
2020-09-14 13:47     ` Jason Gunthorpe
2020-09-14 16:22       ` Raj, Ashok
2020-09-14 16:33         ` Jason Gunthorpe
2020-09-14 16:58           ` Alex Williamson
2020-09-14 17:41             ` Jason Gunthorpe
2020-09-14 18:23               ` Alex Williamson
2020-09-14 19:00                 ` Jason Gunthorpe
2020-09-14 22:33                   ` Alex Williamson
2020-09-15 14:29                     ` Jason Gunthorpe
2020-09-16  1:19                       ` Tian, Kevin
2020-09-16  8:32                         ` Jean-Philippe Brucker
2020-09-16 14:51                           ` Jason Gunthorpe
2020-09-16 16:20                             ` Jean-Philippe Brucker
2020-09-16 16:32                               ` Jason Gunthorpe
2020-09-16 16:50                                 ` Auger Eric
2020-09-16 14:44                         ` Jason Gunthorpe
2020-09-17  6:01                           ` Tian, Kevin
     [not found]                   ` <20200914224438.GA65940@otc-nc-03>
2020-09-15 11:33                     ` Jason Gunthorpe
2020-09-15 18:11                       ` Raj, Ashok
2020-09-15 18:45                         ` Jason Gunthorpe
2020-09-15 19:26                           ` Raj, Ashok
2020-09-15 23:45                             ` Jason Gunthorpe
2020-09-16  2:33                             ` Jason Wang
2020-09-15 22:08                           ` Jacob Pan
2020-09-15 23:51                             ` Jason Gunthorpe
     [not found]                               ` <20200915171319.00003f59@linux.intel.com>
2020-09-16  1:46                                 ` Lu Baolu
2020-09-16 15:07                                 ` Jason Gunthorpe
2020-09-16 16:33                                   ` Raj, Ashok
2020-09-16 17:01                                     ` Jason Gunthorpe
2020-09-16 18:21                                       ` Jacob Pan (Jun)
2020-09-16 18:38                                         ` Jason Gunthorpe
2020-09-16 23:09                                           ` Jacob Pan (Jun)
2020-09-17  3:53                                             ` Jason Wang
2020-09-17 17:31                                               ` Jason Gunthorpe
2020-09-17 18:17                                               ` Jacob Pan (Jun)
2020-09-18  3:58                                                 ` Jason Wang
2020-09-16  2:29     ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).