KVM Archive on lore.kernel.org
* [RFC v2 0/3] vfio: support Shared Virtual Addressing
@ 2019-10-24 12:26 Liu Yi L
  2019-10-24 12:26 ` [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE Liu Yi L
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Liu Yi L @ 2019-10-24 12:26 UTC
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe.brucker, peterx, iommu, kvm

Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on
Intel platforms, allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.
This series is intended to expose SVA capability to VMs, i.e. to share a
guest application's address space with passthrough devices. The whole SVA
virtualization requires QEMU/VFIO/IOMMU changes. This series includes the
VFIO changes; the QEMU and IOMMU changes are in separate series (listed
under "Related series").

The high-level architecture for SVA virtualization is as below:

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

This patchset has roughly three parts, which correspond to the basic
vSVA support for PCI device assignment:
 1. vfio support for PASID allocation and free from VMs
 2. vfio support for guest PASID binding from VMs
 3. vfio support for IOMMU cache invalidation from VMs

The complete vSVA upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. Page Request Services (PRS) support
    3. Mediated device assignment

This RFC patchset targets phase 1 and works together with the VT-d
driver changes[1] and the QEMU changes[2]. The complete set for vSVA can
be found at:
https://github.com/jacobpan/linux.git:siov_sva.

This patchset does not include the patch that exposes the PASID capability
to the guest. That is expected to be in another series.

Related series:
[1] [PATCH v6 00/10] Nested Shared Virtual Address (SVA) VT-d support:
https://lkml.org/lkml/2019/10/22/953
<This series is based on this kernel series from Jacob Pan>

[2] [RFC v2 00/20] intel_iommu: expose Shared Virtual Addressing to VM
from Yi Liu

Changelog:
	- RFC v1 -> v2:
	  Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
	  RFC v1: https://patchwork.kernel.org/cover/11033699/

Liu Yi L (3):
  vfio: VFIO_IOMMU_CACHE_INVALIDATE
  vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  vfio/type1: bind guest pasid (guest page tables) to host

 drivers/vfio/vfio_iommu_type1.c | 305 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  82 +++++++++++
 2 files changed, 387 insertions(+)

-- 
2.7.4



* [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
  2019-10-24 12:26 [RFC v2 0/3] vfio: support Shared Virtual Addressing Liu Yi L
@ 2019-10-24 12:26 ` Liu Yi L
  2019-10-25  9:14   ` Tian, Kevin
  2019-10-24 12:26 ` [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu Yi L
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Liu Yi L @ 2019-10-24 12:26 UTC
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe.brucker, peterx, iommu, kvm

From: Liu Yi L <yi.l.liu@linux.intel.com>

When the guest "owns" the stage 1 translation structures, the host
IOMMU driver has no knowledge of caching structure updates unless
the guest invalidation requests are trapped and passed down to the
host.

This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl, which aims
to propagate guest stage 1 IOMMU cache invalidations to the host.
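
For illustration only (not part of the patch): a minimal userspace sketch
of the argsz/flags check that the ioctl handler in this patch applies to
the fixed part of the request. All demo_* names are hypothetical; the real
payload is struct iommu_cache_invalidate_info, and a plain memcpy() stands
in for copy_from_user() here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct iommu_cache_invalidate_info. */
struct demo_inv_info {
	uint32_t version;
	uint8_t  cache;
	uint8_t  granularity;
	uint8_t  padding[2];
};

/* Same shape as the struct proposed in the patch: argsz, flags, payload. */
struct demo_cache_invalidate {
	uint32_t argsz;
	uint32_t flags;
	struct demo_inv_info info;
};

/* offsetofend() is a kernel macro; re-created here for userspace. */
#define demo_offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * Copy the fixed part of the request, then reject a short argsz or
 * any non-zero flags. Returns 0 on success or a negative errno value.
 */
static int demo_validate(const void *user_buf, size_t user_len,
			 struct demo_cache_invalidate *out)
{
	size_t minsz = demo_offsetofend(struct demo_cache_invalidate, info);

	if (user_len < minsz)
		return -EFAULT;	/* models a failed copy_from_user() */
	memcpy(out, user_buf, minsz);

	if (out->argsz < minsz || out->flags)
		return -EINVAL;
	return 0;
}
```

A caller would set argsz to the size of the struct it passes and leave
flags at zero; any non-zero flag is rejected, presumably so that the bits
remain available for later extensions.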

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       | 13 ++++++++++
 2 files changed, 68 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 96fddc1d..cd8d3a5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -124,6 +124,34 @@ struct vfio_regions {
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
 					(!list_empty(&iommu->domain_list))
 
+struct domain_capsule {
+	struct iommu_domain *domain;
+	void *data;
+};
+
+/* iommu->lock must be held */
+static int
+vfio_iommu_lookup_dev(struct vfio_iommu *iommu,
+		      int (*fn)(struct device *dev, void *data),
+		      void *data)
+{
+	struct domain_capsule dc = {.data = data};
+	struct vfio_domain *d;
+	struct vfio_group *g;
+	int ret = 0;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		dc.domain = d->domain;
+		list_for_each_entry(g, &d->group_list, next) {
+			ret = iommu_group_for_each_dev(g->iommu_group,
+						       &dc, fn);
+			if (ret)
+				break;
+		}
+	}
+	return ret;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 /*
@@ -2211,6 +2239,15 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_cache_inv_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct vfio_iommu_type1_cache_invalidate *ustruct =
+		(struct vfio_iommu_type1_cache_invalidate *)dc->data;
+
+	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2315,6 +2352,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
+		struct vfio_iommu_type1_cache_invalidate ustruct;
+		int ret;
+
+		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
+				    info);
+
+		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (ustruct.argsz < minsz || ustruct.flags)
+			return -EINVAL;
+
+		mutex_lock(&iommu->lock);
+		ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn,
+					    &ustruct);
+		mutex_unlock(&iommu->lock);
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a1..ccf60a2 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -794,6 +794,19 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 24,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct iommu_cache_invalidate_info info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



* [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-10-24 12:26 [RFC v2 0/3] vfio: support Shared Virtual Addressing Liu Yi L
  2019-10-24 12:26 ` [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE Liu Yi L
@ 2019-10-24 12:26 ` Liu Yi L
  2019-10-25 10:06   ` Tian, Kevin
  2019-11-05 23:35   ` Alex Williamson
  2019-10-24 12:26 ` [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host Liu Yi L
  2019-10-25  8:59 ` [RFC v2 0/3] vfio: support Shared Virtual Addressing Tian, Kevin
  3 siblings, 2 replies; 32+ messages in thread
From: Liu Yi L @ 2019-10-24 12:26 UTC
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe.brucker, peterx, iommu, kvm

This patch adds the VFIO_IOMMU_PASID_REQUEST ioctl, which aims
to pass down PASID allocation/free requests from the virtual
IOMMU. This is required so that PASIDs are managed system-wide.
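
For illustration only (not part of the patch): a toy model of the
ownership rule described in this patch, where a PASID can be freed only by
the owner that allocated it and only while nothing is bound to it. The
demo_* names and the fixed-size table are hypothetical; the patch itself
relies on ioasid_alloc()/ioasid_find()/ioasid_free() and uses the mm as
the owner token. Unlike the patch, which returns -EPERM for every lookup
failure, this sketch returns -ENOENT for an unallocated PASID.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_MAX_PASID 16

/* One slot per PASID: who owns it and whether something is bound to it. */
struct demo_pasid_slot {
	const void *owner;	/* NULL means free; stands in for the mm */
	void *priv;		/* non-NULL while a bind is outstanding */
};

static struct demo_pasid_slot demo_table[DEMO_MAX_PASID];

/* Allocate the lowest free PASID in [min, max] on behalf of owner. */
static int demo_pasid_alloc(const void *owner, int min, int max)
{
	if (!owner || min < 0 || max >= DEMO_MAX_PASID || min > max)
		return -EINVAL;

	for (int p = min; p <= max; p++) {
		if (!demo_table[p].owner) {
			demo_table[p].owner = owner;
			demo_table[p].priv = NULL;
			return p;
		}
	}
	return -ENOSPC;
}

/* Free succeeds only for the owner and only when nothing is bound. */
static int demo_pasid_free(const void *owner, int pasid)
{
	if (pasid < 0 || pasid >= DEMO_MAX_PASID || !demo_table[pasid].owner)
		return -ENOENT;
	if (demo_table[pasid].owner != owner)
		return -EPERM;
	if (demo_table[pasid].priv)
		return -EBUSY;

	demo_table[pasid].owner = NULL;
	return 0;
}
```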

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  25 +++++++++
 2 files changed, 139 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cd8d3a5..3d73a7d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
 	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
 }
 
+static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
+					 int min_pasid,
+					 int max_pasid)
+{
+	int ret;
+	ioasid_t pasid;
+	struct mm_struct *mm = NULL;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+	mm = get_task_mm(current);
+	/* Track ioasid allocation owner by mm */
+	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
+				max_pasid, NULL);
+	if (pasid == INVALID_IOASID) {
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+	ret = pasid;
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	if (mm)
+		mmput(mm);
+	return ret;
+}
+
+static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
+				       unsigned int pasid)
+{
+	struct mm_struct *mm = NULL;
+	void *pdata;
+	int ret = 0;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/**
+	 * REVISIT:
+	 * There are two cases in which free could fail:
+	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
+	 * the set does not match, caller is not permitted to free.
+	 * 2. free before unbinding all devices, we can check the ioasid
+	 * private data, if data != NULL, then fail to free.
+	 */
+	mm = get_task_mm(current);
+	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
+	if (IS_ERR(pdata)) {
+		if (pdata == ERR_PTR(-ENOENT))
+			pr_err("PASID %u is not allocated\n", pasid);
+		else if (pdata == ERR_PTR(-EACCES))
+			pr_err("Free PASID %u by non-owner, denied\n", pasid);
+		else
+			pr_err("Error searching PASID %u\n", pasid);
+		ret = -EPERM;
+		goto out_unlock;
+	}
+	if (pdata) {
+		pr_debug("Cannot free pasid %u with private data\n", pasid);
+		/* Expect PASID has no private data if not bound */
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+	ioasid_free(pasid);
+
+out_unlock:
+	if (mm)
+		mmput(mm);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2370,6 +2447,43 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 					    &ustruct);
 		mutex_unlock(&iommu->lock);
 		return ret;
+
+	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
+		struct vfio_iommu_type1_pasid_request req;
+		int min_pasid, max_pasid, pasid;
+
+		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
+				    flag);
+
+		if (copy_from_user(&req, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (req.argsz < minsz)
+			return -EINVAL;
+
+		switch (req.flag) {
+		/**
+		 * TODO: min_pasid and max_pasid align with
+		 * typedef unsigned int ioasid_t
+		 */
+		case VFIO_IOMMU_PASID_ALLOC:
+			if (copy_from_user(&min_pasid,
+				(void __user *)arg + minsz, sizeof(min_pasid)))
+				return -EFAULT;
+			if (copy_from_user(&max_pasid,
+				(void __user *)arg + minsz + sizeof(min_pasid),
+				sizeof(max_pasid)))
+				return -EFAULT;
+			return vfio_iommu_type1_pasid_alloc(iommu,
+						min_pasid, max_pasid);
+		case VFIO_IOMMU_PASID_FREE:
+			if (copy_from_user(&pasid,
+				(void __user *)arg + minsz, sizeof(pasid)))
+				return -EFAULT;
+			return vfio_iommu_type1_pasid_free(iommu, pasid);
+		default:
+			return -EINVAL;
+		}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ccf60a2..04de290 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -807,6 +807,31 @@ struct vfio_iommu_type1_cache_invalidate {
 };
 #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
 
+/*
+ * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @min_pasid and @max_pasid fields
+ * @flag=VFIO_IOMMU_PASID_FREE, refer to @pasid field
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
+#define VFIO_IOMMU_PASID_FREE	(1 << 1)
+	__u32	flag;
+	union {
+		struct {
+			int min_pasid;
+			int max_pasid;
+		};
+		int pasid;
+	};
+};
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 27,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ */
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



* [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-10-24 12:26 [RFC v2 0/3] vfio: support Shared Virtual Addressing Liu Yi L
  2019-10-24 12:26 ` [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE Liu Yi L
  2019-10-24 12:26 ` [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu Yi L
@ 2019-10-24 12:26 ` Liu Yi L
  2019-11-07 23:20   ` Alex Williamson
  2019-10-25  8:59 ` [RFC v2 0/3] vfio: support Shared Virtual Addressing Tian, Kevin
  3 siblings, 1 reply; 32+ messages in thread
From: Liu Yi L @ 2019-10-24 12:26 UTC
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe.brucker, peterx, iommu, kvm

This patch adds VFIO support for binding a guest translation structure
to the host IOMMU. VFIO exposes IOMMU programming capability to user
space, and under KVM a guest is a user-space application on the host.
For SVA usage in a virtual machine, the guest owns the GVA->GPA
translation structure, which must be passed down to the host to enable
nested (two-stage) translation. This patch reuses the VFIO_IOMMU_BIND
proposal from Jean-Philippe Brucker and adds a new bind type for
binding a guest-owned translation structure to the host.

*) Add two new ioctls for VFIO containers.

  - VFIO_IOMMU_BIND: bind request from userspace. It can either
                   bind a process to a pasid or bind a guest pasid
                   to a device, as indicated by the bind type.
  - VFIO_IOMMU_UNBIND: unbind request from userspace. It can either
                   unbind a process from a pasid or unbind a guest
                   pasid from a device, also indicated by the bind type.
  - Bind type:
	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
                   to a device
	VFIO_IOMMU_BIND_GUEST_PASID: bind a guest-owned translation
                   structure, e.g. a guest page table, to the host iommu

*) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
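
For illustration only (not part of the patch): a minimal sketch of the
"undo on partial failure" idea used by the bind path, where a failed bind
on one device triggers an unbind across all devices so that no partial
binding is left behind. The demo_* names and the fail_bind switch are
hypothetical; the patch does this through iommu_sva_bind_gpasid() and
iommu_sva_unbind_gpasid().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device model: bind fails when fail_bind is set. */
struct demo_dev {
	int bound;
	int fail_bind;
};

static int demo_bind_one(struct demo_dev *dev)
{
	if (dev->fail_bind)
		return -EIO;
	dev->bound = 1;
	return 0;
}

static void demo_unbind_one(struct demo_dev *dev)
{
	dev->bound = 0;
}

/*
 * Bind every device in order. If any bind fails, unbind all devices,
 * including those that were never bound, and report the first error.
 */
static int demo_bind_all(struct demo_dev *devs, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		ret = demo_bind_one(&devs[i]);
		if (ret)
			break;
	}

	if (ret) {
		for (size_t i = 0; i < n; i++)
			demo_unbind_one(&devs[i]);
	}
	return ret;
}
```

Unbinding every device rather than only those already bound keeps the
rollback simple, at the cost of requiring that unbind be harmless on a
device that was never bound.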

Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 136 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  44 +++++++++++++
 2 files changed, 180 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3d73a7d..1a27e25 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2325,6 +2325,104 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_bind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct iommu_gpasid_bind_data *ustruct =
+		(struct iommu_gpasid_bind_data *) dc->data;
+
+	return iommu_sva_bind_gpasid(dc->domain, dev, ustruct);
+}
+
+static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct iommu_gpasid_bind_data *ustruct =
+		(struct iommu_gpasid_bind_data *) dc->data;
+
+	return iommu_sva_unbind_gpasid(dc->domain, dev,
+						ustruct->hpasid);
+}
+
+/*
+ * Unbind a specific gpasid. The caller of this function must hold
+ * vfio_iommu->lock.
+ */
+static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
+		  struct iommu_gpasid_bind_data *gbind_data)
+{
+	return vfio_iommu_lookup_dev(iommu, vfio_unbind_gpasid_fn, gbind_data);
+}
+
+static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
+					    void __user *arg,
+					    struct vfio_iommu_type1_bind *bind)
+{
+	struct iommu_gpasid_bind_data gbind_data;
+	unsigned long minsz;
+	int ret = 0;
+
+	minsz = sizeof(*bind) + sizeof(gbind_data);
+	if (bind->argsz < minsz)
+		return -EINVAL;
+
+	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
+		return -EFAULT;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = vfio_iommu_lookup_dev(iommu, vfio_bind_gpasid_fn, &gbind_data);
+	/*
+	 * If bind failed, it may not be a total failure. Some devices within
+	 * the iommu group may have been bound successfully. Although we don't
+	 * enable pasid capability for non-singleton iommu groups, an unbind
+	 * operation helps ensure no partial binding for an iommu group.
+	 */
+	if (ret)
+		/*
+		 * Undo all binds that already succeeded, no need to check the
+		 * return value here since some device within the group has no
+		 * successful bind when we get here.
+		 */
+		vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
+					    void __user *arg,
+					    struct vfio_iommu_type1_bind *bind)
+{
+	struct iommu_gpasid_bind_data gbind_data;
+	unsigned long minsz;
+	int ret = 0;
+
+	minsz = sizeof(*bind) + sizeof(gbind_data);
+	if (bind->argsz < minsz)
+		return -EINVAL;
+
+	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
+		return -EFAULT;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2484,6 +2582,44 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		default:
 			return -EINVAL;
 		}
+
+	} else if (cmd == VFIO_IOMMU_BIND) {
+		struct vfio_iommu_type1_bind bind;
+
+		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
+
+		if (copy_from_user(&bind, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (bind.argsz < minsz)
+			return -EINVAL;
+
+		switch (bind.bind_type) {
+		case VFIO_IOMMU_BIND_GUEST_PASID:
+			return vfio_iommu_type1_bind_gpasid(iommu,
+					(void __user *)(arg + minsz), &bind);
+		default:
+			return -EINVAL;
+		}
+
+	} else if (cmd == VFIO_IOMMU_UNBIND) {
+		struct vfio_iommu_type1_bind bind;
+
+		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
+
+		if (copy_from_user(&bind, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (bind.argsz < minsz)
+			return -EINVAL;
+
+		switch (bind.bind_type) {
+		case VFIO_IOMMU_BIND_GUEST_PASID:
+			return vfio_iommu_type1_unbind_gpasid(iommu,
+					(void __user *)(arg + minsz), &bind);
+		default:
+			return -EINVAL;
+		}
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 04de290..78e8c64 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -832,6 +832,50 @@ struct vfio_iommu_type1_pasid_request {
  */
 #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
 
+enum vfio_iommu_bind_type {
+	VFIO_IOMMU_BIND_PROCESS,
+	VFIO_IOMMU_BIND_GUEST_PASID,
+};
+
+/*
+ * Supported types:
+ *	- VFIO_IOMMU_BIND_GUEST_PASID: bind guest pasid, which is invoked
+ *			by guest; it takes iommu_gpasid_bind_data in data.
+ */
+struct vfio_iommu_type1_bind {
+	__u32				argsz;
+	enum vfio_iommu_bind_type	bind_type;
+	__u8				data[];
+};
+
+/*
+ * VFIO_IOMMU_BIND - _IOWR(VFIO_TYPE, VFIO_BASE + 28, struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * non-PASID traffic and BIND controls PASID traffic. But this depends on the
+ * underlying IOMMU architecture and isn't guaranteed.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 28)
+
+/*
+ * VFIO_IOMMU_UNBIND - _IOWR(VFIO_TYPE, VFIO_BASE + 29, struct vfio_iommu_type1_bind)
+ *
+ * Undo what was done by the corresponding VFIO_IOMMU_BIND ioctl.
+ */
+#define VFIO_IOMMU_UNBIND	_IO(VFIO_TYPE, VFIO_BASE + 29)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



* RE: [RFC v2 0/3] vfio: support Shared Virtual Addressing
  2019-10-24 12:26 [RFC v2 0/3] vfio: support Shared Virtual Addressing Liu Yi L
                   ` (2 preceding siblings ...)
  2019-10-24 12:26 ` [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host Liu Yi L
@ 2019-10-25  8:59 ` Tian, Kevin
  2019-10-25 11:18   ` Liu, Yi L
  3 siblings, 1 reply; 32+ messages in thread
From: Tian, Kevin @ 2019-10-25  8:59 UTC
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: Raj, Ashok, kvm, jean-philippe.brucker, Tian, Jun J, iommu, Sun, Yi Y

> From: Liu Yi L
> Sent: Thursday, October 24, 2019 8:26 PM
> 
> Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on
> Intel platforms, allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> This series is intended to expose SVA capability to VMs, i.e. to share a
> guest application's address space with passthrough devices. The whole SVA
> virtualization requires QEMU/VFIO/IOMMU changes. This series includes the
> VFIO changes; the QEMU and IOMMU changes are in separate series (listed
> under "Related series").
> 
> The high-level architecture for SVA virtualization is as below:
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> This patchset has roughly three parts, which correspond to the basic
> vSVA support for PCI device assignment:
>  1. vfio support for PASID allocation and free from VMs
>  2. vfio support for guest PASID binding from VMs
>  3. vfio support for IOMMU cache invalidation from VMs
> 
> The complete vSVA upstream patches are divided into three phases:
>     1. Common APIs and PCI device direct assignment
>     2. Page Request Services (PRS) support
>     3. Mediated device assignment
> 
> This RFC patchset targets phase 1 and works together with the VT-d
> driver changes[1] and the QEMU changes[2]. The complete set for vSVA can
> be found at:
> https://github.com/jacobpan/linux.git:siov_sva.
> 
> This patchset does not include the patch that exposes the PASID capability
> to the guest. That is expected to be in another series.
> 
> Related series:
> [1] [PATCH v6 00/10] Nested Shared Virtual Address (SVA) VT-d support:
> https://lkml.org/lkml/2019/10/22/953
> <This series is based on this kernel series from Jacob Pan>
> 
> [2] [RFC v2 00/20] intel_iommu: expose Shared Virtual Addressing to VM
> from Yi Liu

there is no link, and should be [RFC v2 00/22]

> 
> Changelog:
> 	- RFC v1 -> v2:
> 	  Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.
> 	  RFC v1: https://patchwork.kernel.org/cover/11033699/
> 
> Liu Yi L (3):
>   vfio: VFIO_IOMMU_CACHE_INVALIDATE
>   vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
>   vfio/type1: bind guest pasid (guest page tables) to host
> 
>  drivers/vfio/vfio_iommu_type1.c | 305 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  82 +++++++++++
>  2 files changed, 387 insertions(+)
> 
> --
> 2.7.4


* RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
  2019-10-24 12:26 ` [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE Liu Yi L
@ 2019-10-25  9:14   ` Tian, Kevin
  2019-10-25 11:20     ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Tian, Kevin @ 2019-10-25  9:14 UTC
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe.brucker, peterx, iommu, kvm

> From: Liu, Yi L
> Sent: Thursday, October 24, 2019 8:26 PM
> 
> From: Liu Yi L <yi.l.liu@linux.intel.com>
> 
> When the guest "owns" the stage 1 translation structures, the host
> IOMMU driver has no knowledge of caching structure updates unless
> the guest invalidation requests are trapped and passed down to the
> host.
> 
> This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl, which aims
> to propagate guest stage 1 IOMMU cache invalidations to the host.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       | 13 ++++++++++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 96fddc1d..cd8d3a5 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -124,6 +124,34 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>  					(!list_empty(&iommu->domain_list))
> 
> +struct domain_capsule {
> +	struct iommu_domain *domain;
> +	void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int
> +vfio_iommu_lookup_dev(struct vfio_iommu *iommu,
> +		      int (*fn)(struct device *dev, void *data),
> +		      void *data)

'lookup' usually means finding a device and then returning. But
the real purpose here is to loop over all the devices within this
container and then do something. Does it make more sense to call
it vfio_iommu_for_each_dev?
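
For illustration only (not code from the patch): a small sketch of what a
"for each" style helper over nested containers could look like, calling a
callback on every device and stopping at the first non-zero return. The
demo_* types and names are hypothetical stand-ins for the domain, group
and device lists. Returning directly from the innermost loop leaves all
loop levels at once, which a single break would not.

```c
#include <assert.h>
#include <stddef.h>

struct demo_group {
	const int *devs;
	size_t ndevs;
};

struct demo_domain {
	const struct demo_group *groups;
	size_t ngroups;
};

typedef int (*demo_dev_fn)(int dev, void *data);

/* Call fn on every device; stop and return at the first non-zero result. */
static int demo_for_each_dev(const struct demo_domain *doms, size_t ndoms,
			     demo_dev_fn fn, void *data)
{
	for (size_t d = 0; d < ndoms; d++) {
		for (size_t g = 0; g < doms[d].ngroups; g++) {
			const struct demo_group *grp = &doms[d].groups[g];

			for (size_t i = 0; i < grp->ndevs; i++) {
				int ret = fn(grp->devs[i], data);

				if (ret)
					return ret;
			}
		}
	}
	return 0;
}

/* Example callback: count visits and fail on one chosen device id. */
struct demo_visit {
	int visited;
	int stop_at;
};

static int demo_visit_fn(int dev, void *data)
{
	struct demo_visit *v = data;

	v->visited++;
	return dev == v->stop_at ? -1 : 0;
}
```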

> +{
> +	struct domain_capsule dc = {.data = data};
> +	struct vfio_domain *d;
> +	struct vfio_group *g;
> +	int ret = 0;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		dc.domain = d->domain;
> +		list_for_each_entry(g, &d->group_list, next) {
> +			ret = iommu_group_for_each_dev(g->iommu_group,
> +						       &dc, fn);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
> 
>  /*
> @@ -2211,6 +2239,15 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
>  	return ret;
>  }
> 
> +static int vfio_cache_inv_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct vfio_iommu_type1_cache_invalidate *ustruct =
> +		(struct vfio_iommu_type1_cache_invalidate *)dc->data;
> +
> +	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2315,6 +2352,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> 
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> +		struct vfio_iommu_type1_cache_invalidate ustruct;

it's weird to call a variable as struct.

> +		int ret;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
> +				    info);
> +
> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (ustruct.argsz < minsz || ustruct.flags)
> +			return -EINVAL;
> +
> +		mutex_lock(&iommu->lock);
> +		ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn,
> +					    &ustruct);
> +		mutex_unlock(&iommu->lock);
> +		return ret;
>  	}
> 
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a1..ccf60a2 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -794,6 +794,19 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/**
> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 24,
> + *			struct vfio_iommu_type1_cache_invalidate)
> + *
> + * Propagate guest IOMMU cache invalidation to the host.

guest or first-level/stage-1? Ideally userspace application may also
bind its own address space as stage-1 one day...

> + */
> +struct vfio_iommu_type1_cache_invalidate {
> +	__u32   argsz;
> +	__u32   flags;
> +	struct iommu_cache_invalidate_info info;
> +};
> +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> 
>  /*
> --
> 2.7.4



* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-10-24 12:26 ` [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu Yi L
@ 2019-10-25 10:06   ` Tian, Kevin
  2019-10-25 11:16     ` Liu, Yi L
  2019-11-05 23:35   ` Alex Williamson
  1 sibling, 1 reply; 32+ messages in thread
From: Tian, Kevin @ 2019-10-25 10:06 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Liu, Yi L, Tian, Jun J, Sun,
	Yi Y, jean-philippe.brucker, peterx, iommu, kvm

> From: Liu Yi L
> Sent: Thursday, October 24, 2019 8:26 PM
> 
> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> to passdown PASID allocation/free request from the virtual
> iommu. This is required to get PASID managed in system-wide.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 114
> ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  25 +++++++++
>  2 files changed, 139 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index cd8d3a5..3d73a7d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev,
> void *data)
>  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
>  }
> 
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> +					 int min_pasid,
> +					 int max_pasid)
> +{
> +	int ret;
> +	ioasid_t pasid;
> +	struct mm_struct *mm = NULL;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +	mm = get_task_mm(current);
> +	/* Track ioasid allocation owner by mm */

The code below is purely allocation. Where does 'track' come into play?

> +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> +				max_pasid, NULL);
> +	if (pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto out_unlock;
> +	}
> +	ret = pasid;
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	if (mm)
> +		mmput(mm);
> +	return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> +				       unsigned int pasid)
> +{
> +	struct mm_struct *mm = NULL;
> +	void *pdata;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/**
> +	 * REVISIT:
> +	 * There are two cases free could fail:
> +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> +	 * the set does not match, caller is not permitted to free.
> +	 * 2. free before unbind all devices, we can check if ioasid private
> +	 * data, if data != NULL, then fail to free.
> +	 */

Does REVISIT mean that above comment is the right way but
the code doesn't follow yet, or the comment itself should be
revisited?

should we have some notification mechanism, so the guy
who holds the reference to the pasid can be notified to
release its usage?

> +	mm = get_task_mm(current);
> +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> +	if (IS_ERR(pdata)) {
> +		if (pdata == ERR_PTR(-ENOENT))
> +			pr_err("PASID %u is not allocated\n", pasid);
> +		else if (pdata == ERR_PTR(-EACCES))
> +			pr_err("Free PASID %u by non-owner, denied", pasid);
> +		else
> +			pr_err("Error searching PASID %u\n", pasid);
> +		ret = -EPERM;
> +		goto out_unlock;
> +	}
> +	if (pdata) {
> +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> +		/* Expect PASID has no private data if not bond */
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +	ioasid_free(pasid);
> +
> +out_unlock:
> +	if (mm)
> +		mmput(mm);
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2370,6 +2447,43 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>  					    &ustruct);
>  		mutex_unlock(&iommu->lock);
>  		return ret;
> +
> +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> +		struct vfio_iommu_type1_pasid_request req;
> +		int min_pasid, max_pasid, pasid;
> +
> +		minsz = offsetofend(struct
> vfio_iommu_type1_pasid_request,
> +				    flag);
> +
> +		if (copy_from_user(&req, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (req.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (req.flag) {
> +		/**
> +		 * TODO: min_pasid and max_pasid align with
> +		 * typedef unsigned int ioasid_t
> +		 */
> +		case VFIO_IOMMU_PASID_ALLOC:
> +			if (copy_from_user(&min_pasid,
> +				(void __user *)arg + minsz,
> sizeof(min_pasid)))
> +				return -EFAULT;
> +			if (copy_from_user(&max_pasid,
> +				(void __user *)arg + minsz +
> sizeof(min_pasid),
> +				sizeof(max_pasid)))
> +				return -EFAULT;
> +			return vfio_iommu_type1_pasid_alloc(iommu,
> +						min_pasid, max_pasid);
> +		case VFIO_IOMMU_PASID_FREE:
> +			if (copy_from_user(&pasid,
> +				(void __user *)arg + minsz, sizeof(pasid)))
> +				return -EFAULT;
> +			return vfio_iommu_type1_pasid_free(iommu,
> pasid);
> +		default:
> +			return -EINVAL;
> +		}
>  	}
> 
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ccf60a2..04de290 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -807,6 +807,31 @@ struct vfio_iommu_type1_cache_invalidate {
>  };
>  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> 
> +/*
> + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @min_pasid and
> @max_pasid fields
> + * @flag=VFIO_IOMMU_PASID_FREE, refer to @pasid field
> + */
> +struct vfio_iommu_type1_pasid_request {
> +	__u32	argsz;
> +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> +	__u32	flag;
> +	union {
> +		struct {
> +			int min_pasid;
> +			int max_pasid;
> +		};
> +		int pasid;
> +	};
> +};
> +
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 27,
> + *				struct vfio_iommu_type1_pasid_request)
> + *
> + */
> +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU --------
> */
> 
>  /*
> --
> 2.7.4



* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-10-25 10:06   ` Tian, Kevin
@ 2019-10-25 11:16     ` Liu, Yi L
  0 siblings, 0 replies; 32+ messages in thread
From: Liu, Yi L @ 2019-10-25 11:16 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe.brucker, peterx, iommu, kvm

Hi Kevin,

> From: Tian, Kevin
> Sent: Friday, October 25, 2019 6:06 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Liu Yi L
> > Sent: Thursday, October 24, 2019 8:26 PM
> >
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > PASID allocation/free request from the virtual iommu. This is required
> > to get PASID managed in system-wide.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 114
> > ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  25 +++++++++
> >  2 files changed, 139 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index cd8d3a5..3d73a7d 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> >  }
> >
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > +					 int min_pasid,
> > +					 int max_pasid)
> > +{
> > +	int ret;
> > +	ioasid_t pasid;
> > +	struct mm_struct *mm = NULL;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +	mm = get_task_mm(current);
> > +	/* Track ioasid allocation owner by mm */
> below is purely allocation. Where does 'track' come to play?

The ioasid_set works as a kind of owner tracking. Since allocation is separate
from bind, recording the "owner" here allows a sanity check when a PASID bind
request comes in.
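
To make the idea concrete, here is a minimal userspace sketch (not the kernel
code) of recording an owner token at allocation time and checking it at bind
time. The names pasid_entry, pasid_table, pasid_alloc and pasid_bind_allowed
are hypothetical and only illustrate the principle; the real ioasid code is
organised differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ioasid entry; not the kernel's layout. */
struct pasid_entry {
	unsigned int pasid;
	const void *owner;	/* opaque owner token, e.g. the allocating mm */
	bool in_use;
};

#define NR_PASIDS 8
static struct pasid_entry pasid_table[NR_PASIDS];

/* Allocate the first free slot in [min, max] and record its owner. */
static int pasid_alloc(const void *owner, unsigned int min, unsigned int max)
{
	for (unsigned int i = min; i <= max && i < NR_PASIDS; i++) {
		if (!pasid_table[i].in_use) {
			pasid_table[i] = (struct pasid_entry){ i, owner, true };
			return (int)i;
		}
	}
	return -1;	/* no free PASID in the requested range */
}

/* Sanity check at bind time: only the recorded owner may bind. */
static bool pasid_bind_allowed(const void *owner, unsigned int pasid)
{
	return pasid < NR_PASIDS && pasid_table[pasid].in_use &&
	       pasid_table[pasid].owner == owner;
}
```

With this shape, a bind request from a different owner, or for a PASID that
was never allocated, is simply rejected.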

> > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > +				max_pasid, NULL);
> > +	if (pasid == INVALID_IOASID) {
> > +		ret = -ENOSPC;
> > +		goto out_unlock;
> > +	}
> > +	ret = pasid;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	if (mm)
> > +		mmput(mm);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > +				       unsigned int pasid)
> > +{
> > +	struct mm_struct *mm = NULL;
> > +	void *pdata;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	/**
> > +	 * REVISIT:
> > +	 * There are two cases free could fail:
> > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > +	 * the set does not match, caller is not permitted to free.
> > +	 * 2. free before unbind all devices, we can check if ioasid private
> > +	 * data, if data != NULL, then fail to free.
> > +	 */
> 
> Does REVISIT mean that above comment is the right way but the code doesn't follow
> yet, or the comment itself should be revisited?

Sorry, that's a mistake and it should be removed. It was added during development
as a reminder. Will remove it.

> 
> should we have some notification mechanism, so the guy who holds the reference to
> the pasid can be notified to release its usage?

Do you mean that ioasid itself should provide such a notification mechanism?

Currently, we prevent a PASID from being freed before all users (iommu driver,
guest) have released their usage. This is achieved by checking the private data,
which holds a user_cnt for the PASID, e.g. struct intel_svm. The first guest PASID
bind allocates the private data. A second guest PASID bind increments user_cnt.
A guest PASID unbind decrements user_cnt. The private data is freed by the last
guest PASID unbind. Do you think that is sufficient? Or would we want a notification
mechanism that allows such a PASID free and keeps the users updated?
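
For illustration, the counting scheme described above could be sketched in
userspace C as follows. This is only a sketch under assumptions: pasid_priv,
pasid_slot, pasid_bind, pasid_unbind and pasid_free are hypothetical names,
and the real private data (e.g. struct intel_svm) carries much more state.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical per-PASID private data holding only the user count. */
struct pasid_priv {
	int user_cnt;
};

struct pasid_slot {
	struct pasid_priv *priv;	/* NULL when nothing is bound */
};

/* First bind allocates the private data; later binds only count up. */
static int pasid_bind(struct pasid_slot *slot)
{
	if (!slot->priv) {
		slot->priv = calloc(1, sizeof(*slot->priv));
		if (!slot->priv)
			return -ENOMEM;
	}
	slot->priv->user_cnt++;
	return 0;
}

/* The last unbind releases the private data. */
static void pasid_unbind(struct pasid_slot *slot)
{
	if (slot->priv && --slot->priv->user_cnt == 0) {
		free(slot->priv);
		slot->priv = NULL;
	}
}

/* Free is refused while any bind still holds the private data. */
static int pasid_free(struct pasid_slot *slot)
{
	return slot->priv ? -EBUSY : 0;
}
```

In this model a free attempted between the first bind and the last unbind
returns -EBUSY, which matches the "private data != NULL" check in the patch.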

Thanks,
Yi Liu


* RE: [RFC v2 0/3] vfio: support Shared Virtual Addressing
  2019-10-25  8:59 ` [RFC v2 0/3] vfio: support Shared Virtual Addressing Tian, Kevin
@ 2019-10-25 11:18   ` Liu, Yi L
  0 siblings, 0 replies; 32+ messages in thread
From: Liu, Yi L @ 2019-10-25 11:18 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: Raj, Ashok, kvm, jean-philippe.brucker, Tian, Jun J, iommu, Sun, Yi Y

Hi Kevin,

> From: Tian, Kevin
> Sent: Friday, October 25, 2019 4:59 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [RFC v2 0/3] vfio: support Shared Virtual Addressing
> 
> > From: Liu Yi L
> > Sent: Thursday, October 24, 2019 8:26 PM
> >
> > Shared virtual address (SVA), a.k.a, Shared virtual memory (SVM) on
> > Intel platforms allow address space sharing between device DMA and
> > applications.
> > SVA can reduce programming complexity and enhance security.
> > This series is intended to expose SVA capability to VMs. i.e. shared
> > guest application address space with passthru devices. The whole SVA
> > virtualization requires QEMU/VFIO/IOMMU changes. This series includes
> > the VFIO changes, for QEMU and IOMMU changes, they are in separate
> > series (listed in the "Related series").
> >
> > The high-level architecture for SVA virtualization is as below:
> >
[...]
> >
> > Related series:
> > [1] [PATCH v6 00/10] Nested Shared Virtual Address (SVA) VT-d support:
> > https://lkml.org/lkml/2019/10/22/953
> > <This series is based on this kernel series from Jacob Pan>
> >
> > [2] [RFC v2 00/20] intel_iommu: expose Shared Virtual Addressing to VM
> > from Yi Liu
> 
> there is no link, and should be [RFC v2 00/22]

The link had not been generated when this series was sent out. Yeah, it should be
[RFC v2 00/22]. Thanks for spotting it.

Regards,
Yi Liu


* RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
  2019-10-25  9:14   ` Tian, Kevin
@ 2019-10-25 11:20     ` Liu, Yi L
  2019-11-05 22:42       ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-10-25 11:20 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe.brucker, peterx, iommu, kvm

Hi Kevin,

> From: Tian, Kevin
> Sent: Friday, October 25, 2019 5:14 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
> 
> > From: Liu, Yi L
> > Sent: Thursday, October 24, 2019 8:26 PM
> >
> > From: Liu Yi L <yi.l.liu@linux.intel.com>
> >
> > When the guest "owns" the stage 1 translation structures,  the host
> > IOMMU driver has no knowledge of caching structure updates unless the
> > guest invalidation requests are trapped and passed down to the host.
> >
> > This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims at
> > propagating guest stage1 IOMMU cache invalidations to the host.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 55
> > +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       | 13 ++++++++++
> >  2 files changed, 68 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 96fddc1d..cd8d3a5 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -124,6 +124,34 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> >  					(!list_empty(&iommu->domain_list))
> >
> > +struct domain_capsule {
> > +	struct iommu_domain *domain;
> > +	void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int
> > +vfio_iommu_lookup_dev(struct vfio_iommu *iommu,
> > +		      int (*fn)(struct device *dev, void *data),
> > +		      void *data)
> 
> 'lookup' usually means find a device and then return. But the real purpose here is to
> loop all the devices within this container and then do something. Does it make more
> sense to be vfio_iommu_for_each_dev?

yep, I can replace it.
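
As a rough illustration of the "for each device" semantics discussed here,
the following userspace sketch walks every device in a container and applies
a callback. The structures dev, group and container and the helper names are
hypothetical simplifications, and stopping at the first non-zero return is an
assumption, since the body of the real helper is elided above.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the vfio container structures. */
struct dev { int id; };
struct group { struct dev *devs; size_t nr_devs; };
struct container { struct group *groups; size_t nr_groups; };

/*
 * Call fn on every device in the container. Stops at the first non-zero
 * return value and propagates it to the caller (an assumption here).
 */
static int container_for_each_dev(struct container *c,
				  int (*fn)(struct dev *dev, void *data),
				  void *data)
{
	for (size_t g = 0; g < c->nr_groups; g++) {
		for (size_t d = 0; d < c->groups[g].nr_devs; d++) {
			int ret = fn(&c->groups[g].devs[d], data);

			if (ret)
				return ret;
		}
	}
	return 0;
}

/* Example callback: add each device id to the int that data points to. */
static int sum_ids(struct dev *dev, void *data)
{
	*(int *)data += dev->id;
	return 0;
}
```

Nothing is looked up or returned to the caller apart from the status code,
which is why a "for_each" name describes the helper better than "lookup".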

> 
> > +{
> > +	struct domain_capsule dc = {.data = data};
> > +	struct vfio_domain *d;
[...]
> > @@ -2315,6 +2352,24 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >
> >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >  			-EFAULT : 0;
> > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > +		struct vfio_iommu_type1_cache_invalidate ustruct;
> 
> it's weird to call a variable as struct.

Will fix it.

> > +		int ret;
> > +
> > +		minsz = offsetofend(struct
> > vfio_iommu_type1_cache_invalidate,
> > +				    info);
> > +
> > +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (ustruct.argsz < minsz || ustruct.flags)
> > +			return -EINVAL;
> > +
> > +		mutex_lock(&iommu->lock);
> > +		ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn,
> > +					    &ustruct);
> > +		mutex_unlock(&iommu->lock);
> > +		return ret;
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..ccf60a2 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,19 @@ struct vfio_iommu_type1_dma_unmap {
> >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/**
> > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE +
> > 24,
> > + *			struct vfio_iommu_type1_cache_invalidate)
> > + *
> > + * Propagate guest IOMMU cache invalidation to the host.
> 
> guest or first-level/stage-1? Ideally userspace application may also bind its own
> address space as stage-1 one day...

Should be first-level/stage-1. Will correct it.

Thanks,
Yi Liu



* Re: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
  2019-10-25 11:20     ` Liu, Yi L
@ 2019-11-05 22:42       ` Alex Williamson
  2019-11-06  1:31         ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-11-05 22:42 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

On Fri, 25 Oct 2019 11:20:40 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Kevin,
> 
> > From: Tian, Kevin
> > Sent: Friday, October 25, 2019 5:14 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > Subject: RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
> >   
> > > From: Liu, Yi L
> > > Sent: Thursday, October 24, 2019 8:26 PM
> > >
> > > From: Liu Yi L <yi.l.liu@linux.intel.com>
> > >
> > > When the guest "owns" the stage 1 translation structures,  the host
> > > IOMMU driver has no knowledge of caching structure updates unless the
> > > guest invalidation requests are trapped and passed down to the host.
> > >
> > > This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims at
> > > propagating guest stage1 IOMMU cache invalidations to the host.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 55
> > > +++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       | 13 ++++++++++
> > >  2 files changed, 68 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c index 96fddc1d..cd8d3a5 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -124,6 +124,34 @@ struct vfio_regions {
> > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > >  					(!list_empty(&iommu->domain_list))
> > >
> > > +struct domain_capsule {
> > > +	struct iommu_domain *domain;
> > > +	void *data;
> > > +};
> > > +
> > > +/* iommu->lock must be held */
> > > +static int
> > > +vfio_iommu_lookup_dev(struct vfio_iommu *iommu,
> > > +		      int (*fn)(struct device *dev, void *data),
> > > +		      void *data)  
> > 
> > 'lookup' usually means find a device and then return. But the real purpose here is to
> > loop all the devices within this container and then do something. Does it make more
> > sense to be vfio_iommu_for_each_dev?  

+1
 
> yep, I can replace it.
> 
> >   
> > > +{
> > > +	struct domain_capsule dc = {.data = data};
> > > +	struct vfio_domain *d;  
> [...]
> > 2315,6 +2352,24 @@  
> > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >
> > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > >  			-EFAULT : 0;
> > > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > > +		struct vfio_iommu_type1_cache_invalidate ustruct;  
> > 
> > it's weird to call a variable as struct.  
> 
> Will fix it.
> 
> > > +		int ret;
> > > +
> > > +		minsz = offsetofend(struct
> > > vfio_iommu_type1_cache_invalidate,
> > > +				    info);
> > > +
> > > +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (ustruct.argsz < minsz || ustruct.flags)
> > > +			return -EINVAL;
> > > +
> > > +		mutex_lock(&iommu->lock);
> > > +		ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn,
> > > +					    &ustruct);
> > > +		mutex_unlock(&iommu->lock);
> > > +		return ret;
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 9e843a1..ccf60a2 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -794,6 +794,19 @@ struct vfio_iommu_type1_dma_unmap {
> > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > >
> > > +/**
> > > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE +
> > > 24,

What's going on with these ioctl numbers?  AFAICT[1] we've used up
through VFIO_BASE + 21, this jumps to 24, the next patch skips to 27,
then the last patch fills in 28 & 29.  Thanks,

Alex

[1] git grep -h VFIO_BASE | grep "VFIO_BASE +" | grep -e ^#define | \
    awk '{print $NF}' | tr -d ')' | sort -u -n



* Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-10-24 12:26 ` [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu Yi L
  2019-10-25 10:06   ` Tian, Kevin
@ 2019-11-05 23:35   ` Alex Williamson
  2019-11-06 13:27     ` Liu, Yi L
  1 sibling, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-11-05 23:35 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe.brucker, peterx, iommu, kvm

On Thu, 24 Oct 2019 08:26:22 -0400
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> to passdown PASID allocation/free request from the virtual
> iommu. This is required to get PASID managed in system-wide.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  25 +++++++++
>  2 files changed, 139 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index cd8d3a5..3d73a7d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
>  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
>  }
>  
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> +					 int min_pasid,
> +					 int max_pasid)
> +{
> +	int ret;
> +	ioasid_t pasid;
> +	struct mm_struct *mm = NULL;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +	mm = get_task_mm(current);
> +	/* Track ioasid allocation owner by mm */
> +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> +				max_pasid, NULL);

Are we sure we want to tie this to the task mm vs perhaps the
vfio_iommu pointer?

> +	if (pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto out_unlock;
> +	}
> +	ret = pasid;
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	if (mm)
> +		mmput(mm);
> +	return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> +				       unsigned int pasid)
> +{
> +	struct mm_struct *mm = NULL;
> +	void *pdata;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	/**
> +	 * REVISIT:
> +	 * There are two cases free could fail:
> +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> +	 * the set does not match, caller is not permitted to free.
> +	 * 2. free before unbind all devices, we can check if ioasid private
> +	 * data, if data != NULL, then fail to free.
> +	 */
> +	mm = get_task_mm(current);
> +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> +	if (IS_ERR(pdata)) {
> +		if (pdata == ERR_PTR(-ENOENT))
> +			pr_err("PASID %u is not allocated\n", pasid);
> +		else if (pdata == ERR_PTR(-EACCES))
> +			pr_err("Free PASID %u by non-owner, denied", pasid);
> +		else
> +			pr_err("Error searching PASID %u\n", pasid);

These messages should be removed; errno is sufficient for the user, and logging
here just gives the user a trivial DoS vector for filling the logs.

> +		ret = -EPERM;

But why not return PTR_ERR(pdata)?
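
A minimal userspace sketch of that shape, with no logging and the lookup's
error code passed straight through, might look like this. ERR_PTR, PTR_ERR
and IS_ERR are re-implemented here as simple stand-ins for the kernel
helpers, and pasid_free_check is a hypothetical name for the decision logic
only.

```c
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers (illustration only). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long err)
{
	return (void *)(intptr_t)err;
}

static inline long PTR_ERR(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int IS_ERR(const void *p)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical helper: map the lookup result to a return code, no logging. */
static int pasid_free_check(void *pdata)
{
	if (IS_ERR(pdata))
		return (int)PTR_ERR(pdata);	/* pass the lookup's errno through */
	if (pdata)
		return -EBUSY;			/* private data present: still bound */
	return 0;				/* nothing bound: safe to free */
}
```

The caller then sees -ENOENT or -EACCES directly rather than a blanket
-EPERM, and nothing is written to the kernel log on a user-triggered error.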

> +		goto out_unlock;
> +	}
> +	if (pdata) {
> +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> +		/* Expect PASID has no private data if not bond */
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
> +	ioasid_free(pasid);

We only ever get here with pdata == NULL?!  Something is wrong.  Should
that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
vector)

> +
> +out_unlock:
> +	if (mm)
> +		mmput(mm);
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2370,6 +2447,43 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  					    &ustruct);
>  		mutex_unlock(&iommu->lock);
>  		return ret;
> +
> +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> +		struct vfio_iommu_type1_pasid_request req;
> +		int min_pasid, max_pasid, pasid;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> +				    flag);
> +
> +		if (copy_from_user(&req, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (req.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (req.flag) {

This works, but it's strange.  Let's make the code a little easier for
the next flag bit that gets used so they don't need to rework this case
statement.  I'd suggest creating a VFIO_IOMMU_PASID_OPS_MASK that is
the OR of the ALLOC/FREE options, test that no bits are set outside of
that mask, then AND that mask as the switch arg with the code below.
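
Something along these lines, as a self-contained userspace sketch. The macro
names PASID_ALLOC, PASID_FREE and PASID_OPS_MASK and the pasid_request()
function are placeholders for illustration, not proposed uapi names, and the
DID_* return values just stand in for calling the real alloc/free paths.

```c
#include <errno.h>
#include <stdint.h>

#define PASID_ALLOC	(1u << 0)
#define PASID_FREE	(1u << 1)
/* Hypothetical mask: the OR of all operation bits. */
#define PASID_OPS_MASK	(PASID_ALLOC | PASID_FREE)

/* Placeholder results standing in for the real alloc/free handlers. */
enum { DID_ALLOC = 1, DID_FREE = 2 };

static int pasid_request(uint32_t flag)
{
	/* Reject any bit outside the known operations. */
	if (flag & ~PASID_OPS_MASK)
		return -EINVAL;

	switch (flag & PASID_OPS_MASK) {
	case PASID_ALLOC:
		return DID_ALLOC;
	case PASID_FREE:
		return DID_FREE;
	default:
		/* Neither operation, or both at once. */
		return -EINVAL;
	}
}
```

A later non-operation flag bit would then only need to be added to the
"unknown bits" test, without touching the switch itself.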

> +		/**
> +		 * TODO: min_pasid and max_pasid align with
> +		 * typedef unsigned int ioasid_t
> +		 */
> +		case VFIO_IOMMU_PASID_ALLOC:
> +			if (copy_from_user(&min_pasid,
> +				(void __user *)arg + minsz, sizeof(min_pasid)))
> +				return -EFAULT;
> +			if (copy_from_user(&max_pasid,
> +				(void __user *)arg + minsz + sizeof(min_pasid),
> +				sizeof(max_pasid)))
> +				return -EFAULT;
> +			return vfio_iommu_type1_pasid_alloc(iommu,
> +						min_pasid, max_pasid);
> +		case VFIO_IOMMU_PASID_FREE:
> +			if (copy_from_user(&pasid,
> +				(void __user *)arg + minsz, sizeof(pasid)))
> +				return -EFAULT;
> +			return vfio_iommu_type1_pasid_free(iommu, pasid);
> +		default:
> +			return -EINVAL;
> +		}
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ccf60a2..04de290 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -807,6 +807,31 @@ struct vfio_iommu_type1_cache_invalidate {
>  };
>  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
>  
> +/*
> + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @min_pasid and @max_pasid fields
> + * @flag=VFIO_IOMMU_PASID_FREE, refer to @pasid field
> + */
> +struct vfio_iommu_type1_pasid_request {
> +	__u32	argsz;
> +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> +	__u32	flag;
> +	union {
> +		struct {
> +			int min_pasid;
> +			int max_pasid;
> +		};
> +		int pasid;

Perhaps:

		struct {
			u32 min;
			u32 max;
		} alloc_pasid;
		u32 free_pasid;

(note also the s/int/u32/)
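
Spelled out as a compilable userspace sketch, the layout would be roughly as
below. uint32_t stands in for the kernel type (uapi headers normally spell it
__u32), and struct pasid_request and make_alloc_request() are hypothetical
names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the suggested layout; uint32_t stands in for the kernel's __u32. */
struct pasid_request {
	uint32_t argsz;
	uint32_t flag;
	union {
		struct {
			uint32_t min;
			uint32_t max;
		} alloc_pasid;
		uint32_t free_pasid;
	};
};

/* Build an alloc request for the range [min, max]; bit 0 means "alloc". */
static struct pasid_request make_alloc_request(uint32_t min, uint32_t max)
{
	struct pasid_request req = { .argsz = sizeof(req), .flag = 1u << 0 };

	req.alloc_pasid.min = min;
	req.alloc_pasid.max = max;
	return req;
}
```

Naming the members per operation makes it obvious at the call site which
part of the union a given flag uses, and unsigned fields match ioasid_t.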

> +	};
> +};
> +
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 27,
> + *				struct vfio_iommu_type1_pasid_request)
> + *
> + */
> +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*



* RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
  2019-11-05 22:42       ` Alex Williamson
@ 2019-11-06  1:31         ` Liu, Yi L
  2019-11-13  7:50           ` Auger Eric
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-06  1:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, November 6, 2019 6:42 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
> 
> On Fri, 25 Oct 2019 11:20:40 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Kevin,
> >
> > > From: Tian, Kevin
> > > Sent: Friday, October 25, 2019 5:14 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > Subject: RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
> > >
> > > > From: Liu, Yi L
> > > > Sent: Thursday, October 24, 2019 8:26 PM
> > > >
> > > > From: Liu Yi L <yi.l.liu@linux.intel.com>
> > > >
> > > > When the guest "owns" the stage 1 translation structures,  the
> > > > host IOMMU driver has no knowledge of caching structure updates
> > > > unless the guest invalidation requests are trapped and passed down to the host.
> > > >
> > > > This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims at
> > > > propagating guest stage1 IOMMU cache invalidations to the host.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > > > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 55
> > > > +++++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h       | 13 ++++++++++
> > > >  2 files changed, 68 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index 96fddc1d..cd8d3a5 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -124,6 +124,34 @@ struct vfio_regions {
> > > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > > >  					(!list_empty(&iommu->domain_list))
> > > >
> > > > +struct domain_capsule {
> > > > +	struct iommu_domain *domain;
> > > > +	void *data;
> > > > +};
> > > > +
> > > > +/* iommu->lock must be held */
> > > > +static int
> > > > +vfio_iommu_lookup_dev(struct vfio_iommu *iommu,
> > > > +		      int (*fn)(struct device *dev, void *data),
> > > > +		      void *data)
> > >
> > > 'lookup' usually means find a device and then return. But the real
> > > purpose here is to loop all the devices within this container and
> > > then do something. Does it make more sense to be vfio_iommu_for_each_dev?
> 
> +1
> 
> > yep, I can replace it.
> >
> > >
> > > > +{
> > > > +	struct domain_capsule dc = {.data = data};
> > > > +	struct vfio_domain *d;
> > [...]
> > > 2315,6 +2352,24 @@
> > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >
> > > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > >  			-EFAULT : 0;
> > > > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > > > +		struct vfio_iommu_type1_cache_invalidate ustruct;
> > >
> > > it's weird to call a variable as struct.
> >
> > Will fix it.
> >
> > > > +		int ret;
> > > > +
> > > > +		minsz = offsetofend(struct
> > > > vfio_iommu_type1_cache_invalidate,
> > > > +				    info);
> > > > +
> > > > +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (ustruct.argsz < minsz || ustruct.flags)
> > > > +			return -EINVAL;
> > > > +
> > > > +		mutex_lock(&iommu->lock);
> > > > +		ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn,
> > > > +					    &ustruct);
> > > > +		mutex_unlock(&iommu->lock);
> > > > +		return ret;
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..ccf60a2 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,19 @@ struct vfio_iommu_type1_dma_unmap {
> > > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/**
> > > > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE + 24,
> 
> What's going on with these ioctl numbers?  AFAICT[1] we've used up through
> VFIO_BASE + 21, this jumps to 24, the next patch skips to 27, then the last patch fills
> in 28 & 29.  Thanks,

Hi Alex,

I rebased my patches on top of Eric's nested stage translation series, which also
introduces new ioctls; that is why the numbers jump. I should have handled this better.
I'll sync with Eric to serialize the ioctl numbers.

[PATCH v6 00/22] SMMUv3 Nested Stage Setup by Eric Auger
https://lkml.org/lkml/2019/3/17/124

Thanks,
Yi Liu

> Alex
> 
> [1] git grep -h VFIO_BASE | grep "VFIO_BASE +" | grep -e ^#define | \
>     awk '{print $NF}' | tr -d ')' | sort -u -n



* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-05 23:35   ` Alex Williamson
@ 2019-11-06 13:27     ` Liu, Yi L
  2019-11-07 22:06       ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-06 13:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, November 6, 2019 7:36 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Thu, 24 Oct 2019 08:26:22 -0400
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > to passdown PASID allocation/free request from the virtual
> > iommu. This is required to get PASID managed in system-wide.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  25 +++++++++
> >  2 files changed, 139 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index cd8d3a5..3d73a7d 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> >  }
> >
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > +					 int min_pasid,
> > +					 int max_pasid)
> > +{
> > +	int ret;
> > +	ioasid_t pasid;
> > +	struct mm_struct *mm = NULL;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +	mm = get_task_mm(current);
> > +	/* Track ioasid allocation owner by mm */
> > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > +				max_pasid, NULL);
> 
> Are we sure we want to tie this to the task mm vs perhaps the
> vfio_iommu pointer?

Here we want a per-VM token that can be used for ownership checks, i.e. to
verify that a PASID is held by a specific VM. This is very important to
prevent cross-VM interference. The vfio_iommu pointer would be sufficient for
vfio itself, since vfio is both the PASID alloc requester and a PASID
consumer: e.g. vfio requests PASID allocation from ioasid and also invokes
bind_gpasid(). vfio could either check ownership before invoking
bind_gpasid() or pass the vfio_iommu pointer to the iommu driver. But in the
future there may be other modules that are only consumers of PASIDs and also
want to do ownership checks. That would be hard for them, as they are not the
PASID alloc requester. So it is better to have a system-wide structure serve
as the per-VM token, and the task mm looks like a better fit.

> > +	if (pasid == INVALID_IOASID) {
> > +		ret = -ENOSPC;
> > +		goto out_unlock;
> > +	}
> > +	ret = pasid;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	if (mm)
> > +		mmput(mm);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > +				       unsigned int pasid)
> > +{
> > +	struct mm_struct *mm = NULL;
> > +	void *pdata;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	/**
> > +	 * REVISIT:
> > +	 * There are two cases free could fail:
> > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > +	 * the set does not match, caller is not permitted to free.
> > +	 * 2. free before unbind all devices, we can check if ioasid private
> > +	 * data, if data != NULL, then fail to free.
> > +	 */
> > +	mm = get_task_mm(current);
> > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > +	if (IS_ERR(pdata)) {
> > +		if (pdata == ERR_PTR(-ENOENT))
> > +			pr_err("PASID %u is not allocated\n", pasid);
> > +		else if (pdata == ERR_PTR(-EACCES))
> > +			pr_err("Free PASID %u by non-owner, denied", pasid);
> > +		else
> > +			pr_err("Error searching PASID %u\n", pasid);
> 
> This should be removed, errno is sufficient for the user, this just
> provides the user with a trivial DoS vector filling logs.

sure, will fix it. thanks.

> > +		ret = -EPERM;
> 
> But why not return PTR_ERR(pdata)?

Ah, right, will do.

> > +		goto out_unlock;
> > +	}
> > +	if (pdata) {
> > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > +		/* Expect PASID has no private data if not bond */
> > +		ret = -EBUSY;
> > +		goto out_unlock;
> > +	}
> > +	ioasid_free(pasid);
> 
> We only ever get here with pasid == NULL?! 

I guess you meant only when pdata==NULL.

> Something is wrong.  Should
> that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> vector)

Oh, yes, I'll do it as below:

if (!pdata) {
	ioasid_free(pasid);
	ret = 0;
} else {
	ret = -EBUSY;
}

Is this what you mean?
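
For reference, here is a minimal user-space sketch of the free path with
the review comments applied (no logging, propagate PTR_ERR(), -EBUSY while
private data is set). It is only an illustration under stated assumptions:
the ERR_PTR helpers are stand-ins for the kernel ones, and fake_ioasid_set(),
fake_ioasid_find() and pasid_free_sketch() are hypothetical names, not the
ioasid API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095
static void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical PASID table; the real allocator is ioasid. */
#define FAKE_NR_PASIDS 8
struct fake_ioasid {
	int allocated;
	const void *owner;	/* per-VM token, e.g. the mm in the patch */
	void *pdata;		/* non-NULL while the PASID is bound */
};
static struct fake_ioasid fake_table[FAKE_NR_PASIDS];

/* Test helper: mark a PASID as allocated to 'owner' with private data. */
static void fake_ioasid_set(unsigned int pasid, const void *owner, void *pdata)
{
	assert(pasid < FAKE_NR_PASIDS);
	fake_table[pasid] = (struct fake_ioasid){ 1, owner, pdata };
}

/* Mimics the error convention the patch checks for: -ENOENT / -EACCES. */
static void *fake_ioasid_find(const void *owner, unsigned int pasid)
{
	if (pasid >= FAKE_NR_PASIDS || !fake_table[pasid].allocated)
		return ERR_PTR(-ENOENT);
	if (fake_table[pasid].owner != owner)
		return ERR_PTR(-EACCES);
	return fake_table[pasid].pdata;
}

static int pasid_free_sketch(const void *owner, unsigned int pasid)
{
	void *pdata = fake_ioasid_find(owner, pasid);

	if (IS_ERR(pdata))
		return (int)PTR_ERR(pdata);	/* propagate errno, no logging */
	if (pdata)
		return -EBUSY;			/* still bound */
	fake_table[pasid].allocated = 0;	/* stands in for ioasid_free() */
	return 0;
}
```

With this shape the caller gets -ENOENT, -EACCES or -EBUSY directly, so
nothing needs to be printed to the kernel log on a failed free.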

> > +
> > +out_unlock:
> > +	if (mm)
> > +		mmput(mm);
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2370,6 +2447,43 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  					    &ustruct);
> >  		mutex_unlock(&iommu->lock);
> >  		return ret;
> > +
> > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > +		struct vfio_iommu_type1_pasid_request req;
> > +		int min_pasid, max_pasid, pasid;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > +				    flag);
> > +
> > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (req.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		switch (req.flag) {
> 
> This works, but it's strange.  Let's make the code a little easier for
> the next flag bit that gets used so they don't need to rework this case
> statement.  I'd suggest creating a VFIO_IOMMU_PASID_OPS_MASK that is
> the OR of the ALLOC/FREE options, test that no bits are set outside of
> that mask, then AND that mask as the switch arg with the code below.

Got it. Let me fix it in next version.
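
A rough user-space sketch of how I read the mask suggestion, so we can
check we mean the same thing. VFIO_IOMMU_PASID_OPS_MASK is the name from
the comment above; pasid_request_decode() and enum pasid_op are
hypothetical and only exist to make the sketch testable outside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VFIO_IOMMU_PASID_ALLOC	(1u << 0)
#define VFIO_IOMMU_PASID_FREE	(1u << 1)
/* OR of all operation bits, as suggested in the review. */
#define VFIO_IOMMU_PASID_OPS_MASK \
	(VFIO_IOMMU_PASID_ALLOC | VFIO_IOMMU_PASID_FREE)

static_assert((VFIO_IOMMU_PASID_ALLOC & VFIO_IOMMU_PASID_FREE) == 0,
	      "operation bits must not overlap");

/* Hypothetical return codes for the sketch. */
enum pasid_op { PASID_OP_ALLOC = 1, PASID_OP_FREE = 2 };

static int pasid_request_decode(uint32_t flag)
{
	/* Reject any bit this kernel does not know about. */
	if (flag & ~VFIO_IOMMU_PASID_OPS_MASK)
		return -EINVAL;

	/* Switch on the operation bits only. */
	switch (flag & VFIO_IOMMU_PASID_OPS_MASK) {
	case VFIO_IOMMU_PASID_ALLOC:
		return PASID_OP_ALLOC;
	case VFIO_IOMMU_PASID_FREE:
		return PASID_OP_FREE;
	default:
		return -EINVAL;	/* no operation, or both at once */
	}
}
```

A later non-operation flag bit would then only need a wider "known bits"
check; the switch itself stays unchanged.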

> > +		/**
> > +		 * TODO: min_pasid and max_pasid align with
> > +		 * typedef unsigned int ioasid_t
> > +		 */
> > +		case VFIO_IOMMU_PASID_ALLOC:
> > +			if (copy_from_user(&min_pasid,
> > +				(void __user *)arg + minsz, sizeof(min_pasid)))
> > +				return -EFAULT;
> > +			if (copy_from_user(&max_pasid,
> > +				(void __user *)arg + minsz + sizeof(min_pasid),
> > +				sizeof(max_pasid)))
> > +				return -EFAULT;
> > +			return vfio_iommu_type1_pasid_alloc(iommu,
> > +						min_pasid, max_pasid);
> > +		case VFIO_IOMMU_PASID_FREE:
> > +			if (copy_from_user(&pasid,
> > +				(void __user *)arg + minsz, sizeof(pasid)))
> > +				return -EFAULT;
> > +			return vfio_iommu_type1_pasid_free(iommu, pasid);
> > +		default:
> > +			return -EINVAL;
> > +		}
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ccf60a2..04de290 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -807,6 +807,31 @@ struct vfio_iommu_type1_cache_invalidate {
> >  };
> >  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> >
> > +/*
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @min_pasid and @max_pasid fields
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @pasid field
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > +	__u32	argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > +	__u32	flag;
> > +	union {
> > +		struct {
> > +			int min_pasid;
> > +			int max_pasid;
> > +		};
> > +		int pasid;
> 
> Perhaps:
> 
> 		struct {
> 			u32 min;
> 			u32 max;
> 		} alloc_pasid;
> 		u32 free_pasid;
> 
> (note also the s/int/u32/)

got it. will fix it in next version. Thanks.
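
A user-space sketch of the layout with the suggested names and unsigned
fixed-width fields. Assumptions: uint32_t stands in for __u32, and
struct pasid_request_sketch is a hypothetical name, not the final uapi.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pasid_request_sketch {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct {
			uint32_t min;
			uint32_t max;
		} alloc_pasid;		/* used with the ALLOC operation */
		uint32_t free_pasid;	/* used with the FREE operation */
	};
};

/* The layout is the same for every compiler and for 32/64-bit userspace. */
static_assert(sizeof(struct pasid_request_sketch) == 16,
	      "request must be 16 bytes");
static_assert(offsetof(struct pasid_request_sketch, alloc_pasid) == 8,
	      "payload must follow argsz and flags");
```

With the payload inside the structure, a single copy_from_user() of the
whole request would be enough instead of the three separate copies in the
current patch.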

> > +	};
> > +};
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 27,
> > + *				struct vfio_iommu_type1_pasid_request)
> > + *
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*

Regards,
Yi Liu


* Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-06 13:27     ` Liu, Yi L
@ 2019-11-07 22:06       ` Alex Williamson
  2019-11-08 12:23         ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-11-07 22:06 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

On Wed, 6 Nov 2019 13:27:26 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, November 6, 2019 7:36 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > 
> > On Thu, 24 Oct 2019 08:26:22 -0400
> > Liu Yi L <yi.l.liu@intel.com> wrote:
> >   
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > > to passdown PASID allocation/free request from the virtual
> > > iommu. This is required to get PASID managed in system-wide.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > >  2 files changed, 139 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index cd8d3a5..3d73a7d 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void  
> > *data)  
> > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > >  }
> > >
> > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > +					 int min_pasid,
> > > +					 int max_pasid)
> > > +{
> > > +	int ret;
> > > +	ioasid_t pasid;
> > > +	struct mm_struct *mm = NULL;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +	mm = get_task_mm(current);
> > > +	/* Track ioasid allocation owner by mm */
> > > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > > +				max_pasid, NULL);  
> > 
> > Are we sure we want to tie this to the task mm vs perhaps the
> > vfio_iommu pointer?  
> 
> Here we want a per-VM token that can be used for ownership checks, i.e. to
> verify that a PASID is held by a specific VM. This is very important to
> prevent cross-VM interference. The vfio_iommu pointer would be sufficient for
> vfio itself, since vfio is both the PASID alloc requester and a PASID
> consumer: e.g. vfio requests PASID allocation from ioasid and also invokes
> bind_gpasid(). vfio could either check ownership before invoking
> bind_gpasid() or pass the vfio_iommu pointer to the iommu driver. But in the
> future there may be other modules that are only consumers of PASIDs and also
> want to do ownership checks. That would be hard for them, as they are not the
> PASID alloc requester. So it is better to have a system-wide structure serve
> as the per-VM token, and the task mm looks like a better fit.

Ok, so it's intentional to have a VM-wide token.  Elsewhere in the
type1 code (vfio_dma_do_map) we record the task_struct per dma mapping
so that we can get the task mm as needed.  Would the task_struct
pointer provide any advantage?

Also, an overall question, this provides userspace with pasid alloc and
free ioctls, (1) what prevents a userspace process from consuming every
available pasid, and (2) if the process exits or crashes without
freeing pasids, how are they recovered aside from a reboot?

> > > +	if (pasid == INVALID_IOASID) {
> > > +		ret = -ENOSPC;
> > > +		goto out_unlock;
> > > +	}
> > > +	ret = pasid;
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);

What does holding this lock protect?  That the vfio_iommu remains
backed by an iommu during this operation, even though we don't do
anything to release allocated pasids when that iommu backing is removed?

> > > +	if (mm)
> > > +		mmput(mm);
> > > +	return ret;
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > +				       unsigned int pasid)
> > > +{
> > > +	struct mm_struct *mm = NULL;
> > > +	void *pdata;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	/**
> > > +	 * REVISIT:
> > > +	 * There are two cases free could fail:
> > > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > > +	 * the set does not match, caller is not permitted to free.
> > > +	 * 2. free before unbind all devices, we can check if ioasid private
> > > +	 * data, if data != NULL, then fail to free.
> > > +	 */
> > > +	mm = get_task_mm(current);
> > > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > > +	if (IS_ERR(pdata)) {
> > > +		if (pdata == ERR_PTR(-ENOENT))
> > > +			pr_err("PASID %u is not allocated\n", pasid);
> > > +		else if (pdata == ERR_PTR(-EACCES))
> > > +			pr_err("Free PASID %u by non-owner, denied", pasid);
> > > +		else
> > > +			pr_err("Error searching PASID %u\n", pasid);  
> > 
> > This should be removed, errno is sufficient for the user, this just
> > provides the user with a trivial DoS vector filling logs.  
> 
> sure, will fix it. thanks.
> 
> > > +		ret = -EPERM;  
> > 
> > But why not return PTR_ERR(pdata)?  
> 
> aha, would do it.
> 
> > > +		goto out_unlock;
> > > +	}
> > > +	if (pdata) {
> > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > +		/* Expect PASID has no private data if not bond */
> > > +		ret = -EBUSY;
> > > +		goto out_unlock;
> > > +	}
> > > +	ioasid_free(pasid);  
> > 
> > We only ever get here with pasid == NULL?!   
> 
> I guess you meant only when pdata==NULL.
> 
> > Something is wrong.  Should
> > that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> > vector)  
> 
> Oh, yes, I'll do it as below:
> 
> if (!pdata) {
> 	ioasid_free(pasid);
> 	ret = 0;
> } else {
> 	ret = -EBUSY;
> }
> 
> Is this what you mean?

No, I think I was just confusing pdata and pasid, but I am still
confused about testing pdata.  We call ioasid_alloc() with private =
NULL, and I don't see any of your patches calling ioasid_set_data() to
change the private data after allocation, so how could this ever be
set?  Should this just be a BUG_ON(pdata) as the integrity of the
system is in question should this state ever occur?  Thanks,

Alex
 
> > > +
> > > +out_unlock:
> > > +	if (mm)
> > > +		mmput(mm);
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2370,6 +2447,43 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  					    &ustruct);
> > >  		mutex_unlock(&iommu->lock);
> > >  		return ret;
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > +		struct vfio_iommu_type1_pasid_request req;
> > > +		int min_pasid, max_pasid, pasid;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > +				    flag);
> > > +
> > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (req.argsz < minsz)
> > > +			return -EINVAL;
> > > +
> > > +		switch (req.flag) {  
> > 
> > This works, but it's strange.  Let's make the code a little easier for
> > the next flag bit that gets used so they don't need to rework this case
> > statement.  I'd suggest creating a VFIO_IOMMU_PASID_OPS_MASK that is
> > the OR of the ALLOC/FREE options, test that no bits are set outside of
> > that mask, then AND that mask as the switch arg with the code below.  
> 
> Got it. Let me fix it in next version.
> 
> > > +		/**
> > > +		 * TODO: min_pasid and max_pasid align with
> > > +		 * typedef unsigned int ioasid_t
> > > +		 */
> > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > +			if (copy_from_user(&min_pasid,
> > > +				(void __user *)arg + minsz, sizeof(min_pasid)))
> > > +				return -EFAULT;
> > > +			if (copy_from_user(&max_pasid,
> > > +				(void __user *)arg + minsz + sizeof(min_pasid),
> > > +				sizeof(max_pasid)))
> > > +				return -EFAULT;
> > > +			return vfio_iommu_type1_pasid_alloc(iommu,
> > > +						min_pasid, max_pasid);
> > > +		case VFIO_IOMMU_PASID_FREE:
> > > +			if (copy_from_user(&pasid,
> > > +				(void __user *)arg + minsz, sizeof(pasid)))
> > > +				return -EFAULT;
> > > +			return vfio_iommu_type1_pasid_free(iommu, pasid);
> > > +		default:
> > > +			return -EINVAL;
> > > +		}
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ccf60a2..04de290 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -807,6 +807,31 @@ struct vfio_iommu_type1_cache_invalidate {
> > >  };
> > >  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> > >
> > > +/*
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @min_pasid and @max_pasid fields
> > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @pasid field
> > > + */
> > > +struct vfio_iommu_type1_pasid_request {
> > > +	__u32	argsz;
> > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > +	__u32	flag;
> > > +	union {
> > > +		struct {
> > > +			int min_pasid;
> > > +			int max_pasid;
> > > +		};
> > > +		int pasid;  
> > 
> > Perhaps:
> > 
> > 		struct {
> > 			u32 min;
> > 			u32 max;
> > 		} alloc_pasid;
> > 		u32 free_pasid;
> > 
> > (note also the s/int/u32/)  
> 
> got it. will fix it in next version. Thanks.
> 
> > > +	};
> > > +};
> > > +
> > > +/**
> > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 27,
> > > + *				struct vfio_iommu_type1_pasid_request)
> > > + *
> > > + */
> > > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> > > +
> > >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> > >
> > >  /*  
> 
> Regards,
> Yi Liu



* Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-10-24 12:26 ` [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host Liu Yi L
@ 2019-11-07 23:20   ` Alex Williamson
  2019-11-12 11:21     ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-11-07 23:20 UTC (permalink / raw)
  To: Liu Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe.brucker, peterx, iommu, kvm

On Thu, 24 Oct 2019 08:26:23 -0400
Liu Yi L <yi.l.liu@intel.com> wrote:

> This patch adds vfio support to bind guest translation structure
> to host iommu. VFIO exposes iommu programming capability to user-
> space. Guest is a user-space application in host under KVM solution.
> For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> structure. And this part should be passdown to host to enable nested
> translation (or say two stage translation). This patch reuses the
> VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> bind type for binding guest owned translation structure to host.
> 
> *) Add two new ioctls for VFIO containers.
> 
>   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
>                    bind a process to a pasid or bind a guest pasid
>                    to a device, this is indicated by type
>   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
>                    unbind a process to a pasid or unbind a guest pasid
>                    to a device, also indicated by type
>   - Bind type:
> 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
>                    to a device
> 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
>                    structure to host iommu. e.g. guest page table
> 
> *) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 136 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  44 +++++++++++++
>  2 files changed, 180 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3d73a7d..1a27e25 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2325,6 +2325,104 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *ustruct =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_bind_gpasid(dc->domain, dev, ustruct);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *ustruct =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> +						ustruct->hpasid);
> +}
> +
> +/*
> + * unbind specific gpasid, caller of this function requires hold
> + * vfio_iommu->lock
> + */
> +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> +		  struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	return vfio_iommu_lookup_dev(iommu, vfio_unbind_gpasid_fn, gbind_data);
> +}
> +
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> +					    void __user *arg,
> +					    struct vfio_iommu_type1_bind *bind)
> +{
> +	struct iommu_gpasid_bind_data gbind_data;
> +	unsigned long minsz;
> +	int ret = 0;
> +
> +	minsz = sizeof(*bind) + sizeof(gbind_data);
> +	if (bind->argsz < minsz)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> +		return -EFAULT;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_lookup_dev(iommu, vfio_bind_gpasid_fn, &gbind_data);
> +	/*
> +	 * If bind failed, it may not be a total failure. Some devices within
> +	 * the iommu group may have bind successfully. Although we don't enable
> +	 * pasid capability for non-singletion iommu groups, a unbind operation
> +	 * would be helpful to ensure no partial binding for an iommu group.
> +	 */
> +	if (ret)
> +		/*
> +		 * Undo all binds that already succeeded, no need to check the
> +		 * return value here since some device within the group has no
> +		 * successful bind when coming to this place switch.
> +		 */
> +		vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> +					    void __user *arg,
> +					    struct vfio_iommu_type1_bind *bind)
> +{
> +	struct iommu_gpasid_bind_data gbind_data;
> +	unsigned long minsz;
> +	int ret = 0;
> +
> +	minsz = sizeof(*bind) + sizeof(gbind_data);
> +	if (bind->argsz < minsz)
> +		return -EINVAL;

But gbind_data can change size if new vendor specific data is added to
the union, so kernel updates break existing userspace.  Fail.
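
To illustrate the problem and one common way around it, here is a
user-space sketch, not something from this series: fix minsz at the fields
that exist today, copy min(argsz, sizeof(kernel struct)), and zero the
rest. All names are hypothetical, with uint32_t/uint64_t standing in for
__u32/__u64 and memcpy() for copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: everything up to and including 'gpgd' is the v1 ABI. */
struct bind_data_sketch {
	uint32_t argsz;
	uint32_t flags;
	uint64_t gpgd;
	uint64_t vendor_extra;	/* added by a later kernel */
};

/* minsz is tied to the v1 fields, not to sizeof() of the growing struct. */
#define BIND_MINSZ (offsetof(struct bind_data_sketch, gpgd) + sizeof(uint64_t))
static_assert(BIND_MINSZ == 16, "v1 ABI size must never change");

static int bind_copy_sketch(struct bind_data_sketch *dst, const void *src)
{
	uint32_t argsz;
	size_t n;

	memcpy(&argsz, src, sizeof(argsz));
	if (argsz < BIND_MINSZ)
		return -EINVAL;

	/* Copy only what the caller provided; newer fields read as zero. */
	n = argsz < sizeof(*dst) ? argsz : sizeof(*dst);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, n);
	return 0;
}
```

An old binary passing argsz == 16 keeps working after vendor_extra is
added, which is exactly what a minsz of sizeof(*bind) + sizeof(gbind_data)
cannot guarantee.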

> +
> +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> +		return -EFAULT;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2484,6 +2582,44 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		default:
>  			return -EINVAL;
>  		}
> +
> +	} else if (cmd == VFIO_IOMMU_BIND) {
> +		struct vfio_iommu_type1_bind bind;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> +
> +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (bind.bind_type) {
> +		case VFIO_IOMMU_BIND_GUEST_PASID:
> +			return vfio_iommu_type1_bind_gpasid(iommu,
> +					(void __user *)(arg + minsz), &bind);

Why are we defining BIND_PROCESS if it's not supported?  How does the
user learn it's not supported?

> +		default:
> +			return -EINVAL;
> +		}
> +
> +	} else if (cmd == VFIO_IOMMU_UNBIND) {
> +		struct vfio_iommu_type1_bind bind;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> +
> +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (bind.bind_type) {
> +		case VFIO_IOMMU_BIND_GUEST_PASID:
> +			return vfio_iommu_type1_unbind_gpasid(iommu,
> +					(void __user *)(arg + minsz), &bind);
> +		default:
> +			return -EINVAL;
> +		}
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 04de290..78e8c64 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -832,6 +832,50 @@ struct vfio_iommu_type1_pasid_request {
>   */
>  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
>  
> +enum vfio_iommu_bind_type {
> +	VFIO_IOMMU_BIND_PROCESS,
> +	VFIO_IOMMU_BIND_GUEST_PASID,
> +};
> +
> +/*
> + * Supported types:
> + *	- VFIO_IOMMU_BIND_GUEST_PASID: bind guest pasid, which invoked
> + *			by guest, it takes iommu_gpasid_bind_data in data.
> + */
> +struct vfio_iommu_type1_bind {
> +	__u32				argsz;
> +	enum vfio_iommu_bind_type	bind_type;
> +	__u8				data[];
> +};

I don't think enum defines a compiler invariant data size.  We can't
use it for a kernel/user interface.  Also why no flags field as is
essentially standard for every vfio ioctl?  Couldn't we specify
process/guest-pasid with flags?  For that matter couldn't we specify
bind/unbind using a single ioctl?  I think that would be more
consistent with the pasid alloc/free ioctl in the previous patch.

Why are we appending opaque data to the end of the structure when
clearly we expect a struct iommu_gpasid_bind_data?  That bind data
structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How does
a user determine what formats are accepted from within the vfio API (or
even outside of the vfio API)?
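
Something along these lines is what I have in mind for the header, as a
user-space sketch only. Every name here is hypothetical, uint32_t stands
in for __u32, and the flag assignments are just for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bind type and direction carried in flags instead of an enum. */
#define BIND_SKETCH_GUEST_PASID	(1u << 0)
#define BIND_SKETCH_PROCESS	(1u << 1)
#define BIND_SKETCH_UNBIND	(1u << 2)	/* clear means bind */
#define BIND_SKETCH_TYPE_MASK	(BIND_SKETCH_GUEST_PASID | BIND_SKETCH_PROCESS)
#define BIND_SKETCH_ALL		(BIND_SKETCH_TYPE_MASK | BIND_SKETCH_UNBIND)

struct bind_sketch {
	uint32_t argsz;
	uint32_t flags;		/* fixed width, unlike an enum */
	uint8_t  data[];
};
static_assert(sizeof(struct bind_sketch) == 8, "header must be fixed-size");

static int bind_flags_check(uint32_t flags)
{
	uint32_t type = flags & BIND_SKETCH_TYPE_MASK;

	/* Unknown bits are rejected so they can be assigned later. */
	if (flags & ~BIND_SKETCH_ALL)
		return -EINVAL;
	/* Exactly one bind type must be selected. */
	if (type != BIND_SKETCH_GUEST_PASID && type != BIND_SKETCH_PROCESS)
		return -EINVAL;
	return 0;
}
```
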

> +
> +/*
> + * VFIO_IOMMU_BIND - _IOWR(VFIO_TYPE, VFIO_BASE + 28, struct vfio_iommu_bind)
                            ^
The semantics appear to just be _IOW, nothing is written back to the
userspace buffer on return.

> + *
> + * Manage address spaces of devices in this container. Initially a TYPE1
> + * container can only have one address space, managed with
> + * VFIO_IOMMU_MAP/UNMAP_DMA.
> + *
> + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
> + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
> + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> + * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
> + * non-PASID traffic and BIND controls PASID traffic. But this depends on the
> + * underlying IOMMU architecture and isn't guaranteed.
> + *
> + * Availability of this feature depends on the device, its bus, the underlying
> + * IOMMU and the CPU architecture.

And the user discovers this is available by...?  There's no probe here,
are they left only to set up a VM to the point of trying to use this
before they fail the ioctl?  Could VFIO_IOMMU_GET_INFO fill this gap?
Thanks,

Alex

> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 28)
> +
> +/*
> + * VFIO_IOMMU_UNBIND - _IOWR(VFIO_TYPE, VFIO_BASE + 29, struct vfio_iommu_bind)
> + *
> + * Undo what was done by the corresponding VFIO_IOMMU_BIND ioctl.
> + */
> +#define VFIO_IOMMU_UNBIND	_IO(VFIO_TYPE, VFIO_BASE + 29)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*



* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-07 22:06       ` Alex Williamson
@ 2019-11-08 12:23         ` Liu, Yi L
  2019-11-08 15:15           ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-08 12:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, November 8, 2019 6:07 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Wed, 6 Nov 2019 13:27:26 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > > > to passdown PASID allocation/free request from the virtual
> > > > iommu. This is required to get PASID managed in system-wide.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > >  2 files changed, 139 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> > > > index cd8d3a5..3d73a7d 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev,
> void
> > > *data)
> > > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > > >  }
> > > >
> > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > +					 int min_pasid,
> > > > +					 int max_pasid)
> > > > +{
> > > > +	int ret;
> > > > +	ioasid_t pasid;
> > > > +	struct mm_struct *mm = NULL;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	mm = get_task_mm(current);
> > > > +	/* Track ioasid allocation owner by mm */
> > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > > > +				max_pasid, NULL);
> > >
> > > Are we sure we want to tie this to the task mm vs perhaps the
> > > vfio_iommu pointer?
> >
> > Here we want to have a kind of per-VM mark, which can be used to do
> > ownership check on whether a pasid is held by a specific VM. This is
> > very important to prevent across VM affect. vfio_iommu pointer is
> > competent for vfio as vfio is both pasid alloc requester and pasid
> > consumer. e.g. vfio requests pasid alloc from ioasid and also it will
> > invoke bind_gpasid(). vfio can either check ownership before invoking
> > bind_gpasid() or pass vfio_iommu pointer to iommu driver. But in future,
> > there may be other modules which are just consumers of pasid. And they
> > also want to do ownership check for a pasid. Then, it would be hard for
> > them as they are not the pasid alloc requester. So here better to have
> > a system wide structure to perform as the per-VM mark. task mm looks
> > to be much competent.
> 
> Ok, so it's intentional to have a VM-wide token.  Elsewhere in the
> type1 code (vfio_dma_do_map) we record the task_struct per dma mapping
> so that we can get the task mm as needed.  Would the task_struct
> pointer provide any advantage?

I think we can use the task_struct pointer to keep the type1 code consistent.
What do you think?

> Also, an overall question, this provides userspace with pasid alloc and
> free ioctls, (1) what prevents a userspace process from consuming every
> available pasid, and (2) if the process exits or crashes without
> freeing pasids, how are they recovered aside from a reboot?

For question (1), I think we only need to worry about a malicious
userspace process. As vfio is used in privileged mode, we may
be safe for now. However, we may need to introduce some kind of credit
mechanism to guard against it. I've thought about it but have no good idea
yet. I would be happy to hear your suggestions.
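
To make the credit idea concrete, here is a minimal userspace sketch, not
kernel code. The pasid_quota structure and its helpers are hypothetical;
nothing like them exists in vfio or ioasid, and how to pick the limit is
exactly the open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-owner accounting; not part of vfio or ioasid. */
struct pasid_quota {
	unsigned int limit;	/* maximum PASIDs this owner may hold */
	unsigned int used;	/* PASIDs currently held */
};

static void pasid_quota_init(struct pasid_quota *q, unsigned int limit)
{
	assert(q != NULL);
	q->limit = limit;
	q->used = 0;
}

/* Charge one PASID to the owner; refuse once the limit is reached. */
static bool pasid_quota_charge(struct pasid_quota *q)
{
	if (q->used >= q->limit)
		return false;
	q->used++;
	return true;
}

/* Give back one PASID's worth of credit; ignore unbalanced calls. */
static void pasid_quota_uncharge(struct pasid_quota *q)
{
	if (q->used > 0)
		q->used--;
}
```

The alloc path would charge before calling the allocator and uncharge if the
allocation fails; the free path would uncharge after a successful free.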

For question (2), I think we need to reclaim the allocated pasids when
the vfio container fd is released, just as vfio does for the domain
mappings. I haven't added that yet, but I can add it in the next version if
you think it would make pasid alloc/free more robust.
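
Reclaim on release could look roughly like the userspace model below. It is
only a sketch under assumptions: the container structure, the 64-entry ID
space and all helper names are hypothetical, and the global array stands in
for the system-wide allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_PASID 64	/* tiny ID space, for illustration only */

/* Stand-in for the system-wide allocator: true means "in use". */
static bool pasid_in_use[MAX_PASID];

/* Hypothetical per-container record of the PASIDs it allocated. */
struct container {
	bool owned[MAX_PASID];
};

static void container_open(struct container *c)
{
	assert(c != NULL);
	memset(c, 0, sizeof(*c));
}

/* Allocate the lowest free PASID in [min, max], or -1 if none is free. */
static int container_pasid_alloc(struct container *c, int min, int max)
{
	if (min < 0 || max >= MAX_PASID || min > max)
		return -1;
	for (int id = min; id <= max; id++) {
		if (!pasid_in_use[id]) {
			pasid_in_use[id] = true;
			c->owned[id] = true;
			return id;
		}
	}
	return -1;
}

/* Called when the container fd is closed: free whatever is still owned. */
static int container_release(struct container *c)
{
	int reclaimed = 0;

	for (int id = 0; id < MAX_PASID; id++) {
		if (c->owned[id]) {
			c->owned[id] = false;
			pasid_in_use[id] = false;
			reclaimed++;
		}
	}
	return reclaimed;
}
```

Because release walks the per-container record rather than trusting the user
to call free, a process that exits or crashes cannot leave PASIDs stranded.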

> > > > +	if (pasid == INVALID_IOASID) {
> > > > +		ret = -ENOSPC;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	ret = pasid;
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> 
> What does holding this lock protect?  That the vfio_iommu remains
> backed by an iommu during this operation, even though we don't do
> anything to release allocated pasids when that iommu backing is removed?

Yes, it is unnecessary to hold the lock here, at least for the operations in
this patch. I will remove it. :-)

> > > > +	if (mm)
> > > > +		mmput(mm);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > +				       unsigned int pasid)
> > > > +{
> > > > +	struct mm_struct *mm = NULL;
> > > > +	void *pdata;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	/**
> > > > +	 * REVISIT:
> > > > +	 * There are two cases free could fail:
> > > > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > > > +	 * the set does not match, caller is not permitted to free.
> > > > +	 * 2. free before unbind all devices, we can check if ioasid private
> > > > +	 * data, if data != NULL, then fail to free.
> > > > +	 */
> > > > +	mm = get_task_mm(current);
> > > > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > > > +	if (IS_ERR(pdata)) {
> > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > +			pr_err("PASID %u is not allocated\n", pasid);
> > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > +			pr_err("Free PASID %u by non-owner, denied", pasid);
> > > > +		else
> > > > +			pr_err("Error searching PASID %u\n", pasid);
> > >
> > > This should be removed, errno is sufficient for the user, this just
> > > provides the user with a trivial DoS vector filling logs.
> >
> > sure, will fix it. thanks.
> >
> > > > +		ret = -EPERM;
> > >
> > > But why not return PTR_ERR(pdata)?
> >
> > aha, would do it.
> >
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	if (pdata) {
> > > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > > +		/* Expect PASID has no private data if not bond */
> > > > +		ret = -EBUSY;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	ioasid_free(pasid);
> > >
> > > We only ever get here with pasid == NULL?!
> >
> > I guess you meant only when pdata==NULL.
> >
> > > Something is wrong.  Should
> > > that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> > > vector)
> >
> > Oh, yes, just do it as below:
> >
> > if (!pdata) {
> > 	ioasid_free(pasid);
> > 	ret = SUCCESS;
> > } else
> > 	ret = -EBUSY;
> >
> > Is it what you mean?
> 
> No, I think I was just confusing pdata and pasid, but I am still
> confused about testing pdata.  We call ioasid_alloc() with private =
> NULL, and I don't see any of your patches calling ioasid_set_data() to
> change the private data after allocation, so how could this ever be
> set?  Should this just be a BUG_ON(pdata) as the integrity of the
> system is in question should this state ever occur?  Thanks,

ioasid_set_data() is called in a patch from Jacob's vSVA patchset:
[PATCH v6 08/10] iommu/vt-d: Add bind guest PASID support
https://lkml.org/lkml/2019/10/22/946

The basic idea is to allocate the pasid with private=NULL and to set the
private data when the pasid is actually bound to a device (bind_gpasid()).
Each bind_gpasid() increases the ref_cnt in the private data, and each
unbind_gpasid() decreases it. So if the bind_gpasid() and unbind_gpasid()
calls are balanced, the private data should be NULL by the time of the free
operation. If it is not, vfio can conclude that the pasid is still in use.
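
The lifecycle described above can be modelled in userspace as follows. This
is a sketch, not the ioasid or VT-d code: pasid_entry, bind_data and the three
helpers are hypothetical names, and clearing the private data when ref_cnt
drops to zero is an assumption about how the unbind side behaves.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical private data attached to a PASID while it is bound. */
struct bind_data {
	int ref_cnt;
};

struct pasid_entry {
	int allocated;
	struct bind_data *priv;	/* NULL until the first bind */
};

/* First bind attaches private data; later binds only take a reference. */
static int pasid_bind(struct pasid_entry *e)
{
	if (!e->allocated)
		return -ENOENT;
	if (!e->priv) {
		e->priv = calloc(1, sizeof(*e->priv));
		if (!e->priv)
			return -ENOMEM;
	}
	e->priv->ref_cnt++;
	return 0;
}

/* Last unbind drops the private data, so priv goes back to NULL. */
static int pasid_unbind(struct pasid_entry *e)
{
	if (!e->allocated || !e->priv)
		return -EINVAL;
	assert(e->priv->ref_cnt > 0);
	if (--e->priv->ref_cnt == 0) {
		free(e->priv);
		e->priv = NULL;
	}
	return 0;
}

/* Free is refused while private data is present, i.e. while bound. */
static int pasid_free(struct pasid_entry *e)
{
	if (!e->allocated)
		return -ENOENT;
	if (e->priv)
		return -EBUSY;
	e->allocated = 0;
	return 0;
}
```

In this model a free attempted between the first bind and the last unbind
returns -EBUSY, which is the case the pdata check in the patch is meant to
catch.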

> Alex

Thanks,
Yi Liu
 
> > > > +
> > > > +out_unlock:
> > > > +	if (mm)
> > > > +		mmput(mm);
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2370,6 +2447,43 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> > > >  					    &ustruct);
> > > >  		mutex_unlock(&iommu->lock);
> > > >  		return ret;
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > +		struct vfio_iommu_type1_pasid_request req;
> > > > +		int min_pasid, max_pasid, pasid;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > > +				    flag);
> > > > +
> > > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (req.argsz < minsz)
> > > > +			return -EINVAL;
> > > > +
> > > > +		switch (req.flag) {
> > >
> > > This works, but it's strange.  Let's make the code a little easier for
> > > the next flag bit that gets used so they don't need to rework this case
> > > statement.  I'd suggest creating a VFIO_IOMMU_PASID_OPS_MASK that is
> > > the OR of the ALLOC/FREE options, test that no bits are set outside of
> > > that mask, then AND that mask as the switch arg with the code below.
> >
> > Got it. Let me fix it in next version.
> >
> > > > +		/**
> > > > +		 * TODO: min_pasid and max_pasid align with
> > > > +		 * typedef unsigned int ioasid_t
> > > > +		 */
> > > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > > +			if (copy_from_user(&min_pasid,
> > > > +				(void __user *)arg + minsz, sizeof(min_pasid)))
> > > > +				return -EFAULT;
> > > > +			if (copy_from_user(&max_pasid,
> > > > +				(void __user *)arg + minsz + sizeof(min_pasid),
> > > > +				sizeof(max_pasid)))
> > > > +				return -EFAULT;
> > > > +			return vfio_iommu_type1_pasid_alloc(iommu,
> > > > +						min_pasid, max_pasid);
> > > > +		case VFIO_IOMMU_PASID_FREE:
> > > > +			if (copy_from_user(&pasid,
> > > > +				(void __user *)arg + minsz, sizeof(pasid)))
> > > > +				return -EFAULT;
> > > > +			return vfio_iommu_type1_pasid_free(iommu, pasid);
> > > > +		default:
> > > > +			return -EINVAL;
> > > > +		}
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index ccf60a2..04de290 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -807,6 +807,31 @@ struct vfio_iommu_type1_cache_invalidate {
> > > >  };
> > > >  #define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE +
> 24)
> > > >
> > > > +/*
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @min_pasid and
> > > @max_pasid fields
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @pasid field
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > +	__u32	argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > > +	__u32	flag;
> > > > +	union {
> > > > +		struct {
> > > > +			int min_pasid;
> > > > +			int max_pasid;
> > > > +		};
> > > > +		int pasid;
> > >
> > > Perhaps:
> > >
> > > 		struct {
> > > 			u32 min;
> > > 			u32 max;
> > > 		} alloc_pasid;
> > > 		u32 free_pasid;
> > >
> > > (note also the s/int/u32/)
> >
> > got it. will fix it in next version. Thanks.
> >
> > > > +	};
> > > > +};
> > > > +
> > > > +/**
> > > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 27,
> > > > + *				struct vfio_iommu_type1_pasid_request)
> > > > + *
> > > > + */
> > > > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> > > > +
> > > >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> > > >
> > > >  /*
> >
> > Regards,
> > Yi Liu



* Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-08 12:23         ` Liu, Yi L
@ 2019-11-08 15:15           ` Alex Williamson
  2019-11-13 11:03             ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-11-08 15:15 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

On Fri, 8 Nov 2019 12:23:41 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, November 8, 2019 6:07 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > 
> > On Wed, 6 Nov 2019 13:27:26 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > >
> > > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > > > > to passdown PASID allocation/free request from the virtual
> > > > > iommu. This is required to get PASID managed in system-wide.
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 114  
> > > > ++++++++++++++++++++++++++++++++++++++++  
> > > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > > >  2 files changed, 139 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c  
> > b/drivers/vfio/vfio_iommu_type1.c  
> > > > > index cd8d3a5..3d73a7d 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev,  
> > void  
> > > > *data)  
> > > > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > > > >  }
> > > > >
> > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > +					 int min_pasid,
> > > > > +					 int max_pasid)
> > > > > +{
> > > > > +	int ret;
> > > > > +	ioasid_t pasid;
> > > > > +	struct mm_struct *mm = NULL;
> > > > > +
> > > > > +	mutex_lock(&iommu->lock);
> > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > +		ret = -EINVAL;
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +	mm = get_task_mm(current);
> > > > > +	/* Track ioasid allocation owner by mm */
> > > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > > > > +				max_pasid, NULL);  
> > > >
> > > > Are we sure we want to tie this to the task mm vs perhaps the
> > > > vfio_iommu pointer?  
> > >
> > > Here we want to have a kind of per-VM mark, which can be used to do
> > > ownership check on whether a pasid is held by a specific VM. This is
> > > very important to prevent across VM affect. vfio_iommu pointer is
> > > competent for vfio as vfio is both pasid alloc requester and pasid
> > > consumer. e.g. vfio requests pasid alloc from ioasid and also it will
> > > invoke bind_gpasid(). vfio can either check ownership before invoking
> > > bind_gpasid() or pass vfio_iommu pointer to iommu driver. But in future,
> > > there may be other modules which are just consumers of pasid. And they
> > > also want to do ownership check for a pasid. Then, it would be hard for
> > > them as they are not the pasid alloc requester. So here better to have
> > > a system wide structure to perform as the per-VM mark. task mm looks
> > > to be much competent.  
> > 
> > Ok, so it's intentional to have a VM-wide token.  Elsewhere in the
> > type1 code (vfio_dma_do_map) we record the task_struct per dma mapping
> > so that we can get the task mm as needed.  Would the task_struct
> > pointer provide any advantage?  
> 
> I think we can use the task_struct pointer to keep the type1 code
> consistent. What do you think?

If it has the same utility, sure.
 
> > Also, an overall question, this provides userspace with pasid alloc and
> > free ioctls, (1) what prevents a userspace process from consuming every
> > available pasid, and (2) if the process exits or crashes without
> > freeing pasids, how are they recovered aside from a reboot?  
> 
> For question (1), I think we only need to worry about a malicious
> userspace process. As vfio is used in privileged mode, we may
> be safe for now.

No, where else do we ever make this assumption?  vfio requires a
privileged entity to configure the system for vfio, bind devices for
user use, and grant those devices to the user, but the usage of the
device is always assumed to be by an unprivileged user.  It is
absolutely not acceptable to require a privileged user.  It's vfio's
responsibility to protect the system from the user.

> However, we may need to introduce some kind of credit
> mechanism to guard against it. I've thought about it but have no good idea
> yet. I would be happy to hear your suggestions.

It's a limited system resource and it's unclear how many might
reasonably be used by a user.  I don't have an easy answer.

> For question (2), I think we need to reclaim the allocated pasids when
> the vfio container fd is released, just as vfio does for the domain
> mappings. I haven't added that yet, but I can add it in the next version if
> you think it would make pasid alloc/free more robust.

Consider it required; the interface is susceptible to abuse without it.

> > > > > +	if (pasid == INVALID_IOASID) {
> > > > > +		ret = -ENOSPC;
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +	ret = pasid;
> > > > > +out_unlock:
> > > > > +	mutex_unlock(&iommu->lock);  
> > 
> > What does holding this lock protect?  That the vfio_iommu remains
> > backed by an iommu during this operation, even though we don't do
> > anything to release allocated pasids when that iommu backing is removed?  
> 
> Yes, it is unnecessary to hold the lock here, at least for the operations in
> this patch. I will remove it. :-)
> 
> > > > > +	if (mm)
> > > > > +		mmput(mm);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > +				       unsigned int pasid)
> > > > > +{
> > > > > +	struct mm_struct *mm = NULL;
> > > > > +	void *pdata;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	mutex_lock(&iommu->lock);
> > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > +		ret = -EINVAL;
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +
> > > > > +	/**
> > > > > +	 * REVISIT:
> > > > > +	 * There are two cases free could fail:
> > > > > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > > > > +	 * the set does not match, caller is not permitted to free.
> > > > > +	 * 2. free before unbind all devices, we can check if ioasid private
> > > > > +	 * data, if data != NULL, then fail to free.
> > > > > +	 */
> > > > > +	mm = get_task_mm(current);
> > > > > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > > > > +	if (IS_ERR(pdata)) {
> > > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > > +			pr_err("PASID %u is not allocated\n", pasid);
> > > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > > +			pr_err("Free PASID %u by non-owner, denied", pasid);
> > > > > +		else
> > > > > +			pr_err("Error searching PASID %u\n", pasid);  
> > > >
> > > > This should be removed, errno is sufficient for the user, this just
> > > > provides the user with a trivial DoS vector filling logs.  
> > >
> > > sure, will fix it. thanks.
> > >  
> > > > > +		ret = -EPERM;  
> > > >
> > > > But why not return PTR_ERR(pdata)?  
> > >
> > > aha, would do it.
> > >  
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +	if (pdata) {
> > > > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > > > +		/* Expect PASID has no private data if not bond */
> > > > > +		ret = -EBUSY;
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +	ioasid_free(pasid);  
> > > >
> > > > We only ever get here with pasid == NULL?!  
> > >
> > > I guess you meant only when pdata==NULL.
> > >  
> > > > Something is wrong.  Should
> > > > that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> > > > vector)  
> > >
> > > Oh, yes, just do it as below:
> > >
> > > if (!pdata) {
> > > 	ioasid_free(pasid);
> > > 	ret = SUCCESS;
> > > } else
> > > 	ret = -EBUSY;
> > >
> > > Is it what you mean?  
> > 
> > No, I think I was just confusing pdata and pasid, but I am still
> > confused about testing pdata.  We call ioasid_alloc() with private =
> > NULL, and I don't see any of your patches calling ioasid_set_data() to
> > change the private data after allocation, so how could this ever be
> > set?  Should this just be a BUG_ON(pdata) as the integrity of the
> > system is in question should this state ever occur?  Thanks,  
> 
> ioasid_set_data() is called in a patch from Jacob's vSVA patchset:
> [PATCH v6 08/10] iommu/vt-d: Add bind guest PASID support
> https://lkml.org/lkml/2019/10/22/946
> 
> The basic idea is to allocate the pasid with private=NULL and to set the
> private data when the pasid is actually bound to a device (bind_gpasid()).
> Each bind_gpasid() increases the ref_cnt in the private data, and each
> unbind_gpasid() decreases it. So if the bind_gpasid() and unbind_gpasid()
> calls are balanced, the private data should be NULL by the time of the free
> operation. If it is not, vfio can conclude that the pasid is still in use.

So this is another opportunity to leak pasids.  What's a user supposed
to do when their attempt to free a pasid fails?  It invites leaks to
allow this path to fail.  Thanks,

Alex



* RE: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-07 23:20   ` Alex Williamson
@ 2019-11-12 11:21     ` Liu, Yi L
  2019-11-12 17:25       ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-12 11:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm, Lu,
	Baolu, Wu, Hao

> From: Alex Williamson < alex.williamson@redhat.com >
> Sent: Friday, November 8, 2019 7:21 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> 
> On Thu, 24 Oct 2019 08:26:23 -0400
> Liu Yi L <yi.l.liu@intel.com> wrote:
> 
> > This patch adds vfio support to bind guest translation structure
> > to host iommu. VFIO exposes iommu programming capability to user-
> > space. Guest is a user-space application in host under KVM solution.
> > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > structure. And this part should be passdown to host to enable nested
> > translation (or say two stage translation). This patch reuses the
> > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > bind type for binding guest owned translation structure to host.
> >
> > *) Add two new ioctls for VFIO containers.
> >
> >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> >                    bind a process to a pasid or bind a guest pasid
> >                    to a device, this is indicated by type
> >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> >                    unbind a process to a pasid or unbind a guest pasid
> >                    to a device, also indicated by type
> >   - Bind type:
> > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
> >                    to a device
> > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> >                    structure to host iommu. e.g. guest page table
> >
> > *) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 136
> ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  44 +++++++++++++
> >  2 files changed, 180 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 3d73a7d..1a27e25 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2325,6 +2325,104 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *ustruct =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_bind_gpasid(dc->domain, dev, ustruct);
> > +}
> > +
> > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *ustruct =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +						ustruct->hpasid);
> > +}
> > +
> > +/*
> > + * unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > +		  struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	return vfio_iommu_lookup_dev(iommu, vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +					    void __user *arg,
> > +					    struct vfio_iommu_type1_bind *bind)
> > +{
> > +	struct iommu_gpasid_bind_data gbind_data;
> > +	unsigned long minsz;
> > +	int ret = 0;
> > +
> > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > +	if (bind->argsz < minsz)
> > +		return -EINVAL;
> > +
> > +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> > +		return -EFAULT;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_lookup_dev(iommu, vfio_bind_gpasid_fn, &gbind_data);
> > +	/*
> > +	 * If bind failed, it may not be a total failure. Some devices within
> > +	 * the iommu group may have bind successfully. Although we don't enable
> > +	 * pasid capability for non-singletion iommu groups, a unbind operation
> > +	 * would be helpful to ensure no partial binding for an iommu group.
> > +	 */
> > +	if (ret)
> > +		/*
> > +		 * Undo all binds that already succeeded, no need to check the
> > +		 * return value here since some device within the group has no
> > +		 * successful bind when coming to this place switch.
> > +		 */
> > +		vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > +					    void __user *arg,
> > +					    struct vfio_iommu_type1_bind *bind)
> > +{
> > +	struct iommu_gpasid_bind_data gbind_data;
> > +	unsigned long minsz;
> > +	int ret = 0;
> > +
> > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > +	if (bind->argsz < minsz)
> > +		return -EINVAL;
> 
> But gbind_data can change size if new vendor specific data is added to
> the union, so kernel updates break existing userspace.  Fail.

Yes. We have a version field in struct iommu_gpasid_bind_data. How
about doing the sanity check per version? The kernel knows the gbind_data
size for each version. Does that make sense? If so, I'll also apply it
to the other sanity checks in this series so that userspace does not break
after a kernel update.
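
A per-version size check could be sketched as below. This is a userspace
illustration under assumptions: the two layouts, the size table and
bind_data_check() are hypothetical and do not reflect the real
struct iommu_gpasid_bind_data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts: version 2 appends one field to version 1. */
struct bind_data_v1 {
	uint32_t version;
	uint32_t format;
	uint64_t pgd;
};

struct bind_data_v2 {
	uint32_t version;
	uint32_t format;
	uint64_t pgd;
	uint64_t vendor;
};

#define BIND_DATA_MAX_VERSION 2

/* Size the kernel expects for each known version; index 0 is unused. */
static const size_t bind_data_size[] = {
	[1] = sizeof(struct bind_data_v1),
	[2] = sizeof(struct bind_data_v2),
};

/*
 * Return the number of bytes to copy for this version, or -EINVAL if the
 * version is unknown or the user buffer is too small for it.
 */
static long bind_data_check(uint32_t version, size_t user_size)
{
	assert(BIND_DATA_MAX_VERSION <
	       sizeof(bind_data_size) / sizeof(bind_data_size[0]));

	if (version == 0 || version > BIND_DATA_MAX_VERSION)
		return -EINVAL;
	if (user_size < bind_data_size[version])
		return -EINVAL;
	return (long)bind_data_size[version];
}
```

With this shape, a newer kernel that adds version 2 still accepts a version 1
request of the old size, so growing the structure does not break existing
userspace.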

> > +
> > +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> > +		return -EFAULT;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2484,6 +2582,44 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		default:
> >  			return -EINVAL;
> >  		}
> > +
> > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > +		struct vfio_iommu_type1_bind bind;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> > +
> > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (bind.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		switch (bind.bind_type) {
> > +		case VFIO_IOMMU_BIND_GUEST_PASID:
> > +			return vfio_iommu_type1_bind_gpasid(iommu,
> > +					(void __user *)(arg + minsz), &bind);
> 
> Why are we defining BIND_PROCESS if it's not supported?  How does the
> user learn it's not supported?

I think I should drop it for now since I only add BIND_GUEST_PASID. I think
Jean-Philippe may need it in his native SVA enabling patchset. As for how
to let the user learn about it, maybe use VFIO_IOMMU_GET_INFO as you mention
below?

> 
> > +		default:
> > +			return -EINVAL;
> > +		}
> > +
> > +	} else if (cmd == VFIO_IOMMU_UNBIND) {
> > +		struct vfio_iommu_type1_bind bind;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> > +
> > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (bind.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		switch (bind.bind_type) {
> > +		case VFIO_IOMMU_BIND_GUEST_PASID:
> > +			return vfio_iommu_type1_unbind_gpasid(iommu,
> > +					(void __user *)(arg + minsz), &bind);
> > +		default:
> > +			return -EINVAL;
> > +		}
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 04de290..78e8c64 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -832,6 +832,50 @@ struct vfio_iommu_type1_pasid_request {
> >   */
> >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> >
> > +enum vfio_iommu_bind_type {
> > +	VFIO_IOMMU_BIND_PROCESS,
> > +	VFIO_IOMMU_BIND_GUEST_PASID,
> > +};
> > +
> > +/*
> > + * Supported types:
> > + *	- VFIO_IOMMU_BIND_GUEST_PASID: bind guest pasid, which invoked
> > + *			by guest, it takes iommu_gpasid_bind_data in data.
> > + */
> > +struct vfio_iommu_type1_bind {
> > +	__u32				argsz;
> > +	enum vfio_iommu_bind_type	bind_type;
> > +	__u8				data[];
> > +};
> 
> I don't think enum defines a compiler invariant data size.  We can't
> use it for a kernel/user interface.  Also why no flags field as is
> essentially standard for every vfio ioctl?  Couldn't we specify
> process/guest-pasid with flags?

I remember an earlier comment from the community pointing out that using
flags potentially allows multiple types to be configured in one ioctl.
Defining explicit enums avoids that. But I agree with you that an enum
does not have a fixed size. I'll fix it if that matters more.

> For that matter couldn't we specify
> bind/unbind using a single ioctl?  I think that would be more
> consistent with the pasid alloc/free ioctl in the previous patch.

Yes, I'll do that in the next version.
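
A fixed-width flags field can still reject a request that names more than one
bind type, which was the earlier concern about flags. The sketch below is
userspace C under assumptions: every BIND_FLAG_* name, the bit positions and
bind_request_check() are hypothetical, not a proposed uAPI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag layout; names and bit positions are illustrative. */
#define BIND_FLAG_GUEST_PASID	(1u << 0)
#define BIND_FLAG_PROCESS	(1u << 1)
#define BIND_FLAG_UNBIND	(1u << 2)	/* clear means bind */

#define BIND_TYPE_MASK	(BIND_FLAG_GUEST_PASID | BIND_FLAG_PROCESS)
#define BIND_FLAGS_MASK	(BIND_TYPE_MASK | BIND_FLAG_UNBIND)

/* Fixed-width header; the type-specific payload would follow it. */
struct bind_request {
	uint32_t argsz;
	uint32_t flags;
};

static int bind_request_check(const struct bind_request *req)
{
	uint32_t type;

	assert(req != NULL);

	/* Reject any bit this implementation does not know about. */
	if (req->flags & ~BIND_FLAGS_MASK)
		return -EINVAL;

	/* Exactly one type bit must be set. */
	type = req->flags & BIND_TYPE_MASK;
	if (type == 0 || (type & (type - 1)))
		return -EINVAL;

	/* Only guest PASID binding is implemented in this sketch. */
	if (type != BIND_FLAG_GUEST_PASID)
		return -EOPNOTSUPP;

	return 0;
}
```

The same header then serves both bind and unbind: the handler dispatches on
BIND_FLAG_UNBIND after the check succeeds.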

> Why are we appending opaque data to the end of the structure when
> clearly we expect a struct iommu_gpasid_bind_data?

This is due to the intention to support BIND_GUEST_PASID and
BIND_PROCESS with a single IOCTL. Maybe we can use a separate
IOCTL for BIND_PROCESS. What's your opinion here?

> That bind data
> structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How does
> a user determine what formats are accepted from within the vfio API (or
> even outside of the vfio API)?

The info is provided by the vIOMMU emulator (e.g. virtual VT-d). The vSVA
patch from Jacob has a sanity check on it:
https://lkml.org/lkml/2019/10/28/873

> > +
> > +/*
> > + * VFIO_IOMMU_BIND - _IOWR(VFIO_TYPE, VFIO_BASE + 28, struct
> vfio_iommu_bind)
>                             ^
> The semantics appear to just be _IOW, nothing is written back to the
> userspace buffer on return.

will fix it. thanks.

> > + *
> > + * Manage address spaces of devices in this container. Initially a TYPE1
> > + * container can only have one address space, managed with
> > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > + *
> > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both
> MAP/UNMAP
> > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
> > + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP
> controls
> > + * non-PASID traffic and BIND controls PASID traffic. But this depends on the
> > + * underlying IOMMU architecture and isn't guaranteed.
> > + *
> > + * Availability of this feature depends on the device, its bus, the underlying
> > + * IOMMU and the CPU architecture.
> 
> And the user discovers this is available by...?  There's no probe here,
> are they left only to setup a VM to the point of trying to use this
> before they fail the ioctl?  Could VFIO_IOMMU_GET_INFO fill this gap?
> Thanks,

I think VFIO_IOMMU_GET_INFO could help. Let me extend it to fill this gap
if you agree.

> Alex

Thanks,
Yi Liu

> 
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 28)
> > +
> > +/*
> > + * VFIO_IOMMU_UNBIND - _IOWR(VFIO_TYPE, VFIO_BASE + 29, struct
> vfio_iommu_bind)
> > + *
> > + * Undo what was done by the corresponding VFIO_IOMMU_BIND ioctl.
> > + */
> > +#define VFIO_IOMMU_UNBIND	_IO(VFIO_TYPE, VFIO_BASE + 29)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*



* Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-12 11:21     ` Liu, Yi L
@ 2019-11-12 17:25       ` Alex Williamson
  2019-11-13  7:43         ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-11-12 17:25 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm, Lu,
	Baolu, Wu, Hao

On Tue, 12 Nov 2019 11:21:40 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson < alex.williamson@redhat.com >
> > Sent: Friday, November 8, 2019 7:21 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > 
> > On Thu, 24 Oct 2019 08:26:23 -0400
> > Liu Yi L <yi.l.liu@intel.com> wrote:
> >   
> > > This patch adds vfio support to bind guest translation structure
> > > to host iommu. VFIO exposes iommu programming capability to user-
> > > space. Guest is a user-space application in host under KVM solution.
> > > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > > structure. And this part should be passdown to host to enable nested
> > > translation (or say two stage translation). This patch reuses the
> > > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > > bind type for binding guest owned translation structure to host.
> > >
> > > *) Add two new ioctls for VFIO containers.
> > >
> > >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> > >                    bind a process to a pasid or bind a guest pasid
> > >                    to a device, this is indicated by type
> > >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> > >                    unbind a process to a pasid or unbind a guest pasid
> > >                    to a device, also indicated by type
> > >   - Bind type:
> > > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
> > >                    to a device
> > > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> > >                    structure to host iommu. e.g. guest page table
> > >
> > > *) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 136  
> > ++++++++++++++++++++++++++++++++++++++++  
> > >  include/uapi/linux/vfio.h       |  44 +++++++++++++
> > >  2 files changed, 180 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index 3d73a7d..1a27e25 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2325,6 +2325,104 @@ static int vfio_iommu_type1_pasid_free(struct  
> > vfio_iommu *iommu,  
> > >  	return ret;
> > >  }
> > >
> > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *ustruct =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_bind_gpasid(dc->domain, dev, ustruct);
> > > +}
> > > +
> > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *ustruct =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > +						ustruct->hpasid);
> > > +}
> > > +
> > > +/*
> > > + * unbind specific gpasid, caller of this function requires hold
> > > + * vfio_iommu->lock
> > > + */
> > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > > +		  struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	return vfio_iommu_lookup_dev(iommu, vfio_unbind_gpasid_fn, gbind_data);
> > > +}
> > > +
> > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > +					    void __user *arg,
> > > +					    struct vfio_iommu_type1_bind *bind)
> > > +{
> > > +	struct iommu_gpasid_bind_data gbind_data;
> > > +	unsigned long minsz;
> > > +	int ret = 0;
> > > +
> > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > +	if (bind->argsz < minsz)
> > > +		return -EINVAL;
> > > +
> > > +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> > > +		return -EFAULT;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_lookup_dev(iommu, vfio_bind_gpasid_fn, &gbind_data);
> > > +	/*
> > > +	 * If bind failed, it may not be a total failure. Some devices within
> > > +	 * the iommu group may have bind successfully. Although we don't enable
> > > +	 * pasid capability for non-singletion iommu groups, a unbind operation
> > > +	 * would be helpful to ensure no partial binding for an iommu group.
> > > +	 */
> > > +	if (ret)
> > > +		/*
> > > +		 * Undo all binds that already succeeded, no need to check the
> > > +		 * return value here since some device within the group has no
> > > +		 * successful bind when coming to this place switch.
> > > +		 */
> > > +		vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > +					    void __user *arg,
> > > +					    struct vfio_iommu_type1_bind *bind)
> > > +{
> > > +	struct iommu_gpasid_bind_data gbind_data;
> > > +	unsigned long minsz;
> > > +	int ret = 0;
> > > +
> > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > +	if (bind->argsz < minsz)
> > > +		return -EINVAL;  
> > 
> > But gbind_data can change size if new vendor specific data is added to
> > the union, so kernel updates break existing userspace.  Fail.  
> 
> yes, we have a version field in struct iommu_gpasid_bind_data. How
> about doing sanity check per versions? kernel knows the gbind_data
> size of specific versions. Does it make sense? If yes, I'll also apply it
> to the other sanity check in this series to avoid userspace fail after
> kernel update.

Has it already been decided that the version field will be updated for
every addition to the union?  It seems there are two options, either
the version definition includes the possible contents of the union,
which means we need to support multiple versions concurrently in the
kernel to maintain compatibility with userspace and follow deprecation
protocols for removing that support, or we need to consider version to
be the general form of the structure and interpret the format field to
determine necessary length to copy from the user.
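
To make the second option concrete, here is a rough sketch that compiles
in userspace. It is not code from this series: the ex_* structures, the
two format constants and the payload layouts are all made up for
illustration. The only point is that the length to copy is derived from
the format field, so growing the union for a new format does not change
what existing formats require.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format identifiers, not the values used by the series. */
#define EX_PASID_FORMAT_VTD	1u
#define EX_PASID_FORMAT_VTD2	2u

/* General form of the bind data; this is what 'version' would describe. */
struct ex_gpasid_bind_hdr {
	uint32_t version;
	uint32_t format;
	uint64_t gpgd;
	uint64_t hpasid;
};

/* Hypothetical vendor payloads of different sizes. */
struct ex_bind_data_vtd {
	uint64_t flags;
	uint32_t pat;
	uint32_t emt;
};

struct ex_bind_data_vtd2 {
	uint64_t flags;
	uint32_t pat;
	uint32_t emt;
	uint64_t extra;
};

static_assert(sizeof(struct ex_bind_data_vtd) <
	      sizeof(struct ex_bind_data_vtd2),
	      "the newer payload is larger in this example");

/* Bytes to copy from the user for a given format, 0 if unknown. */
static inline size_t ex_bind_data_size(uint32_t format)
{
	switch (format) {
	case EX_PASID_FORMAT_VTD:
		return sizeof(struct ex_gpasid_bind_hdr) +
		       sizeof(struct ex_bind_data_vtd);
	case EX_PASID_FORMAT_VTD2:
		return sizeof(struct ex_gpasid_bind_hdr) +
		       sizeof(struct ex_bind_data_vtd2);
	default:
		return 0;
	}
}

/* Validate the user-supplied argsz against the format's own size. */
static inline int ex_check_argsz(uint32_t argsz, uint32_t format)
{
	size_t need = ex_bind_data_size(format);

	if (!need)
		return -EINVAL;	/* format not known to this kernel */
	if (argsz < need)
		return -EINVAL;	/* buffer too small for this format */
	return 0;
}
```

With something of this shape, a userspace built against the original
format keeps passing the check after a larger format is added, which is
the compatibility property the sizeof(gbind_data) test lacks.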

> > > +
> > > +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> > > +		return -EFAULT;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2484,6 +2582,44 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  		default:
> > >  			return -EINVAL;
> > >  		}
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > +		struct vfio_iommu_type1_bind bind;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> > > +
> > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (bind.argsz < minsz)
> > > +			return -EINVAL;
> > > +
> > > +		switch (bind.bind_type) {
> > > +		case VFIO_IOMMU_BIND_GUEST_PASID:
> > > +			return vfio_iommu_type1_bind_gpasid(iommu,
> > > +					(void __user *)(arg + minsz), &bind);  
> > 
> > Why are we defining BIND_PROCESS if it's not supported?  How does the
> > user learn it's not supported?  
> 
> I think I should drop it so far since I only add BIND_GUEST_PASID. I think
> Jean Philippe may need it in his native SVA enabling patchset. For the way
> to let user learn it, may be using VFIO_IOMMU_GET_INFO as you mentioned
> below?
> 
> >   
> > > +		default:
> > > +			return -EINVAL;
> > > +		}
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_UNBIND) {
> > > +		struct vfio_iommu_type1_bind bind;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> > > +
> > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (bind.argsz < minsz)
> > > +			return -EINVAL;
> > > +
> > > +		switch (bind.bind_type) {
> > > +		case VFIO_IOMMU_BIND_GUEST_PASID:
> > > +			return vfio_iommu_type1_unbind_gpasid(iommu,
> > > +					(void __user *)(arg + minsz), &bind);
> > > +		default:
> > > +			return -EINVAL;
> > > +		}
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 04de290..78e8c64 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -832,6 +832,50 @@ struct vfio_iommu_type1_pasid_request {
> > >   */
> > >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> > >
> > > +enum vfio_iommu_bind_type {
> > > +	VFIO_IOMMU_BIND_PROCESS,
> > > +	VFIO_IOMMU_BIND_GUEST_PASID,
> > > +};
> > > +
> > > +/*
> > > + * Supported types:
> > > + *	- VFIO_IOMMU_BIND_GUEST_PASID: bind guest pasid, which invoked
> > > + *			by guest, it takes iommu_gpasid_bind_data in data.
> > > + */
> > > +struct vfio_iommu_type1_bind {
> > > +	__u32				argsz;
> > > +	enum vfio_iommu_bind_type	bind_type;
> > > +	__u8				data[];
> > > +};  
> > 
> > I don't think enum defines a compiler invariant data size.  We can't
> > use it for a kernel/user interface.  Also why no flags field as is
> > essentially standard for every vfio ioctl?  Couldn't we specify
> > process/guest-pasid with flags?  
> 
> I remember there is an early comment in community which pointed out
> that using flags potentially allows to config multiples types in one IOCTL.
> Regards to it, defining explicit emums avoids it. But I agree with you,
> it makes variant size. I'll fix it if this matter more.
> 
> > For that matter couldn't we specify
> > bind/unbind using a single ioctl?  I think that would be more
> > consistent with the pasid alloc/free ioctl in the previous patch.  
> 
> yes, let me make it in next version.
> 
> > Why are we appending opaque data to the end of the structure when
> > clearly we expect a struct iommu_gpasid_bind_data?  
> 
> This is due to the intention to support BIND_GUEST_PASID and
> BIND_PROCESS with a single IOCTL. Maybe we can use a separate
> IOCTL for BIND_PROCESS. what's your opinion here?

If the ioctls have similar purpose and form, then re-using a single
ioctl might make sense, but BIND_PROCESS is only a place-holder in this
series, which is not acceptable.  A dual purpose ioctl does not
preclude that we could also use a union for the data field to make the
structure well specified.
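
For illustration, a well specified layout could look roughly like the
sketch below. None of these names are proposed uapi: the ex_* types, the
flag bits and the member layouts are assumptions, and stdint types stand
in for __u32/__u64 so it builds in userspace. It uses fixed-width fields
instead of an enum, a flags word carrying the bind type, and a union for
the payload. The small helper shows one way to address the concern that
flags could select several types at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bind-type bits carried in 'flags'. */
#define EX_BIND_FLAG_GUEST_PASID	(1u << 0)
#define EX_BIND_FLAG_PROCESS		(1u << 1)
#define EX_BIND_TYPE_MASK \
	(EX_BIND_FLAG_GUEST_PASID | EX_BIND_FLAG_PROCESS)

/* Hypothetical payloads; real ones would come from the iommu uapi. */
struct ex_bind_gpasid {
	uint64_t gpgd;
	uint64_t hpasid;
};

struct ex_bind_process {
	uint32_t flags;
	uint32_t pasid;
	int32_t  pid;
	uint32_t pad;	/* explicit padding keeps the size fixed */
};

struct ex_iommu_bind {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct ex_bind_gpasid  gpasid;
		struct ex_bind_process process;
	} data;
};

/* The layout does not depend on how a compiler sizes an enum. */
static_assert(offsetof(struct ex_iommu_bind, data) == 8,
	      "payload starts right after argsz and flags");
static_assert(sizeof(struct ex_iommu_bind) == 24,
	      "fixed-size structure");

/* Accept only known bits, and exactly one bind type per call. */
static inline int ex_bind_flags_valid(uint32_t flags)
{
	uint32_t type = flags & EX_BIND_TYPE_MASK;

	if (flags & ~EX_BIND_TYPE_MASK)
		return 0;
	return type != 0 && (type & (type - 1)) == 0;
}
```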
 
> > That bind data
> > structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How does
> > a user determine what formats are accepted from within the vfio API (or
> > even outside of the vfio API)?  
> 
> The info is provided by vIOMMU emulator (e.g. virtual VT-d). The vSVA patch
> from Jacob has a sanity check on it.
> https://lkml.org/lkml/2019/10/28/873

The vIOMMU emulator runs at a layer above vfio.  How does the vIOMMU
emulator know that the vfio interface supports virtual VT-d?  IMO, it's
not acceptable that the user simply assume that an Intel host platform
supports VT-d.  For example, consider what happens when we need to
define IOMMU_PASID_FORMAT_INTEL_VTDv2.  How would the user learn that
VTDv2 is supported and the original VTD format is not supported?
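
One possible shape for an answer, purely as a sketch: the kernel reports
the accepted formats as a bitmap and the user picks from it before ever
attempting a bind. Everything here is hypothetical, including the
capability structure, the format numbers and the idea that it would be
returned through VFIO_IOMMU_GET_INFO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical format numbers, for illustration only (0 means "none"). */
#define EX_FORMAT_VTD	1u
#define EX_FORMAT_VTD2	2u

/* Hypothetical capability payload: bit N set means format N is accepted. */
struct ex_iommu_info_cap_bind {
	uint64_t formats;
};

static_assert(EX_FORMAT_VTD2 < 64, "format numbers must fit the bitmap");

/* Non-zero if the reported bitmap includes 'format'. */
static inline int ex_format_supported(const struct ex_iommu_info_cap_bind *cap,
				      uint32_t format)
{
	if (format >= 64)
		return 0;
	return (int)((cap->formats >> format) & 1u);
}

/* First entry of 'wanted' that the host accepts, or 0 if there is none. */
static inline uint32_t ex_pick_format(const struct ex_iommu_info_cap_bind *cap,
				      const uint32_t *wanted, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ex_format_supported(cap, wanted[i]))
			return wanted[i];
	return 0;
}
```

A vIOMMU emulator that only implements the original format would then
see no match on a host reporting only the newer one, and could refuse to
expose the feature instead of failing the ioctl later.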

> > > +
> > > +/*
> > > + * VFIO_IOMMU_BIND - _IOWR(VFIO_TYPE, VFIO_BASE + 28, struct  
> > vfio_iommu_bind)
> >                             ^
> > The semantics appear to just be _IOW, nothing is written back to the
> > userspace buffer on return.  
> 
> will fix it. thanks.
> 
> > > + *
> > > + * Manage address spaces of devices in this container. Initially a TYPE1
> > > + * container can only have one address space, managed with
> > > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > > + *
> > > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both  
> > MAP/UNMAP  
> > > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
> > > + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> > > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP  
> > controls  
> > > + * non-PASID traffic and BIND controls PASID traffic. But this depends on the
> > > + * underlying IOMMU architecture and isn't guaranteed.
> > > + *
> > > + * Availability of this feature depends on the device, its bus, the underlying
> > > + * IOMMU and the CPU architecture.  
> > 
> > And the user discovers this is available by...?  There's no probe here,
> > are they left only to setup a VM to the point of trying to use this
> > before they fail the ioctl?  Could VFIO_IOMMU_GET_INFO fill this gap?
> > Thanks,  
> 
> I think VFIO_IOMMU_GET_INFO could help. let me extend it to fill this gap
> if you agree.

It's a start.  Thanks,

Alex



* RE: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-12 17:25       ` Alex Williamson
@ 2019-11-13  7:43         ` Liu, Yi L
  2019-11-13 10:29           ` Jean-Philippe Brucker
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-13  7:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm, Lu,
	Baolu, Wu, Hao

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, November 13, 2019 1:26 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> 
> On Tue, 12 Nov 2019 11:21:40 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson < alex.williamson@redhat.com >
> > > Sent: Friday, November 8, 2019 7:21 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > >
> > > On Thu, 24 Oct 2019 08:26:23 -0400
> > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > >
> > > > This patch adds vfio support to bind guest translation structure
> > > > to host iommu. VFIO exposes iommu programming capability to user-
> > > > space. Guest is a user-space application in host under KVM solution.
> > > > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > > > structure. And this part should be passdown to host to enable nested
> > > > translation (or say two stage translation). This patch reuses the
> > > > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > > > bind type for binding guest owned translation structure to host.
> > > >
> > > > *) Add two new ioctls for VFIO containers.
> > > >
> > > >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> > > >                    bind a process to a pasid or bind a guest pasid
> > > >                    to a device, this is indicated by type
> > > >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> > > >                    unbind a process to a pasid or unbind a guest pasid
> > > >                    to a device, also indicated by type
> > > >   - Bind type:
> > > > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
> > > >                    to a device
> > > > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> > > >                    structure to host iommu. e.g. guest page table
> > > >
> > > > *) Code logic in vfio_iommu_type1_ioctl() to handle
> VFIO_IOMMU_BIND/UNBIND
> > > >
[...]
> > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > > +					    void __user *arg,
> > > > +					    struct vfio_iommu_type1_bind *bind)
> > > > +{
> > > > +	struct iommu_gpasid_bind_data gbind_data;
> > > > +	unsigned long minsz;
> > > > +	int ret = 0;
> > > > +
> > > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > > +	if (bind->argsz < minsz)
> > > > +		return -EINVAL;
> > >
> > > But gbind_data can change size if new vendor specific data is added to
> > > the union, so kernel updates break existing userspace.  Fail.
> >
> > yes, we have a version field in struct iommu_gpasid_bind_data. How
> > about doing sanity check per versions? kernel knows the gbind_data
> > size of specific versions. Does it make sense? If yes, I'll also apply it
> > to the other sanity check in this series to avoid userspace fail after
> > kernel update.
> 
> Has it already been decided that the version field will be updated for
> every addition to the union?

No, that was just my proposal. Jacob may be able to explain the purpose of
the version field. But incrementing the version for each change in the
union part may be too "frequent" for a uapi version number. I would vote
for the second option you describe below.

> It seems there are two options, either
> the version definition includes the possible contents of the union,
> which means we need to support multiple versions concurrently in the
> kernel to maintain compatibility with userspace and follow deprecation
> protocols for removing that support, or we need to consider version to
> be the general form of the structure and interpret the format field to
> determine necessary length to copy from the user.

As I mentioned above, it may be better to let the @version field cover only
the general fields and let format cover the possible changes in the union.
E.g. IOMMU_PASID_FORMAT_INTEL_VTD2 may mean version 2 of the Intel VT-d
bind data. But either way, I think the kernel needs to maintain multiple
versions to stay compatible with existing userspace, e.g. it may have
multiple versions of struct iommu_gpasid_bind_data_vtd in the union part.

> 
> > > > +
> > > > +	if (copy_from_user(&gbind_data, arg, sizeof(gbind_data)))
> > > > +		return -EFAULT;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, &gbind_data);
> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2484,6 +2582,44 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> > > >  		default:
> > > >  			return -EINVAL;
> > > >  		}
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > > +		struct vfio_iommu_type1_bind bind;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> > > > +
> > > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (bind.argsz < minsz)
> > > > +			return -EINVAL;
> > > > +
> > > > +		switch (bind.bind_type) {
> > > > +		case VFIO_IOMMU_BIND_GUEST_PASID:
> > > > +			return vfio_iommu_type1_bind_gpasid(iommu,
> > > > +					(void __user *)(arg + minsz), &bind);
> > >
> > > Why are we defining BIND_PROCESS if it's not supported?  How does the
> > > user learn it's not supported?
> >
> > I think I should drop it so far since I only add BIND_GUEST_PASID. I think
> > Jean Philippe may need it in his native SVA enabling patchset. For the way
> > to let user learn it, may be using VFIO_IOMMU_GET_INFO as you mentioned
> > below?
> >
> > >
> > > > +		default:
> > > > +			return -EINVAL;
> > > > +		}
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_UNBIND) {
> > > > +		struct vfio_iommu_type1_bind bind;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, bind_type);
> > > > +
> > > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (bind.argsz < minsz)
> > > > +			return -EINVAL;
> > > > +
> > > > +		switch (bind.bind_type) {
> > > > +		case VFIO_IOMMU_BIND_GUEST_PASID:
> > > > +			return vfio_iommu_type1_unbind_gpasid(iommu,
> > > > +					(void __user *)(arg + minsz), &bind);
> > > > +		default:
> > > > +			return -EINVAL;
> > > > +		}
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 04de290..78e8c64 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -832,6 +832,50 @@ struct vfio_iommu_type1_pasid_request {
> > > >   */
> > > >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 27)
> > > >
> > > > +enum vfio_iommu_bind_type {
> > > > +	VFIO_IOMMU_BIND_PROCESS,
> > > > +	VFIO_IOMMU_BIND_GUEST_PASID,
> > > > +};
> > > > +
> > > > +/*
> > > > + * Supported types:
> > > > + *	- VFIO_IOMMU_BIND_GUEST_PASID: bind guest pasid, which
> invoked
> > > > + *			by guest, it takes iommu_gpasid_bind_data in data.
> > > > + */
> > > > +struct vfio_iommu_type1_bind {
> > > > +	__u32				argsz;
> > > > +	enum vfio_iommu_bind_type	bind_type;
> > > > +	__u8				data[];
> > > > +};
> > >
> > > I don't think enum defines a compiler invariant data size.  We can't
> > > use it for a kernel/user interface.  Also why no flags field as is
> > > essentially standard for every vfio ioctl?  Couldn't we specify
> > > process/guest-pasid with flags?
> >
> > I remember there is an early comment in community which pointed out
> > that using flags potentially allows to config multiples types in one IOCTL.
> > Regards to it, defining explicit emums avoids it. But I agree with you,
> > it makes variant size. I'll fix it if this matter more.
> >
> > > For that matter couldn't we specify
> > > bind/unbind using a single ioctl?  I think that would be more
> > > consistent with the pasid alloc/free ioctl in the previous patch.
> >
> > yes, let me make it in next version.
> >
> > > Why are we appending opaque data to the end of the structure when
> > > clearly we expect a struct iommu_gpasid_bind_data?
> >
> > This is due to the intention to support BIND_GUEST_PASID and
> > BIND_PROCESS with a single IOCTL. Maybe we can use a separate
> > IOCTL for BIND_PROCESS. what's your opinion here?
> 
> If the ioctls have similar purpose and form, then re-using a single
> ioctl might make sense, but BIND_PROCESS is only a place-holder in this
> series, which is not acceptable.  A dual purpose ioctl does not
> preclude that we could also use a union for the data field to make the
> structure well specified.

Yes, BIND_PROCESS is only a place-holder here. From the kernel's point of
view, both BIND_GUEST_PASID and BIND_PROCESS are bind requests from
userspace, so the purposes are aligned. Below is the content the @data[]
field is supposed to convey for BIND_PROCESS. If we use a union, it would
leave space for extending it to support BIND_PROCESS. With only data[], it
is a little bit confusing why we define it in such a manner if BIND_PROCESS
is included in this series. Please feel free to let me know which one suits
better.

+struct vfio_iommu_type1_bind_process {
+	__u32	flags;
+#define VFIO_IOMMU_BIND_PID		(1 << 0)
+	__u32	pasid;
+	__s32	pid;
+};
https://patchwork.kernel.org/patch/10394927/

> > > That bind data
> > > structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How
> does
> > > a user determine what formats are accepted from within the vfio API (or
> > > even outside of the vfio API)?
> >
> > The info is provided by vIOMMU emulator (e.g. virtual VT-d). The vSVA patch
> > from Jacob has a sanity check on it.
> > https://lkml.org/lkml/2019/10/28/873
> 
> The vIOMMU emulator runs at a layer above vfio.  How does the vIOMMU
> emulator know that the vfio interface supports virtual VT-d?  IMO, it's
> not acceptable that the user simply assume that an Intel host platform
> supports VT-d.  For example, consider what happens when we need to
> define IOMMU_PASID_FORMAT_INTEL_VTDv2.  How would the user learn that
> VTDv2 is supported and the original VTD format is not supported?

I guess this is another piece of info VFIO_IOMMU_GET_INFO should provide.
It makes sense for vfio to be aware of what platform it is running on, right?
Once vfio has that info, we could let vfio fill in the format info. Is that
the correct direction?

> 
> > > > +
> > > > +/*
> > > > + * VFIO_IOMMU_BIND - _IOWR(VFIO_TYPE, VFIO_BASE + 28, struct
> > > vfio_iommu_bind)
> > >                             ^
> > > The semantics appear to just be _IOW, nothing is written back to the
> > > userspace buffer on return.
> >
> > will fix it. thanks.
> >
> > > > + *
> > > > + * Manage address spaces of devices in this container. Initially a TYPE1
> > > > + * container can only have one address space, managed with
> > > > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> > > > + *
> > > > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by
> both
> > > MAP/UNMAP
> > > > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host)
> page
> > > > + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> > > > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP
> > > controls
> > > > + * non-PASID traffic and BIND controls PASID traffic. But this depends on the
> > > > + * underlying IOMMU architecture and isn't guaranteed.
> > > > + *
> > > > + * Availability of this feature depends on the device, its bus, the underlying
> > > > + * IOMMU and the CPU architecture.
> > >
> > > And the user discovers this is available by...?  There's no probe here,
> > > are they left only to setup a VM to the point of trying to use this
> > > before they fail the ioctl?  Could VFIO_IOMMU_GET_INFO fill this gap?
> > > Thanks,
> >
> > I think VFIO_IOMMU_GET_INFO could help. let me extend it to fill this gap
> > if you agree.
> 
> It's a start.  Thanks,

Got it. I will show the code in the next version. Thanks for your patient review.

> Alex

Regards,
Yi Liu


* Re: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
  2019-11-06  1:31         ` Liu, Yi L
@ 2019-11-13  7:50           ` Auger Eric
  0 siblings, 0 replies; 32+ messages in thread
From: Auger Eric @ 2019-11-13  7:50 UTC (permalink / raw)
  To: Liu, Yi L, Alex Williamson
  Cc: Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe.brucker, peterx, iommu, kvm

Hi Yi,

On 11/6/19 2:31 AM, Liu, Yi L wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Wednesday, November 6, 2019 6:42 AM
>> To: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
>>
>> On Fri, 25 Oct 2019 11:20:40 +0000
>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>
>>> Hi Kevin,
>>>
>>>> From: Tian, Kevin
>>>> Sent: Friday, October 25, 2019 5:14 PM
>>>> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
>>>> Subject: RE: [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE
>>>>
>>>>> From: Liu, Yi L
>>>>> Sent: Thursday, October 24, 2019 8:26 PM
>>>>>
>>>>> From: Liu Yi L <yi.l.liu@linux.intel.com>
>>>>>
>>>>> When the guest "owns" the stage 1 translation structures,  the
>>>>> host IOMMU driver has no knowledge of caching structure updates
>>>>> unless the guest invalidation requests are trapped and passed down to the host.
>>>>>
>>>>> This patch adds the VFIO_IOMMU_CACHE_INVALIDATE ioctl with aims at
>>>>> propagating guest stage1 IOMMU cache invalidations to the host.
>>>>>
>>>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>>>> Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
>>>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>>>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 55
>>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>>  include/uapi/linux/vfio.h       | 13 ++++++++++
>>>>>  2 files changed, 68 insertions(+)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c
>>>>> b/drivers/vfio/vfio_iommu_type1.c index 96fddc1d..cd8d3a5 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -124,6 +124,34 @@ struct vfio_regions {
>>>>>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>>>>>  					(!list_empty(&iommu->domain_list))
>>>>>
>>>>> +struct domain_capsule {
>>>>> +	struct iommu_domain *domain;
>>>>> +	void *data;
>>>>> +};
>>>>> +
>>>>> +/* iommu->lock must be held */
>>>>> +static int
>>>>> +vfio_iommu_lookup_dev(struct vfio_iommu *iommu,
>>>>> +		      int (*fn)(struct device *dev, void *data),
>>>>> +		      void *data)
>>>>
>>>> 'lookup' usually means find a device and then return. But the real
>>>> purpose here is to loop all the devices within this container and
>>>> then do something. Does it make more sense to be vfio_iommu_for_each_dev?
>>
>> +1
>>
>>> yep, I can replace it.
>>>
>>>>
>>>>> +{
>>>>> +	struct domain_capsule dc = {.data = data};
>>>>> +	struct vfio_domain *d;
>>> [...]
>>>> 2315,6 +2352,24 @@
>>>>> static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>>
>>>>>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>>>>>  			-EFAULT : 0;
>>>>> +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
>>>>> +		struct vfio_iommu_type1_cache_invalidate ustruct;
>>>>
>>>> it's weird to call a variable as struct.
>>>
>>> Will fix it.
>>>
>>>>> +		int ret;
>>>>> +
>>>>> +		minsz = offsetofend(struct
>>>>> vfio_iommu_type1_cache_invalidate,
>>>>> +				    info);
>>>>> +
>>>>> +		if (copy_from_user(&ustruct, (void __user *)arg, minsz))
>>>>> +			return -EFAULT;
>>>>> +
>>>>> +		if (ustruct.argsz < minsz || ustruct.flags)
>>>>> +			return -EINVAL;
>>>>> +
>>>>> +		mutex_lock(&iommu->lock);
>>>>> +		ret = vfio_iommu_lookup_dev(iommu, vfio_cache_inv_fn,
>>>>> +					    &ustruct);
>>>>> +		mutex_unlock(&iommu->lock);
>>>>> +		return ret;
>>>>>  	}
>>>>>
>>>>>  	return -ENOTTY;
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index 9e843a1..ccf60a2 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -794,6 +794,19 @@ struct vfio_iommu_type1_dma_unmap {
>>>>>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>>>>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>>>>
>>>>> +/**
>>>>> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOWR(VFIO_TYPE, VFIO_BASE +
>>>>> 24,
>>
>> What's going on with these ioctl numbers?  AFAICT[1] we've used up through
>> VFIO_BASE + 21, this jumps to 24, the next patch skips to 27, then the last patch fills
>> in 28 & 29.  Thanks,
> 
> Hi Alex,
> 
> I rebase my patch to Eric's nested stage translation patches. His base also introduced
> IOCTLs. I should have made it better. I'll try to sync with Eric to serialize the IOCTLs.
> 
> [PATCH v6 00/22] SMMUv3 Nested Stage Setup by Eric Auger
> https://lkml.org/lkml/2019/3/17/124

Feel free to choose your IOCTL numbers without taking my series into account.
I will adapt to yours if my work gets unblocked at some point.

Thanks

Eric
> 
> Thanks,
> Yi Liu
> 
>> Alex
>>
>> [1] git grep -h VFIO_BASE | grep "VFIO_BASE +" | grep -e ^#define | \
>>     awk '{print $NF}' | tr -d ')' | sort -u -n
> 



* Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-13  7:43         ` Liu, Yi L
@ 2019-11-13 10:29           ` Jean-Philippe Brucker
  2019-11-13 11:30             ` Liu, Yi L
  2019-11-25  7:45             ` Liu, Yi L
  0 siblings, 2 replies; 32+ messages in thread
From: Jean-Philippe Brucker @ 2019-11-13 10:29 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Alex Williamson, Tian, Kevin, Raj, Ashok, kvm,
	jean-philippe.brucker, Tian, Jun J, iommu, Sun, Yi Y, Wu, Hao,
	Lu, Baolu

On Wed, Nov 13, 2019 at 07:43:43AM +0000, Liu, Yi L wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, November 13, 2019 1:26 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > 
> > On Tue, 12 Nov 2019 11:21:40 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > 
> > > > From: Alex Williamson < alex.williamson@redhat.com >
> > > > Sent: Friday, November 8, 2019 7:21 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > > >
> > > > On Thu, 24 Oct 2019 08:26:23 -0400
> > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > >
> > > > > This patch adds vfio support to bind guest translation structure
> > > > > to host iommu. VFIO exposes iommu programming capability to user-
> > > > > space. Guest is a user-space application in host under KVM solution.
> > > > > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > > > > structure. And this part should be passdown to host to enable nested
> > > > > translation (or say two stage translation). This patch reuses the
> > > > > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > > > > bind type for binding guest owned translation structure to host.
> > > > >
> > > > > *) Add two new ioctls for VFIO containers.
> > > > >
> > > > >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> > > > >                    bind a process to a pasid or bind a guest pasid
> > > > >                    to a device, this is indicated by type
> > > > >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> > > > >                    unbind a process to a pasid or unbind a guest pasid
> > > > >                    to a device, also indicated by type
> > > > >   - Bind type:
> > > > > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
> > > > >                    to a device
> > > > > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> > > > >                    structure to host iommu. e.g. guest page table
> > > > >
> > > > > > *) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
> > > > >
> [...]
> > > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > > > +					    void __user *arg,
> > > > > +					    struct vfio_iommu_type1_bind *bind)
> > > > > +{
> > > > > +	struct iommu_gpasid_bind_data gbind_data;
> > > > > +	unsigned long minsz;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > > > +	if (bind->argsz < minsz)
> > > > > +		return -EINVAL;
> > > >
> > > > But gbind_data can change size if new vendor specific data is added to
> > > > the union, so kernel updates break existing userspace.  Fail.

I guess we could take minsz up to the vendor-specific data, copy @format,
and then check the size of vendor-specific data?
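
To make that concrete, here is a minimal user-space sketch of the size
check. The struct layout, the format constants and the helper names are
stand-ins invented for this example; they are not the uapi definitions from
the vSVA series. In the kernel, the common header would first be copied with
copy_from_user() so that @format can be read before the rest is validated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real uapi definitions. */
enum { FMT_INTEL_VTD = 1, FMT_FUTURE = 2 };

struct vtd_data    { uint64_t flags; uint32_t pat; uint32_t emt; };
struct future_data { uint64_t words[4]; };

struct gbind_data {
	uint32_t version;
	uint32_t format;
	uint64_t gpgd;
	union {
		struct vtd_data vtd;
		struct future_data future;	/* added by a later kernel */
	} vendor;
};

/* Bytes user space must supply for @format, or 0 if the format is unknown. */
static size_t gbind_required_size(uint32_t format)
{
	size_t hdr = offsetof(struct gbind_data, vendor);

	switch (format) {
	case FMT_INTEL_VTD:
		return hdr + sizeof(struct vtd_data);
	case FMT_FUTURE:
		return hdr + sizeof(struct future_data);
	default:
		return 0;
	}
}

/* Returns 0 if @argsz covers the data for @format, -1 otherwise. */
static int gbind_check_argsz(size_t argsz, uint32_t format)
{
	size_t need = gbind_required_size(format);

	return (need && argsz >= need) ? 0 : -1;
}
```

With this shape, growing the union for a new format does not change the
size required from user space that still passes the old format.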

> > >
> > > yes, we have a version field in struct iommu_gpasid_bind_data. How
> > > about doing sanity check per versions? kernel knows the gbind_data
> > > size of specific versions. Does it make sense? If yes, I'll also apply it
> > > to the other sanity check in this series to avoid userspace fail after
> > > kernel update.
> > 
> > Has it already been decided that the version field will be updated for
> > every addition to the union?
> 
> No, just my proposal. Jacob may help to explain the purpose of version
> field. But if we may be too  "frequent" for an uapi version number updating
> if we inc version for each change in the union part. I may vote for the
> second option from you below.
> 
> > It seems there are two options, either
> > the version definition includes the possible contents of the union,
> > which means we need to support multiple versions concurrently in the
> > kernel to maintain compatibility with userspace and follow deprecation
> > protocols for removing that support, or we need to consider version to
> > be the general form of the structure and interpret the format field to
> > determine necessary length to copy from the user.
> 
> As I mentioned above, may be better to let @version field only over the
> general fields and let format to cover the possible changes in union. e.g.
> IOMMU_PASID_FORMAT_INTEL_VTD2 may means version 2 of Intel
> VT-d bind. But either way, I think we need to let kernel maintain multiple
> versions to support compatible userspace. e.g. may have multiple versions
> iommu_gpasid_bind_data_vtd struct in the union part.

I couldn't find where the @version field originated in our old
discussions, but I believe our plan for allowing future extensions was:

* Add new vendor-specific data by introducing a new format
  (IOMMU_PASID_FORMAT_INTEL_VTD2, IOMMU_PASID_FORMAT_ARM_SMMUV2...), and
  extend the union.

* Add a new common field, if it fits in the existing padding bytes, by
  adding a flag (IOMMU_SVA_GPASID_*).

* Add a new common field, if it doesn't fit in the current padding bytes,
  or completely change the structure layout, by introducing a new version
  (IOMMU_GPASID_BIND_VERSION_2). In that case the kernel has to handle
  both new and old structure versions. It would have both
  iommu_gpasid_bind_data and iommu_gpasid_bind_data_v2 structs.

I think iommu_cache_invalidate_info and iommu_page_response use the same
scheme. iommu_fault is a bit more complicated because it's
kernel->userspace and requires some negotiation:
https://lore.kernel.org/linux-iommu/77405d39-81a4-d9a8-5d35-27602199867a@arm.com/
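
As a rough illustration of how a kernel could validate the three extension
points listed above (version, flags, format), consider the sketch below.
Every name in it is hypothetical and exists only for this example; the real
structures carry more fields.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* All names below are hypothetical and exist only for this sketch. */
#define BIND_VERSION_1		1u
#define BIND_FLAG_KNOWN_MASK	0x1u	/* flags this kernel understands */
#define BIND_FORMAT_VTD		1u

struct bind_hdr {
	uint32_t version;	/* bumped only when the common layout changes */
	uint32_t format;	/* selects the vendor-specific union member */
	uint64_t flags;		/* announce common fields stored in padding */
};

/* Returns 0 if this kernel can interpret the header, -EINVAL otherwise. */
static int bind_hdr_validate(const struct bind_hdr *h)
{
	if (h->version != BIND_VERSION_1)
		return -EINVAL;	/* a v2 layout would need its own parser */
	if (h->flags & ~(uint64_t)BIND_FLAG_KNOWN_MASK)
		return -EINVAL;	/* unknown flag: a field we cannot interpret */
	if (h->format != BIND_FORMAT_VTD)
		return -EINVAL;	/* unknown vendor format */
	return 0;
}
```

Rejecting unknown flag bits is what lets a later kernel reuse the padding
bytes safely: old kernels refuse the request rather than ignore the field.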

[...]
> > If the ioctls have similar purpose and form, then re-using a single
> > ioctl might make sense, but BIND_PROCESS is only a place-holder in this
> > series, which is not acceptable.  A dual purpose ioctl does not
> > preclude that we could also use a union for the data field to make the
> > structure well specified.
> 
> yes, BIND_PROCESS is only a place-holder here. From kernel p.o.v., both
> BIND_GUEST_PASID and BIND_PROCESS are bind requests from userspace.
> So the purposes are aligned. Below is the content the @data[] field
> supposed to convey for BIND_PROCESS. If we use union, it would leave
> space for extending it to support BIND_PROCESS. If only data[], it is a little
> bit confusing why we define it in such manner if BIND_PROCESS is included
> in this series. Please feel free let me know which one suits better.
> 
> +struct vfio_iommu_type1_bind_process {
> +	__u32	flags;
> +#define VFIO_IOMMU_BIND_PID		(1 << 0)
> +	__u32	pasid;
> +	__s32	pid;
> +};
> https://patchwork.kernel.org/patch/10394927/

Note that I don't plan to upstream BIND_PROCESS at the moment. It was
useful for testing but I don't know of anyone actually needing it.

> > > > That bind data
> > > > structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How does
> > > > a user determine what formats are accepted from within the vfio API (or
> > > > even outside of the vfio API)?
> > >
> > > The info is provided by vIOMMU emulator (e.g. virtual VT-d). The vSVA patch
> > > from Jacob has a sanity check on it.
> > > https://lkml.org/lkml/2019/10/28/873
> > 
> > The vIOMMU emulator runs at a layer above vfio.  How does the vIOMMU
> > emulator know that the vfio interface supports virtual VT-d?  IMO, it's
> > not acceptable that the user simply assume that an Intel host platform
> > supports VT-d.  For example, consider what happens when we need to
> > define IOMMU_PASID_FORMAT_INTEL_VTDv2.  How would the user learn that
> > VTDv2 is supported and the original VTD format is not supported?
> 
> I guess this may be another info VFIO_IOMMU_GET_INFO should provide.
> It makes sense that vfio be aware of what platform it is running on. right?
> After vfio gets the info, may let vfio fill in the format info. Is it the correct
> direction?

I thought you were planning to put that information in sysfs?  We last
discussed this over a year ago so I don't remember where we left it. I
know Alex isn't keen on putting in sysfs what can be communicated through
VFIO, but it is a convenient way to describe IOMMU features:
http://www.linux-arm.org/git?p=linux-jpb.git;a=commitdiff;h=665370d5b5e0022c24b2d2b57975ef6fe7b40870;hp=7ce780d838889b53f5e04ba5d444520621261eda

My problem with GET_INFO was that it could be difficult to extend, and
to describe things like a variable-size list of supported page table
formats, but I guess the new info capabilities make this easier.
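
For reference, a sketch of how user space might walk an info capability
chain to find such a list. The header mirrors the layout of struct
vfio_info_cap_header; the capability id and the cap_pasid_formats payload
are made up for this example, since no such capability is defined in this
thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct vfio_info_cap_header in <linux/vfio.h>. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;		/* offset from start of info buffer, 0 = end */
};

/* Hypothetical capability: a variable-size list of PASID table formats. */
#define CAP_PASID_FORMATS	0x7f	/* made-up id for this sketch */
struct cap_pasid_formats {
	struct cap_header header;
	uint32_t nr_formats;
	uint32_t formats[];
};

/*
 * Walk the chain starting at @first_off and return the offset of the
 * capability with @id, or 0 if it is absent or the chain is malformed.
 */
static uint32_t find_cap(const uint8_t *buf, uint32_t size,
			 uint32_t first_off, uint16_t id)
{
	uint32_t off = first_off;

	while (off) {
		struct cap_header h;

		if (off > size || size - off < sizeof(h))
			return 0;
		memcpy(&h, buf + off, sizeof(h));
		if (h.id == id)
			return off;
		/* next == 0 ends the chain; offsets must also advance */
		if (h.next <= off)
			return 0;
		off = h.next;
	}
	return 0;
}
```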

Thanks,
Jean


* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-08 15:15           ` Alex Williamson
@ 2019-11-13 11:03             ` Liu, Yi L
  2019-11-13 15:29               ` Alex Williamson
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-13 11:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, November 8, 2019 11:15 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Fri, 8 Nov 2019 12:23:41 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, November 8, 2019 6:07 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > On Wed, 6 Nov 2019 13:27:26 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > >
> > > > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > > > > > to passdown PASID allocation/free request from the virtual
> > > > > > iommu. This is required to get PASID managed in system-wide.
> > > > > >
> > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > ---
> > > > > > >  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
> > > > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > > > >  2 files changed, 139 insertions(+)
> > > > > >
> > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > index cd8d3a5..3d73a7d 100644
> > > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev,
> > > void
> > > > > *data)
> > > > > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > > > > >  }
> > > > > >
> > > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > > +					 int min_pasid,
> > > > > > +					 int max_pasid)
> > > > > > +{
> > > > > > +	int ret;
> > > > > > +	ioasid_t pasid;
> > > > > > +	struct mm_struct *mm = NULL;
> > > > > > +
> > > > > > +	mutex_lock(&iommu->lock);
> > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > +		ret = -EINVAL;
> > > > > > +		goto out_unlock;
> > > > > > +	}
> > > > > > +	mm = get_task_mm(current);
> > > > > > +	/* Track ioasid allocation owner by mm */
> > > > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > > > > > +				max_pasid, NULL);
> > > > >
> > > > > Are we sure we want to tie this to the task mm vs perhaps the
> > > > > vfio_iommu pointer?
> > > >
> > > > Here we want to have a kind of per-VM mark, which can be used to do
> > > > ownership check on whether a pasid is held by a specific VM. This is
> > > > very important to prevent across VM affect. vfio_iommu pointer is
> > > > competent for vfio as vfio is both pasid alloc requester and pasid
> > > > consumer. e.g. vfio requests pasid alloc from ioasid and also it will
> > > > invoke bind_gpasid(). vfio can either check ownership before invoking
> > > > bind_gpasid() or pass vfio_iommu pointer to iommu driver. But in future,
> > > > there may be other modules which are just consumers of pasid. And they
> > > > also want to do ownership check for a pasid. Then, it would be hard for
> > > > them as they are not the pasid alloc requester. So here better to have
> > > > a system wide structure to perform as the per-VM mark. task mm looks
> > > > to be much competent.
> > >
> > > Ok, so it's intentional to have a VM-wide token.  Elsewhere in the
> > > type1 code (vfio_dma_do_map) we record the task_struct per dma mapping
> > > so that we can get the task mm as needed.  Would the task_struct
> > > pointer provide any advantage?
> >
> > I think we may use task_struct pointer to make type1 code consistent.
> > How do you think?
> 
> If it has the same utility, sure.

thanks, I'll make this change.

> > > Also, an overall question, this provides userspace with pasid alloc and
> > > free ioctls, (1) what prevents a userspace process from consuming every
> > > available pasid, and (2) if the process exits or crashes without
> > > freeing pasids, how are they recovered aside from a reboot?
> >
> > For question (1), I think we only need to take care about malicious
> > userspace process. As vfio usage is under privilege mode, so we may
> > be safe on it so far.
> 
> No, where else do we ever make this assumption?  vfio requires a
> privileged entity to configure the system for vfio, bind devices for
> user use, and grant those devices to the user, but the usage of the
> device is always assumed to be by an unprivileged user.  It is
> absolutely not acceptable require a privileged user.  It's vfio's
> responsibility to protect the system from the user.

My assumption was not precise here, sorry about that. Let me check further
with you to better understand your point. I think the user of vfio (QEMU)
needs root permission so that it can open the vfio fds. At that point, the
user is a privileged one. I guess that is also why vfio can let the user
call VFIO_MAP/UNMAP to configure mappings in the iommu page tables. But
I'm not quite sure when the user would be an unprivileged one.

> > However, we may need to introduce a kind of credit
> > mechanism to protect it. I've thought it, but no good idea yet. Would be
> > happy to hear from you.
> 
> It's a limited system resource and it's unclear how many might
> reasonably used by a user.  I don't have an easy answer.

How about the method below? It is based on an offline chat with Jacob.
a. Set a reasonable default for the initial per-VM quota, e.g. 1000 PASIDs
per process.
b. IOASID should be able to enforce a limit per ioasid_set (which acts as
a kind of per-VM mark).
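
A rough user-space sketch of (a) and (b), only to show the accounting. The
pasid_set type and its helpers are made-up names, not the ioasid API, and
the default of 1000 is just the example value from above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of an ioasid_set with a per-set limit. */
#define DEFAULT_PASID_QUOTA	1000

struct pasid_set {
	unsigned int quota;	/* max PASIDs this set (VM) may hold */
	unsigned int in_use;	/* PASIDs currently allocated */
};

static void pasid_set_init(struct pasid_set *set)
{
	set->quota = DEFAULT_PASID_QUOTA;
	set->in_use = 0;
}

/* Charge one PASID to the set; -ENOSPC once the quota is used up. */
static int pasid_set_charge(struct pasid_set *set)
{
	if (set->in_use >= set->quota)
		return -ENOSPC;
	set->in_use++;
	return 0;
}

/* Return one PASID to the set. */
static void pasid_set_uncharge(struct pasid_set *set)
{
	if (set->in_use)
		set->in_use--;
}
```

The allocator would call the charge helper before handing out an ID and the
uncharge helper on free, so one VM cannot exhaust the system-wide space.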

> > For question (2), I think we need to reclaim the allocated pasids when
> > the vfio container fd is released just like what vfio does to the domain
> > mappings. I didn't add it yet. But I can add it in next version if you think
> > it would make the pasid alloc/free be much sound.
> 
> Consider it required, the interface is susceptible to abuse without it.

Sure, I'll add it in the next version.

> > > > > > +	if (pasid == INVALID_IOASID) {
> > > > > > +		ret = -ENOSPC;
> > > > > > +		goto out_unlock;
> > > > > > +	}
> > > > > > +	ret = pasid;
> > > > > > +out_unlock:
> > > > > > +	mutex_unlock(&iommu->lock);
> > >
> > > What does holding this lock protect?  That the vfio_iommu remains
> > > backed by an iommu during this operation, even though we don't do
> > > anything to release allocated pasids when that iommu backing is removed?
> >
> > yes, it is unnecessary to hold the lock here. At least for the operations in
> > this patch. will remove it. :-)
> >
> > > > > > +	if (mm)
> > > > > > +		mmput(mm);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > > +				       unsigned int pasid)
> > > > > > +{
> > > > > > +	struct mm_struct *mm = NULL;
> > > > > > +	void *pdata;
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	mutex_lock(&iommu->lock);
> > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > +		ret = -EINVAL;
> > > > > > +		goto out_unlock;
> > > > > > +	}
> > > > > > +
> > > > > > +	/**
> > > > > > +	 * REVISIT:
> > > > > > +	 * There are two cases free could fail:
> > > > > > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > > > > > +	 * the set does not match, caller is not permitted to free.
> > > > > > +	 * 2. free before unbind all devices, we can check if ioasid private
> > > > > > +	 * data, if data != NULL, then fail to free.
> > > > > > +	 */
> > > > > > +	mm = get_task_mm(current);
> > > > > > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > > > > > +	if (IS_ERR(pdata)) {
> > > > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > > > +			pr_err("PASID %u is not allocated\n", pasid);
> > > > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > > > +			pr_err("Free PASID %u by non-owner, denied",
> pasid);
> > > > > > +		else
> > > > > > +			pr_err("Error searching PASID %u\n", pasid);
> > > > >
> > > > > This should be removed, errno is sufficient for the user, this just
> > > > > provides the user with a trivial DoS vector filling logs.
> > > >
> > > > sure, will fix it. thanks.
> > > >
> > > > > > +		ret = -EPERM;
> > > > >
> > > > > But why not return PTR_ERR(pdata)?
> > > >
> > > > aha, would do it.
> > > >
> > > > > > +		goto out_unlock;
> > > > > > +	}
> > > > > > +	if (pdata) {
> > > > > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > > > > +		/* Expect PASID has no private data if not bond */
> > > > > > +		ret = -EBUSY;
> > > > > > +		goto out_unlock;
> > > > > > +	}
> > > > > > +	ioasid_free(pasid);
> > > > >
> > > > > We only ever get here with pasid == NULL?!
> > > >
> > > > I guess you meant only when pdata==NULL.
> > > >
> > > > > Something is wrong.  Should
> > > > > that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> > > > > vector)
> > > >
> > > > Oh, yes, just do it as below:
> > > >
> > > > if (!pdata) {
> > > > 	ioasid_free(pasid);
> > > > 	ret = SUCCESS;
> > > > } else
> > > > 	ret = -EBUSY;
> > > >
> > > > Is it what you mean?
> > >
> > > No, I think I was just confusing pdata and pasid, but I am still
> > > confused about testing pdata.  We call ioasid_alloc() with private =
> > > NULL, and I don't see any of your patches calling ioasid_set_data() to
> > > change the private data after allocation, so how could this ever be
> > > set?  Should this just be a BUG_ON(pdata) as the integrity of the
> > > system is in question should this state ever occur?  Thanks,
> >
> > ioasid_set_data() was called  in one patch from Jacob's vSVA patchset.
> > [PATCH v6 08/10] iommu/vt-d: Add bind guest PASID support
> > https://lkml.org/lkml/2019/10/22/946
> >
> > The basic idea is to allocate pasid with private=NULL, and set it when the
> > pasid is actually bind to a device (bind_gpasid()). Each bind_gpasid() will
> > increase the ref_cnt in the private data, and each unbind_gpasid() will
> > decrease the ref_cnt. So if bind/unbind_gpasid() is called in mirror, the
> > private data should be null when comes to free operation. If not, vfio can
> > believe that the pasid is still in use.
> 
> So this is another opportunity to leak pasids.  What's a user supposed
> to do when their attempt to free a pasid fails?  It invites leaks to
> allow this path to fail.  Thanks,

Agreed, there may be no need to fail the pasid free, since failing may leak
the pasid. How about always letting free succeed? If the ref_cnt is non-zero,
notify the remaining users to release their references.
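
Something like the following sketch is what I have in mind. It is a
user-space model with made-up names (pasid_state and its helpers), not code
from the series: free never returns an error, and the ID only goes back to
the allocator once the last bind reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-PASID state; not the ioasid private data itself. */
struct pasid_state {
	unsigned int ref_cnt;	/* one reference per outstanding bind */
	bool free_pending;	/* owner asked for free while still bound */
	bool released;		/* ID has gone back to the allocator */
};

static void pasid_bind(struct pasid_state *p)
{
	p->ref_cnt++;
}

/* Drop a bind reference; the last unbind completes a pending free. */
static void pasid_unbind(struct pasid_state *p)
{
	if (p->ref_cnt && --p->ref_cnt == 0 && p->free_pending)
		p->released = true;
}

/*
 * Free never fails. Returns true if the ID was released immediately and
 * false if the release is deferred until the remaining users unbind (this
 * is the point where they would be notified).
 */
static bool pasid_free(struct pasid_state *p)
{
	if (p->ref_cnt == 0) {
		p->released = true;
		return true;
	}
	p->free_pending = true;
	return false;
}
```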

Thanks,
Yi Liu



* RE: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-13 10:29           ` Jean-Philippe Brucker
@ 2019-11-13 11:30             ` Liu, Yi L
  2019-11-25  7:45             ` Liu, Yi L
  1 sibling, 0 replies; 32+ messages in thread
From: Liu, Yi L @ 2019-11-13 11:30 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Alex Williamson, Tian, Kevin, Raj, Ashok, kvm,
	jean-philippe.brucker, Tian, Jun J, iommu, Sun, Yi Y, Wu, Hao,
	Lu, Baolu

> From: Jean-Philippe Brucker [mailto:jean-philippe@linaro.org]
> Sent: Wednesday, November 13, 2019 6:29 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> 
> On Wed, Nov 13, 2019 at 07:43:43AM +0000, Liu, Yi L wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, November 13, 2019 1:26 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > >
> > > On Tue, 12 Nov 2019 11:21:40 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson < alex.williamson@redhat.com >
> > > > > Sent: Friday, November 8, 2019 7:21 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > > > >
> > > > > On Thu, 24 Oct 2019 08:26:23 -0400
> > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > This patch adds vfio support to bind guest translation structure
> > > > > > to host iommu. VFIO exposes iommu programming capability to user-
> > > > > > space. Guest is a user-space application in host under KVM solution.
> > > > > > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > > > > > structure. And this part should be passdown to host to enable nested
> > > > > > translation (or say two stage translation). This patch reuses the
> > > > > > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > > > > > bind type for binding guest owned translation structure to host.
> > > > > >
> > > > > > *) Add two new ioctls for VFIO containers.
> > > > > >
> > > > > >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> > > > > >                    bind a process to a pasid or bind a guest pasid
> > > > > >                    to a device, this is indicated by type
> > > > > >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> > > > > >                    unbind a process to a pasid or unbind a guest pasid
> > > > > >                    to a device, also indicated by type
> > > > > >   - Bind type:
> > > > > > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a
> process
> > > > > >                    to a device
> > > > > > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> > > > > >                    structure to host iommu. e.g. guest page table
> > > > > >
> > > > > > > *) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
> > > > > >
> > [...]
> > > > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > > > > +					    void __user *arg,
> > > > > > +					    struct vfio_iommu_type1_bind
> *bind)
> > > > > > +{
> > > > > > +	struct iommu_gpasid_bind_data gbind_data;
> > > > > > +	unsigned long minsz;
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > > > > +	if (bind->argsz < minsz)
> > > > > > +		return -EINVAL;
> > > > >
> > > > > But gbind_data can change size if new vendor specific data is added to
> > > > > the union, so kernel updates break existing userspace.  Fail.
> 
> I guess we could take minsz up to the vendor-specific data, copy @format,
> and then check the size of vendor-specific data?

Agreed.

> 
> > > >
> > > > yes, we have a version field in struct iommu_gpasid_bind_data. How
> > > > about doing sanity check per versions? kernel knows the gbind_data
> > > > size of specific versions. Does it make sense? If yes, I'll also apply it
> > > > to the other sanity check in this series to avoid userspace fail after
> > > > kernel update.
> > >
> > > Has it already been decided that the version field will be updated for
> > > every addition to the union?
> >
> > No, just my proposal. Jacob may help to explain the purpose of version
> > field. But if we may be too  "frequent" for an uapi version number updating
> > if we inc version for each change in the union part. I may vote for the
> > second option from you below.
> >
> > > It seems there are two options, either
> > > the version definition includes the possible contents of the union,
> > > which means we need to support multiple versions concurrently in the
> > > kernel to maintain compatibility with userspace and follow deprecation
> > > protocols for removing that support, or we need to consider version to
> > > be the general form of the structure and interpret the format field to
> > > determine necessary length to copy from the user.
> >
> > As I mentioned above, may be better to let @version field only over the
> > general fields and let format to cover the possible changes in union. e.g.
> > IOMMU_PASID_FORMAT_INTEL_VTD2 may means version 2 of Intel
> > VT-d bind. But either way, I think we need to let kernel maintain multiple
> > versions to support compatible userspace. e.g. may have multiple versions
> > iommu_gpasid_bind_data_vtd struct in the union part.
> 
> I couldn't find where the @version field originated in our old
> discussions, but I believe our plan for allowing future extensions was:
> 
> * Add new vendor-specific data by introducing a new format
>   (IOMMU_PASID_FORMAT_INTEL_VTD2, IOMMU_PASID_FORMAT_ARM_SMMUV2...), and
>   extend the union.
> 
> * Add a new common field, if it fits in the existing padding bytes, by
>   adding a flag (IOMMU_SVA_GPASID_*).
> 
> * Add a new common field, if it doesn't fit in the current padding bytes,
>   or completely change the structure layout, by introducing a new version
>   (IOMMU_GPASID_BIND_VERSION_2). In that case the kernel has to handle
>   both new and old structure versions. It would have both
>   iommu_gpasid_bind_data and iommu_gpasid_bind_data_v2 structs.
> 
> I think iommu_cache_invalidate_info and iommu_page_response use the same
> scheme. iommu_fault is a bit more complicated because it's
> kernel->userspace and requires some negotiation:
> https://lore.kernel.org/linux-iommu/77405d39-81a4-d9a8-5d35-27602199867a@arm.com/

Thanks for the excellent recap.

> [...]
> > > If the ioctls have similar purpose and form, then re-using a single
> > > ioctl might make sense, but BIND_PROCESS is only a place-holder in this
> > > series, which is not acceptable.  A dual purpose ioctl does not
> > > preclude that we could also use a union for the data field to make the
> > > structure well specified.
> >
> > yes, BIND_PROCESS is only a place-holder here. From kernel p.o.v., both
> > BIND_GUEST_PASID and BIND_PROCESS are bind requests from userspace.
> > So the purposes are aligned. Below is the content the @data[] field
> > supposed to convey for BIND_PROCESS. If we use union, it would leave
> > space for extending it to support BIND_PROCESS. If only data[], it is a little
> > bit confusing why we define it in such manner if BIND_PROCESS is included
> > in this series. Please feel free let me know which one suits better.
> >
> > +struct vfio_iommu_type1_bind_process {
> > +	__u32	flags;
> > +#define VFIO_IOMMU_BIND_PID		(1 << 0)
> > +	__u32	pasid;
> > +	__s32	pid;
> > +};
> > https://patchwork.kernel.org/patch/10394927/
> 
> Note that I don't plan to upstream BIND_PROCESS at the moment. It was
> useful for testing but I don't know of anyone actually needing it.

Yes, you told me during KVM Forum. But if we want to share the IOCTL, we may
need to leave room for you to extend it. If @data[] is not good, then we may
use a union.

> > > > > That bind data
> > > > > > structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How does
> > > > > a user determine what formats are accepted from within the vfio API (or
> > > > > even outside of the vfio API)?
> > > >
> > > > The info is provided by vIOMMU emulator (e.g. virtual VT-d). The vSVA patch
> > > > from Jacob has a sanity check on it.
> > > > https://lkml.org/lkml/2019/10/28/873
> > >
> > > The vIOMMU emulator runs at a layer above vfio.  How does the vIOMMU
> > > emulator know that the vfio interface supports virtual VT-d?  IMO, it's
> > > not acceptable that the user simply assume that an Intel host platform
> > > supports VT-d.  For example, consider what happens when we need to
> > > define IOMMU_PASID_FORMAT_INTEL_VTDv2.  How would the user learn that
> > > VTDv2 is supported and the original VTD format is not supported?
> >
> > I guess this may be another info VFIO_IOMMU_GET_INFO should provide.
> > It makes sense that vfio be aware of what platform it is running on. right?
> > After vfio gets the info, may let vfio fill in the format info. Is it the correct
> > direction?
> 
> I thought you were planning to put that information in sysfs?  We last
> discussed this over a year ago so I don't remember where we left it. I

Yes, we did discuss querying hardware iommu capabilities via sysfs. If we
only want to let the vIOMMU learn which format it should use, then GET_INFO
may be enough, e.g. vfio just asks its backing iommu driver: do you support
nested translation? Which format do you prefer? But I'm open on this.

> know Alex isn't keen on putting in sysfs what can be communicated through
> VFIO, but it is a convenient way to describe IOMMU features:
> http://www.linux-arm.org/git?p=linux-jpb.git;a=commitdiff;h=665370d5b5e0022c24b2d2b57975ef6fe7b40870;hp=7ce780d838889b53f5e04ba5d444520621261eda
> 
> My problem with GET_INFO was that it could be difficult to extend, and
> to describe things like variable-size list of supported page table
> formats, but I guess the new info capabilities make this easier.

Yeah, the info also needs to be generic if we want to extend it later.
As I said, I'm open on this. Please feel free to let me know if you have
other ideas.

Regards,
Yi Liu

> Thanks,
> Jean


* Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-13 11:03             ` Liu, Yi L
@ 2019-11-13 15:29               ` Alex Williamson
  2019-11-13 19:45                 ` Jacob Pan
  2019-11-18  4:50                 ` Liu, Yi L
  0 siblings, 2 replies; 32+ messages in thread
From: Alex Williamson @ 2019-11-13 15:29 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

On Wed, 13 Nov 2019 11:03:17 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, November 8, 2019 11:15 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > 
> > On Fri, 8 Nov 2019 12:23:41 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, November 8, 2019 6:07 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > >
> > > > On Wed, 6 Nov 2019 13:27:26 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [RFC v2 2/3] vfio/type1:  
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)  
> > > > > >
> > > > > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > > > > > > to passdown PASID allocation/free request from the virtual
> > > > > > > iommu. This is required to get PASID managed in system-wide.
> > > > > > >
> > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > > ---
> > > > > > > >  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
> > > > > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > > > > >  2 files changed, 139 insertions(+)
> > > > > > >
> > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > index cd8d3a5..3d73a7d 100644
> > > > > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> > > > > > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > > > > > >  }
> > > > > > >
> > > > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > > > +					 int min_pasid,
> > > > > > > +					 int max_pasid)
> > > > > > > +{
> > > > > > > +	int ret;
> > > > > > > +	ioasid_t pasid;
> > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > +
> > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > +		ret = -EINVAL;
> > > > > > > +		goto out_unlock;
> > > > > > > +	}
> > > > > > > +	mm = get_task_mm(current);
> > > > > > > +	/* Track ioasid allocation owner by mm */
> > > > > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > > > > > > +				max_pasid, NULL);  
> > > > > >
> > > > > > Are we sure we want to tie this to the task mm vs perhaps the
> > > > > > vfio_iommu pointer?  
> > > > >
> > > > > Here we want to have a kind of per-VM mark, which can be used to do
> > > > > ownership check on whether a pasid is held by a specific VM. This is
> > > > > very important to prevent across VM affect. vfio_iommu pointer is
> > > > > competent for vfio as vfio is both pasid alloc requester and pasid
> > > > > consumer. e.g. vfio requests pasid alloc from ioasid and also it will
> > > > > invoke bind_gpasid(). vfio can either check ownership before invoking
> > > > > bind_gpasid() or pass vfio_iommu pointer to iommu driver. But in future,
> > > > > there may be other modules which are just consumers of pasid. And they
> > > > > also want to do ownership check for a pasid. Then, it would be hard for
> > > > > them as they are not the pasid alloc requester. So here better to have
> > > > > a system wide structure to perform as the per-VM mark. task mm looks
> > > > > to be much competent.  
> > > >
> > > > Ok, so it's intentional to have a VM-wide token.  Elsewhere in the
> > > > type1 code (vfio_dma_do_map) we record the task_struct per dma mapping
> > > > so that we can get the task mm as needed.  Would the task_struct
> > > > pointer provide any advantage?  
> > >
> > > I think we may use task_struct pointer to make type1 code consistent.
> > > How do you think?  
> > 
> > If it has the same utility, sure.  
> 
> thanks, I'll make this change.
> 
> > > > Also, an overall question, this provides userspace with pasid alloc and
> > > > free ioctls, (1) what prevents a userspace process from consuming every
> > > > available pasid, and (2) if the process exits or crashes without
> > > > freeing pasids, how are they recovered aside from a reboot?  
> > >
> > > For question (1), I think we only need to take care about malicious
> > > userspace process. As vfio usage is under privilege mode, so we may
> > > be safe on it so far.  
> > 
> > No, where else do we ever make this assumption?  vfio requires a
> > privileged entity to configure the system for vfio, bind devices for
> > user use, and grant those devices to the user, but the usage of the
> > device is always assumed to be by an unprivileged user.  It is
> > absolutely not acceptable require a privileged user.  It's vfio's
> > responsibility to protect the system from the user.  
> 
> My assumption is not precise here. sorry for it... Maybe to further
> check with you to better understand your point. I think the user (QEMU)
> of vfio needs to have a root permission. Thus it can open the vfio fds.
> At this point, the user is a privileged one. Also I guess that's why vfio
> can grant the user with the usage of VFIO_MAP/UNMAP to config
> mappings into iommu page tables. But I'm not quite sure when will
> the user be an unprivileged one.

QEMU does NOT need to be run as root to use vfio.  This is NOT the
model libvirt follows.  libvirt grants a user access to a device, or
rather a set of one or more devices (ie. the group) via standard file
permission access to the group file (/dev/vfio/$GROUP).  Ownership of a
device allows the user permission to make use of the IOMMU.  The user's
ability to create DMA mappings is restricted by their process locked
memory limits, where libvirt elevates the user limit sufficient for the
size of the VM.  QEMU should never need to be run as root and doing so
is entirely unacceptable from a security perspective.  The only mode of
vfio that requires elevated privilege for use is when making use of
no-iommu, where we have no IOMMU protection or translation.
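
As an illustration of the locked memory limit mentioned above, an
unprivileged process can inspect its own limit through RLIMIT_MEMLOCK.
This is only a userspace sketch; the helper names round_up_to_page() and
memlock_limit_allows() are made up for this example and do not come from
QEMU or libvirt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/resource.h>

/* Round a guest RAM size up to a whole number of pages. */
static uint64_t round_up_to_page(uint64_t bytes, uint64_t page_size)
{
	/* page_size must be a non-zero power of two */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return (bytes + page_size - 1) & ~(page_size - 1);
}

/*
 * Returns true when `bytes` of pinned memory would fit under the
 * caller's soft RLIMIT_MEMLOCK, false otherwise or on error.
 */
static bool memlock_limit_allows(uint64_t bytes)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return false;
	if (rl.rlim_cur == RLIM_INFINITY)
		return true;
	return bytes <= (uint64_t)rl.rlim_cur;
}
```

A management layer would raise the limit to cover the guest size before
starting the VM; the process itself never needs root to perform the check
or the mappings.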

> > > However, we may need to introduce a kind of credit
> > > mechanism to protect it. I've thought it, but no good idea yet. Would be
> > > happy to hear from you.  
> > 
> > It's a limited system resource and it's unclear how many might
> > reasonably used by a user.  I don't have an easy answer.  
> 
> How about the below method? based on some offline chat with Jacob.
> a. some reasonable defaults for the initial per VM quota, e.g. 1000 per
> process
> b. IOASID should be able to enforce per ioasid_set (it is kind of per VM
> mark) limit

We support large numbers of assigned devices.  How many IOASIDs might be
reasonably used per device?  Is the mm or the task still the correct
"set" in this scenario?  I don't have any better ideas than setting a
limit, but it probably needs a kernel or module tunable, and it needs
to match the scaling we expect to see when multiple devices are
involved.

> > > For question (2), I think we need to reclaim the allocated pasids when
> > > the vfio container fd is released just like what vfio does to the domain
> > > mappings. I didn't add it yet. But I can add it in next version if you think
> > > it would make the pasid alloc/free be much sound.  
> > 
> > Consider it required, the interface is susceptible to abuse without it.  
> 
> sure, let me add it in next version.
> 
> > > > > > > +	if (pasid == INVALID_IOASID) {
> > > > > > > +		ret = -ENOSPC;
> > > > > > > +		goto out_unlock;
> > > > > > > +	}
> > > > > > > +	ret = pasid;
> > > > > > > +out_unlock:
> > > > > > > +	mutex_unlock(&iommu->lock);  
> > > >
> > > > What does holding this lock protect?  That the vfio_iommu remains
> > > > backed by an iommu during this operation, even though we don't do
> > > > anything to release allocated pasids when that iommu backing is removed?  
> > >
> > > yes, it is unnecessary to hold the lock here. At least for the operations in
> > > this patch. will remove it. :-)
> > >  
> > > > > > > +	if (mm)
> > > > > > > +		mmput(mm);
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > > > +				       unsigned int pasid)
> > > > > > > +{
> > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > +	void *pdata;
> > > > > > > +	int ret = 0;
> > > > > > > +
> > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > +		ret = -EINVAL;
> > > > > > > +		goto out_unlock;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	/**
> > > > > > > +	 * REVISIT:
> > > > > > > +	 * There are two cases free could fail:
> > > > > > > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > > > > > > +	 * the set does not match, caller is not permitted to free.
> > > > > > > +	 * 2. free before unbind all devices, we can check if ioasid private
> > > > > > > +	 * data, if data != NULL, then fail to free.
> > > > > > > +	 */
> > > > > > > +	mm = get_task_mm(current);
> > > > > > > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > > > > > > +	if (IS_ERR(pdata)) {
> > > > > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > > > > +			pr_err("PASID %u is not allocated\n", pasid);
> > > > > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > > > > +			pr_err("Free PASID %u by non-owner, denied",  
> > pasid);  
> > > > > > > +		else
> > > > > > > +			pr_err("Error searching PASID %u\n", pasid);  
> > > > > >
> > > > > > This should be removed, errno is sufficient for the user, this just
> > > > > > provides the user with a trivial DoS vector filling logs.  
> > > > >
> > > > > sure, will fix it. thanks.
> > > > >  
> > > > > > > +		ret = -EPERM;  
> > > > > >
> > > > > > But why not return PTR_ERR(pdata)?  
> > > > >
> > > > > aha, would do it.
> > > > >  
> > > > > > > +		goto out_unlock;
> > > > > > > +	}
> > > > > > > +	if (pdata) {
> > > > > > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > > > > > +		/* Expect PASID has no private data if not bond */
> > > > > > > +		ret = -EBUSY;
> > > > > > > +		goto out_unlock;
> > > > > > > +	}
> > > > > > > +	ioasid_free(pasid);  
> > > > > >
> > > > > > We only ever get here with pasid == NULL?!  
> > > > >
> > > > > I guess you meant only when pdata==NULL.
> > > > >  
> > > > > > Something is wrong.  Should
> > > > > > that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> > > > > > vector)  
> > > > >
> > > > > Oh, yes, just do it as below:
> > > > >
> > > > > if (!pdata) {
> > > > > 	ioasid_free(pasid);
> > > > > 	ret = SUCCESS;
> > > > > } else
> > > > > 	ret = -EBUSY;
> > > > >
> > > > > Is it what you mean?  
> > > >
> > > > No, I think I was just confusing pdata and pasid, but I am still
> > > > confused about testing pdata.  We call ioasid_alloc() with private =
> > > > NULL, and I don't see any of your patches calling ioasid_set_data() to
> > > > change the private data after allocation, so how could this ever be
> > > > set?  Should this just be a BUG_ON(pdata) as the integrity of the
> > > > system is in question should this state ever occur?  Thanks,  
> > >
> > > ioasid_set_data() was called  in one patch from Jacob's vSVA patchset.
> > > [PATCH v6 08/10] iommu/vt-d: Add bind guest PASID support
> > > https://lkml.org/lkml/2019/10/22/946
> > >
> > > The basic idea is to allocate pasid with private=NULL, and set it when the
> > > pasid is actually bind to a device (bind_gpasid()). Each bind_gpasid() will
> > > increase the ref_cnt in the private data, and each unbind_gpasid() will
> > > decrease the ref_cnt. So if bind/unbind_gpasid() is called in mirror, the
> > > private data should be null when comes to free operation. If not, vfio can
> > > believe that the pasid is still in use.  
> > 
> > So this is another opportunity to leak pasids.  What's a user supposed
> > to do when their attempt to free a pasid fails?  It invites leaks to
> > allow this path to fail.  Thanks,  
> 
> Agreed, may no need to fail pasid free as it may leak pasid. How about
> always let free successful? If the ref_cnt is non-zero, notify the remaining
> users to release their reference.

If a user frees a PASID, they've done their due diligence in
indicating it's no longer used.  The kernel should handle reclaiming it
from that point.  Thanks,

Alex



* Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-13 15:29               ` Alex Williamson
@ 2019-11-13 19:45                 ` Jacob Pan
  2019-11-25  8:32                   ` Liu, Yi L
  2019-11-18  4:50                 ` Liu, Yi L
  1 sibling, 1 reply; 32+ messages in thread
From: Jacob Pan @ 2019-11-13 19:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, eric.auger, Tian, Kevin, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm,
	jacob.jun.pan

On Wed, 13 Nov 2019 08:29:40 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 13 Nov 2019 11:03:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, November 8, 2019 11:15 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 2/3] vfio/type1:
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > 
> > > On Fri, 8 Nov 2019 12:23:41 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >     
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, November 8, 2019 6:07 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [RFC v2 2/3] vfio/type1:
> > > > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > >
> > > > > On Wed, 6 Nov 2019 13:27:26 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >    
> > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [RFC v2 2/3] vfio/type1:    
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)    
> > > > > > >
> > > > > > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > >    
> > > > > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which
> > > > > > > > aims to passdown PASID allocation/free request from the
> > > > > > > > virtual iommu. This is required to get PASID managed in
> > > > > > > > system-wide.
> > > > > > > >
> > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > > > ---
> > > > > > > > >  drivers/vfio/vfio_iommu_type1.c | 114 ++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > > > > > >  2 files changed, 139 insertions(+)
> > > > > > > >
> > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > index cd8d3a5..3d73a7d 100644
> > > > > > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> > > > > > > > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > > > > +					 int min_pasid,
> > > > > > > > +					 int max_pasid)
> > > > > > > > +{
> > > > > > > > +	int ret;
> > > > > > > > +	ioasid_t pasid;
> > > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > > +
> > > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > > +		ret = -EINVAL;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	mm = get_task_mm(current);
> > > > > > > > +	/* Track ioasid allocation owner by mm */
> > > > > > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm,
> > > > > > > > min_pasid,
> > > > > > > > +				max_pasid, NULL);    
> > > > > > >
> > > > > > > Are we sure we want to tie this to the task mm vs perhaps
> > > > > > > the vfio_iommu pointer?    
> > > > > >
> > > > > > Here we want to have a kind of per-VM mark, which can be
> > > > > > used to do ownership check on whether a pasid is held by a
> > > > > > specific VM. This is very important to prevent across VM
> > > > > > affect. vfio_iommu pointer is competent for vfio as vfio is
> > > > > > both pasid alloc requester and pasid consumer. e.g. vfio
> > > > > > requests pasid alloc from ioasid and also it will invoke
> > > > > > bind_gpasid(). vfio can either check ownership before
> > > > > > invoking bind_gpasid() or pass vfio_iommu pointer to iommu
> > > > > > driver. But in future, there may be other modules which are
> > > > > > just consumers of pasid. And they also want to do ownership
> > > > > > check for a pasid. Then, it would be hard for them as they
> > > > > > are not the pasid alloc requester. So here better to have a
> > > > > > system wide structure to perform as the per-VM mark. task
> > > > > > mm looks to be much competent.    
> > > > >
> > > > > Ok, so it's intentional to have a VM-wide token.  Elsewhere
> > > > > in the type1 code (vfio_dma_do_map) we record the task_struct
> > > > > per dma mapping so that we can get the task mm as needed.
> > > > > Would the task_struct pointer provide any advantage?    
> > > >
> > > > I think we may use task_struct pointer to make type1 code
> > > > consistent. How do you think?    
> > > 
> > > If it has the same utility, sure.    
> > 
> > thanks, I'll make this change.
> >   
> > > > > Also, an overall question, this provides userspace with pasid
> > > > > alloc and free ioctls, (1) what prevents a userspace process
> > > > > from consuming every available pasid, and (2) if the process
> > > > > exits or crashes without freeing pasids, how are they
> > > > > recovered aside from a reboot?    
> > > >
> > > > For question (1), I think we only need to take care about
> > > > malicious userspace process. As vfio usage is under privilege
> > > > mode, so we may be safe on it so far.    
> > > 
> > > No, where else do we ever make this assumption?  vfio requires a
> > > privileged entity to configure the system for vfio, bind devices
> > > for user use, and grant those devices to the user, but the usage
> > > of the device is always assumed to be by an unprivileged user.
> > > It is absolutely not acceptable require a privileged user.  It's
> > > vfio's responsibility to protect the system from the user.    
> > 
> > My assumption is not precise here. sorry for it... Maybe to further
> > check with you to better understand your point. I think the user
> > (QEMU) of vfio needs to have a root permission. Thus it can open
> > the vfio fds. At this point, the user is a privileged one. Also I
> > guess that's why vfio can grant the user with the usage of
> > VFIO_MAP/UNMAP to config mappings into iommu page tables. But I'm
> > not quite sure when will the user be an unprivileged one.  
> 
> QEMU does NOT need to be run as root to use vfio.  This is NOT the
> model libvirt follows.  libvirt grants a user access to a device, or
> rather a set of one or more devices (ie. the group) via standard file
> permission access to the group file (/dev/vfio/$GROUP).  Ownership of
> a device allows the user permission to make use of the IOMMU.  The
> user's ability to create DMA mappings is restricted by their process
> locked memory limits, where libvirt elevates the user limit
> sufficient for the size of the VM.  QEMU should never need to be run
> as root and doing so is entirely unacceptable from a security
> perspective.  The only mode of vfio that requires elevated privilege
> for use is when making use of no-iommu, where we have no IOMMU
> protection or translation.
> 
> > > > However, we may need to introduce a kind of credit
> > > > mechanism to protect it. I've thought it, but no good idea yet.
> > > > Would be happy to hear from you.    
> > > 
> > > It's a limited system resource and it's unclear how many might
> > > reasonably used by a user.  I don't have an easy answer.    
> > 
> > How about the below method? based on some offline chat with Jacob.
> > a. some reasonable defaults for the initial per VM quota, e.g. 1000
> > per process
> > b. IOASID should be able to enforce per ioasid_set (it is kind of
> > per VM mark) limit  
> 
> We support large numbers of assigned devices, how many IOASIDs might
> be reasonably used per device?  Is the mm or the task still the
> correct "set" in this scenario?  I don't have any better ideas than
> setting a limit, but it probably needs a kernel or module tunable,
> and it needs to match the scaling we expect to see when multiple
> devices are involved.
> 
I think mm/task is still the correct set, in that we try to prevent
abuse per mm, not per device. Otherwise we need some notion of a super
container (which Ashok proposed a while ago) that maps to a VM.

I am guessing you are suggesting that the per-mm quota should also
scale with the number of devices assigned. I think that is very
reasonable. Perhaps we can do:

1. A tunable per-iommu-group PASID quota with a default of 1000, e.g.
nr_pasid_per_group. Since we are dealing with nested translation and
each device has its own second level, we pretty much have one device
per group.

2. Limit the number of PASIDs per VM to nr_pasid_per_group * nr_groups.
Probably update the limit when a group is added to a container with the
same mm.

I guess we could also use a cgroup controller for PASIDs.
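
To make the two steps above concrete, here is a small userspace model of
the accounting. It is only a sketch under assumptions: struct pasid_quota
and the pasid_quota_*() helpers are hypothetical names, not existing
IOASID or VFIO interfaces, and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Default suggested in this thread for the per-group tunable. */
#define NR_PASID_PER_GROUP_DEFAULT 1000

/* HYPOTHETICAL per-mm (per-VM) accounting record. */
struct pasid_quota {
	unsigned int nr_pasid_per_group; /* tunable */
	unsigned int nr_groups;          /* groups in containers of this mm */
	unsigned int nr_allocated;       /* PASIDs currently charged */
};

/* Step 2: the per-VM limit scales with the number of groups. */
static unsigned long pasid_quota_limit(const struct pasid_quota *q)
{
	return (unsigned long)q->nr_pasid_per_group * q->nr_groups;
}

/* Called when a group is added to a container owned by the same mm. */
static void pasid_quota_add_group(struct pasid_quota *q)
{
	q->nr_groups++;
}

/* Charge one PASID; returns 0 on success or -ENOSPC when over quota. */
static int pasid_quota_charge(struct pasid_quota *q)
{
	assert(q != NULL);
	if (q->nr_allocated >= pasid_quota_limit(q))
		return -ENOSPC;
	q->nr_allocated++;
	return 0;
}

/* Return one PASID to the quota, e.g. on free or container release. */
static void pasid_quota_uncharge(struct pasid_quota *q)
{
	if (q->nr_allocated > 0)
		q->nr_allocated--;
}
```

With no group attached the limit is zero, so allocation fails until a
device is actually assigned, which matches the intent of scaling the quota
with the number of devices.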

> > > > For question (2), I think we need to reclaim the allocated
> > > > pasids when the vfio container fd is released just like what
> > > > vfio does to the domain mappings. I didn't add it yet. But I
> > > > can add it in next version if you think it would make the pasid
> > > > alloc/free be much sound.    
> > > 
> > > Consider it required, the interface is susceptible to abuse
> > > without it.    
> > 
> > sure, let me add it in next version.
> >   
> > > > > > > > +	if (pasid == INVALID_IOASID) {
> > > > > > > > +		ret = -ENOSPC;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	ret = pasid;
> > > > > > > > +out_unlock:
> > > > > > > > +	mutex_unlock(&iommu->lock);    
> > > > >
> > > > > What does holding this lock protect?  That the vfio_iommu
> > > > > remains backed by an iommu during this operation, even though
> > > > > we don't do anything to release allocated pasids when that
> > > > > iommu backing is removed?    
> > > >
> > > > yes, it is unnecessary to hold the lock here. At least for the
> > > > operations in this patch. will remove it. :-)
> > > >    
> > > > > > > > +	if (mm)
> > > > > > > > +		mmput(mm);
> > > > > > > > +	return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > > > > > +				       unsigned int pasid)
> > > > > > > > > +{
> > > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > > +	void *pdata;
> > > > > > > > +	int ret = 0;
> > > > > > > > +
> > > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > > +		ret = -EINVAL;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	/**
> > > > > > > > +	 * REVISIT:
> > > > > > > > +	 * There are two cases free could fail:
> > > > > > > > +	 * 1. free pasid by non-owner, we use
> > > > > > > > ioasid_set to track mm, if
> > > > > > > > +	 * the set does not match, caller is not
> > > > > > > > permitted to free.
> > > > > > > > +	 * 2. free before unbind all devices, we can
> > > > > > > > check if ioasid private
> > > > > > > > +	 * data, if data != NULL, then fail to free.
> > > > > > > > +	 */
> > > > > > > > +	mm = get_task_mm(current);
> > > > > > > > +	pdata = ioasid_find((struct ioasid_set *)mm,
> > > > > > > > pasid, NULL);
> > > > > > > > +	if (IS_ERR(pdata)) {
> > > > > > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > > > > > +			pr_err("PASID %u is not
> > > > > > > > allocated\n", pasid);
> > > > > > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > > > > > +			pr_err("Free PASID %u by
> > > > > > > > non-owner, denied",    
> > > pasid);    
> > > > > > > > +		else
> > > > > > > > +			pr_err("Error searching PASID
> > > > > > > > %u\n", pasid);    
> > > > > > >
> > > > > > > This should be removed, errno is sufficient for the user,
> > > > > > > this just provides the user with a trivial DoS vector
> > > > > > > filling logs.    
> > > > > >
> > > > > > sure, will fix it. thanks.
> > > > > >    
> > > > > > > > +		ret = -EPERM;    
> > > > > > >
> > > > > > > But why not return PTR_ERR(pdata)?    
> > > > > >
> > > > > > aha, would do it.
> > > > > >    
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	if (pdata) {
> > > > > > > > +		pr_debug("Cannot free pasid %d with
> > > > > > > > private data\n", pasid);
> > > > > > > > +		/* Expect PASID has no private data if
> > > > > > > > not bond */
> > > > > > > > +		ret = -EBUSY;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	ioasid_free(pasid);    
> > > > > > >
> > > > > > > We only ever get here with pasid == NULL?!    
> > > > > >
> > > > > > I guess you meant only when pdata==NULL.
> > > > > >    
> > > > > > > Something is wrong.  Should
> > > > > > > that be 'if (!pdata)'?  (which also makes that pr_debug
> > > > > > > another DoS vector)    
> > > > > >
> > > > > > Oh, yes, just do it as below:
> > > > > >
> > > > > > if (!pdata) {
> > > > > > 	ioasid_free(pasid);
> > > > > > 	ret = SUCCESS;
> > > > > > } else
> > > > > > 	ret = -EBUSY;
> > > > > >
> > > > > > Is it what you mean?    
> > > > >
> > > > > No, I think I was just confusing pdata and pasid, but I am
> > > > > still confused about testing pdata.  We call ioasid_alloc()
> > > > > with private = NULL, and I don't see any of your patches
> > > > > calling ioasid_set_data() to change the private data after
> > > > > allocation, so how could this ever be set?  Should this just
> > > > > be a BUG_ON(pdata) as the integrity of the system is in
> > > > > question should this state ever occur?  Thanks,    
> > > >
> > > > ioasid_set_data() was called  in one patch from Jacob's vSVA
> > > > patchset. [PATCH v6 08/10] iommu/vt-d: Add bind guest PASID
> > > > support https://lkml.org/lkml/2019/10/22/946
> > > >
> > > > The basic idea is to allocate pasid with private=NULL, and set
> > > > it when the pasid is actually bind to a device (bind_gpasid()).
> > > > Each bind_gpasid() will increase the ref_cnt in the private
> > > > data, and each unbind_gpasid() will decrease the ref_cnt. So if
> > > > bind/unbind_gpasid() is called in mirror, the private data
> > > > should be null when comes to free operation. If not, vfio can
> > > > believe that the pasid is still in use.    
> > > 
> > > So this is another opportunity to leak pasids.  What's a user
> > > supposed to do when their attempt to free a pasid fails?  It
> > > invites leaks to allow this path to fail.  Thanks,    
> > 
> > Agreed, may no need to fail pasid free as it may leak pasid. How
> > about always let free successful? If the ref_cnt is non-zero,
> > notify the remaining users to release their reference.  
> 
> If a user frees an PASID, they've done their due diligence in
> indicating it's no longer used.  The kernel should handle reclaiming
> it from that point.  Thanks,

Yeah, I think we can add an atomic notifier for each PASID.
Consumers such as the IOMMU driver and KVM get notified when an IOASID
is freed by VFIO. The IOMMU driver can then do the unbind and teardown.

If all the users have already done unbind() before ioasid_free(), the
free will proceed as usual.
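
Roughly, the free path could look like the userspace model below. This is
a sketch of the idea only, not the kernel notifier API: struct pasid_entry,
pasid_register_consumer(), pasid_free() and unbind_on_free() are
hypothetical names, and a real implementation would use an atomic notifier
chain with proper locking.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONSUMERS 4

/* Callback invoked on each registered consumer when a PASID is freed. */
typedef void (*pasid_free_cb)(unsigned int pasid, void *ctx);

struct pasid_consumer {
	pasid_free_cb cb;
	void *ctx;
};

/* HYPOTHETICAL per-PASID record carrying its own notifier list. */
struct pasid_entry {
	unsigned int pasid;
	int allocated;
	size_t nr_consumers;
	struct pasid_consumer consumers[MAX_CONSUMERS];
};

/* Register a consumer (e.g. IOMMU driver, KVM); -1 if the list is full. */
static int pasid_register_consumer(struct pasid_entry *e, pasid_free_cb cb,
				   void *ctx)
{
	if (e->nr_consumers == MAX_CONSUMERS)
		return -1;
	e->consumers[e->nr_consumers].cb = cb;
	e->consumers[e->nr_consumers].ctx = ctx;
	e->nr_consumers++;
	return 0;
}

/*
 * Free never fails: every consumer is told to drop its references,
 * then the entry is released.
 */
static void pasid_free(struct pasid_entry *e)
{
	size_t i;

	assert(e != NULL);
	if (!e->allocated)
		return;
	for (i = 0; i < e->nr_consumers; i++)
		e->consumers[i].cb(e->pasid, e->consumers[i].ctx);
	e->nr_consumers = 0;
	e->allocated = 0;
}

/* Example consumer: drop all outstanding binds counted in *ctx. */
static void unbind_on_free(unsigned int pasid, void *ctx)
{
	unsigned int *nr_bound = ctx;

	(void)pasid;
	*nr_bound = 0;
}
```

If every consumer has already unbound, the callbacks have nothing to do
and the free proceeds as usual; otherwise the remaining references are
torn down by the kernel side instead of failing the user's request.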

> Alex
>


* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-13 15:29               ` Alex Williamson
  2019-11-13 19:45                 ` Jacob Pan
@ 2019-11-18  4:50                 ` Liu, Yi L
  1 sibling, 0 replies; 32+ messages in thread
From: Liu, Yi L @ 2019-11-18  4:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe.brucker, peterx, iommu, kvm

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, November 13, 2019 11:30 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Wed, 13 Nov 2019 11:03:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Friday, November 8, 2019 11:15 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > On Fri, 8 Nov 2019 12:23:41 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Friday, November 8, 2019 6:07 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [RFC v2 2/3] vfio/type1:
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > >
> > > > > On Wed, 6 Nov 2019 13:27:26 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [RFC v2 2/3] vfio/type1:
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > > > >
> > > > > > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > >
> > > > > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims
> > > > > > > > to passdown PASID allocation/free request from the virtual
> > > > > > > > iommu. This is required to get PASID managed in system-wide.
> > > > > > > >
> > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > > > ---
> > > > > > > >  drivers/vfio/vfio_iommu_type1.c | 114
> > > > > > > ++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > > > > > >  2 files changed, 139 insertions(+)
> > > > > > > >
> > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > index cd8d3a5..3d73a7d 100644
> > > > > > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > @@ -2248,6 +2248,83 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> > > > > > > >  	return iommu_cache_invalidate(dc->domain, dev, &ustruct->info);
> > > > > > > >  }
> > > > > > > >
> > > > > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > > > > +					 int min_pasid,
> > > > > > > > +					 int max_pasid)
> > > > > > > > +{
> > > > > > > > +	int ret;
> > > > > > > > +	ioasid_t pasid;
> > > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > > +
> > > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > > +		ret = -EINVAL;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	mm = get_task_mm(current);
> > > > > > > > +	/* Track ioasid allocation owner by mm */
> > > > > > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm, min_pasid,
> > > > > > > > +				max_pasid, NULL);
> > > > > > >
> > > > > > > Are we sure we want to tie this to the task mm vs perhaps the
> > > > > > > vfio_iommu pointer?
> > > > > >
> > > > > > Here we want to have a kind of per-VM mark, which can be used to do
> > > > > > ownership check on whether a pasid is held by a specific VM. This is
> > > > > > very important to prevent across VM affect. vfio_iommu pointer is
> > > > > > competent for vfio as vfio is both pasid alloc requester and pasid
> > > > > > consumer. e.g. vfio requests pasid alloc from ioasid and also it will
> > > > > > invoke bind_gpasid(). vfio can either check ownership before invoking
> > > > > > bind_gpasid() or pass vfio_iommu pointer to iommu driver. But in future,
> > > > > > there may be other modules which are just consumers of pasid. And they
> > > > > > also want to do ownership check for a pasid. Then, it would be hard for
> > > > > > them as they are not the pasid alloc requester. So here better to have
> > > > > > a system wide structure to perform as the per-VM mark. task mm looks
> > > > > > to be much competent.
> > > > >
> > > > > Ok, so it's intentional to have a VM-wide token.  Elsewhere in the
> > > > > type1 code (vfio_dma_do_map) we record the task_struct per dma mapping
> > > > > so that we can get the task mm as needed.  Would the task_struct
> > > > > pointer provide any advantage?
> > > >
> > > > I think we may use task_struct pointer to make type1 code consistent.
> > > > How do you think?
> > >
> > > If it has the same utility, sure.
> >
> > thanks, I'll make this change.
> >
> > > > > Also, an overall question, this provides userspace with pasid alloc and
> > > > > free ioctls, (1) what prevents a userspace process from consuming every
> > > > > available pasid, and (2) if the process exits or crashes without
> > > > > freeing pasids, how are they recovered aside from a reboot?
> > > >
> > > > For question (1), I think we only need to take care about malicious
> > > > userspace process. As vfio usage is under privilege mode, so we may
> > > > be safe on it so far.
> > >
> > > No, where else do we ever make this assumption?  vfio requires a
> > > privileged entity to configure the system for vfio, bind devices for
> > > user use, and grant those devices to the user, but the usage of the
> > > device is always assumed to be by an unprivileged user.  It is
> > > absolutely not acceptable require a privileged user.  It's vfio's
> > > responsibility to protect the system from the user.
> >
> > My assumption is not precise here. sorry for it... Maybe to further
> > check with you to better understand your point. I think the user (QEMU)
> > of vfio needs to have a root permission. Thus it can open the vfio fds.
> > At this point, the user is a privileged one. Also I guess that's why vfio
> > can grant the user with the usage of VFIO_MAP/UNMAP to config
> > mappings into iommu page tables. But I'm not quite sure when will
> > the user be an unprivileged one.
> 
> QEMU does NOT need to be run as root to use vfio.  This is NOT the
> model libvirt follows.  libvirt grants a user access to a device, or
> rather a set of one or more devices (ie. the group) via standard file
> permission access to the group file (/dev/vfio/$GROUP).  Ownership of a
> device allows the user permission to make use of the IOMMU.  The user's
> ability to create DMA mappings is restricted by their process locked
> memory limits, where libvirt elevates the user limit sufficient for the
> size of the VM.  QEMU should never need to be run as root and doing so
> is entirely unacceptable from a security perspective.  The only mode of
> vfio that requires elevated privilege for use is when making use of
> no-iommu, where we have no IOMMU protection or translation.

Got it. Thanks for the detailed explanation.

> > > > However, we may need to introduce a kind of credit
> > > > mechanism to protect it. I've thought it, but no good idea yet. Would be
> > > > happy to hear from you.
> > >
> > > It's a limited system resource and it's unclear how many might
> > > reasonably used by a user.  I don't have an easy answer.
> >
> > How about the below method? based on some offline chat with Jacob.
> > a. some reasonable defaults for the initial per VM quota, e.g. 1000 per
> > process
> > b. IOASID should be able to enforce per ioasid_set (it is kind of per VM
> > mark) limit
> 
> We support large numbers of assigned devices, how many IOASIDs might be
> reasonably used per device?  Is the mm or the task still the correct
> "set" in this scenario?  I don't have any better ideas than setting a
> limit, but it probably needs a kernel or module tunable, and it needs
> to match the scaling we expect to see when multiple devices are
> involved.

How about Jacob's proposal in his reply?

> > > > For question (2), I think we need to reclaim the allocated pasids when
> > > > the vfio container fd is released just like what vfio does to the domain
> > > > mappings. I didn't add it yet. But I can add it in next version if you think
> > > > it would make the pasid alloc/free be much sound.
> > >
> > > Consider it required, the interface is susceptible to abuse without it.
> >
> > sure, let me add it in next version.
> >
> > > > > > > > +	if (pasid == INVALID_IOASID) {
> > > > > > > > +		ret = -ENOSPC;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	ret = pasid;
> > > > > > > > +out_unlock:
> > > > > > > > +	mutex_unlock(&iommu->lock);
> > > > >
> > > > > What does holding this lock protect?  That the vfio_iommu remains
> > > > > backed by an iommu during this operation, even though we don't do
> > > > > anything to release allocated pasids when that iommu backing is removed?
> > > >
> > > > yes, it is unnecessary to hold the lock here. At least for the operations in
> > > > this patch. will remove it. :-)
> > > >
> > > > > > > > +	if (mm)
> > > > > > > > +		mmput(mm);
> > > > > > > > +	return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > > > > +				       unsigned int pasid)
> > > > > > > > +{
> > > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > > +	void *pdata;
> > > > > > > > +	int ret = 0;
> > > > > > > > +
> > > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > > +		ret = -EINVAL;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	/**
> > > > > > > > +	 * REVISIT:
> > > > > > > > +	 * There are two cases free could fail:
> > > > > > > > +	 * 1. free pasid by non-owner, we use ioasid_set to track mm, if
> > > > > > > > +	 * the set does not match, caller is not permitted to free.
> > > > > > > > +	 * 2. free before unbind all devices, we can check if ioasid private
> > > > > > > > +	 * data, if data != NULL, then fail to free.
> > > > > > > > +	 */
> > > > > > > > +	mm = get_task_mm(current);
> > > > > > > > +	pdata = ioasid_find((struct ioasid_set *)mm, pasid, NULL);
> > > > > > > > +	if (IS_ERR(pdata)) {
> > > > > > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > > > > > +			pr_err("PASID %u is not allocated\n", pasid);
> > > > > > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > > > > > +			pr_err("Free PASID %u by non-owner, denied",
> > > pasid);
> > > > > > > > +		else
> > > > > > > > +			pr_err("Error searching PASID %u\n", pasid);
> > > > > > >
> > > > > > > This should be removed, errno is sufficient for the user, this just
> > > > > > > provides the user with a trivial DoS vector filling logs.
> > > > > >
> > > > > > sure, will fix it. thanks.
> > > > > >
> > > > > > > > +		ret = -EPERM;
> > > > > > >
> > > > > > > But why not return PTR_ERR(pdata)?
> > > > > >
> > > > > > aha, would do it.
> > > > > >
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	if (pdata) {
> > > > > > > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > > > > > > +		/* Expect PASID has no private data if not bond */
> > > > > > > > +		ret = -EBUSY;
> > > > > > > > +		goto out_unlock;
> > > > > > > > +	}
> > > > > > > > +	ioasid_free(pasid);
> > > > > > >
> > > > > > > We only ever get here with pasid == NULL?!
> > > > > >
> > > > > > I guess you meant only when pdata==NULL.
> > > > > >
> > > > > > > Something is wrong.  Should
> > > > > > > that be 'if (!pdata)'?  (which also makes that pr_debug another DoS
> > > > > > > vector)
> > > > > >
> > > > > > Oh, yes, just do it as below:
> > > > > >
> > > > > > if (!pdata) {
> > > > > > 	ioasid_free(pasid);
> > > > > > 	ret = SUCCESS;
> > > > > > } else
> > > > > > 	ret = -EBUSY;
> > > > > >
> > > > > > Is it what you mean?
> > > > >
> > > > > No, I think I was just confusing pdata and pasid, but I am still
> > > > > confused about testing pdata.  We call ioasid_alloc() with private =
> > > > > NULL, and I don't see any of your patches calling ioasid_set_data() to
> > > > > change the private data after allocation, so how could this ever be
> > > > > set?  Should this just be a BUG_ON(pdata) as the integrity of the
> > > > > system is in question should this state ever occur?  Thanks,
> > > >
> > > > ioasid_set_data() was called  in one patch from Jacob's vSVA patchset.
> > > > [PATCH v6 08/10] iommu/vt-d: Add bind guest PASID support
> > > > https://lkml.org/lkml/2019/10/22/946
> > > >
> > > > The basic idea is to allocate pasid with private=NULL, and set it when the
> > > > pasid is actually bind to a device (bind_gpasid()). Each bind_gpasid() will
> > > > increase the ref_cnt in the private data, and each unbind_gpasid() will
> > > > decrease the ref_cnt. So if bind/unbind_gpasid() is called in mirror, the
> > > > private data should be null when comes to free operation. If not, vfio can
> > > > believe that the pasid is still in use.
> > >
> > > So this is another opportunity to leak pasids.  What's a user supposed
> > > to do when their attempt to free a pasid fails?  It invites leaks to
> > > allow this path to fail.  Thanks,
> >
> > Agreed, may no need to fail pasid free as it may leak pasid. How about
> > always let free successful? If the ref_cnt is non-zero, notify the remaining
> > users to release their reference.
> 
> If a user frees an PASID, they've done their due diligence in
> indicating it's no longer used.  The kernel should handle reclaiming it
> from that point.  Thanks,

Yes, I've aligned with Jacob offline. We will free the PASID whenever it is
requested, and the free will not fail. Jacob will help add notifications in
ioasid.
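
To make the intended semantics concrete, here is a rough userspace sketch
of "free never fails": the free path notifies every remaining user so that
it can drop its reference, then releases the PASID unconditionally. All
sketch_* names and the fixed-size user table are hypothetical and only for
illustration; this is not the ioasid API, and the real notification
mechanism is still to be designed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_USERS 4	/* hypothetical bound, illustration only */

/* Callback a PASID consumer registers so it can be told to let go. */
typedef void (*sketch_release_cb)(unsigned int pasid, void *ctx);

struct sketch_pasid {
	unsigned int id;
	bool allocated;
	unsigned int nr_users;
	struct {
		sketch_release_cb cb;
		void *ctx;
	} users[SKETCH_MAX_USERS];
};

/* Record one consumer (e.g. one bind) of an allocated PASID. */
static int sketch_pasid_bind(struct sketch_pasid *p, sketch_release_cb cb,
			     void *ctx)
{
	assert(cb != NULL);
	if (!p->allocated)
		return -EINVAL;
	if (p->nr_users == SKETCH_MAX_USERS)
		return -ENOSPC;
	p->users[p->nr_users].cb = cb;
	p->users[p->nr_users].ctx = ctx;
	p->nr_users++;
	return 0;
}

/*
 * Free has no failure path: remaining users are notified and dropped,
 * then the PASID is marked free regardless of its earlier state.
 */
static void sketch_pasid_free(struct sketch_pasid *p)
{
	while (p->nr_users > 0) {
		p->nr_users--;
		p->users[p->nr_users].cb(p->id, p->users[p->nr_users].ctx);
	}
	p->allocated = false;
}

/* Example consumer callback: counts how many releases it has seen. */
static void sketch_count_release(unsigned int pasid, void *ctx)
{
	(void)pasid;
	(*(unsigned int *)ctx)++;
}
```

With this shape userspace has nothing to retry after a free, so a failed
free can no longer turn into a leaked PASID.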

> Alex

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-13 10:29           ` Jean-Philippe Brucker
  2019-11-13 11:30             ` Liu, Yi L
@ 2019-11-25  7:45             ` Liu, Yi L
  2019-12-03  0:11               ` Alex Williamson
  1 sibling, 1 reply; 32+ messages in thread
From: Liu, Yi L @ 2019-11-25  7:45 UTC (permalink / raw)
  To: Alex Williamson, Jean-Philippe Brucker, eric.auger, jacob.jun.pan
  Cc: Tian, Kevin, Raj, Ashok, kvm, Tian, Jun J, iommu, Sun, Yi Y, Wu,
	Hao, Lu, Baolu

Hi Alex,

Thanks for the review. Here I'd like to summarize the major open questions
in this thread and see if we can reach agreement before preparing a new
version.

a) IOCTLs for BIND_GPASID and BIND_PROCESS: share a single IOCTL, or use
   two separate IOCTLs?
   Yi: Separate IOCTLs may be better. The bind data conveyed for
   BIND_GPASID and BIND_PROCESS is totally different, and struct
   iommu_gpasid_bind_data carries vendor-specific data and may gain more
   versions in future. Separate IOCTLs for the two bind types should be
   easier to maintain. The structure for BIND_GPASID is as below:

        struct vfio_iommu_type1_bind {
                __u32                           argsz;
                struct iommu_gpasid_bind_data   bind_data;
        };

b) How does kernel space learn the number of bytes to copy (a.k.a. the
   usage of the @version and @format fields of struct
   iommu_gpasid_bind_data)?
   Yi: Jean has an excellent recap in his prior reply of the plan for
   future extensions with regard to the @version and @format fields.
   Based on that plan, kernel space needs to parse @version and @format
   to get the length of the current BIND_GPASID request. The kernel also
   needs to maintain both the new and old structure versions, and follow
   a specific deprecation policy in future.

c) How can the vIOMMU emulator know that the vfio interface supports
   configuring dual stage translation for the vIOMMU?
   Yi: may do it via VFIO_IOMMU_GET_INFO.

d) How can the vIOMMU emulator know which @version and @format should be
   set in struct iommu_gpasid_bind_data?
   Yi: Currently we have two options. The first is to do it via
   VFIO_IOMMU_GET_INFO. This is a natural idea, as @version and @format
   are used in the vfio APIs; it makes sense for vfio to provide the
   related info to the vIOMMU emulator after checking with the
   vendor-specific iommu driver. The second is to do it via sysfs
   (/sys/class/iommu/dmar#), as we plan to do the IOMMU capability sync
   between vIOMMU and pIOMMU via sysfs. I have a concern with the sysfs
   option: the current iommu sysfs only provides vendor-specific hardware
   info, and I'm not sure it is good to expose info defined in the
   generic IOMMU layer there. If this concern is not a big deal, I'm fine
   with both options.
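
For b), the length handling could look roughly like the sketch below,
following Jean's suggestion of taking minsz up to the vendor-specific data,
reading @format, and then checking the size of the vendor-specific part.
This is a userspace illustration only: the sketch_* names, the field
layout, and the format values are hypothetical and are not the real struct
iommu_gpasid_bind_data, and memcpy() stands in for copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values and layout, for illustration only. */
#define SKETCH_BIND_VERSION_1	1
#define SKETCH_FORMAT_VTD	1
#define SKETCH_FORMAT_SMMU	2

struct sketch_bind_vtd {
	uint64_t flags;
	uint32_t pat;
	uint32_t emt;
};

struct sketch_bind_smmu {
	uint64_t flags;
};

struct sketch_bind_data {
	uint32_t version;	/* general form of the structure */
	uint32_t format;	/* selects the member of @vendor */
	uint64_t gpgd;
	union {
		struct sketch_bind_vtd vtd;
		struct sketch_bind_smmu smmu;
	} vendor;
};

/* Size of the vendor-specific part for a format, or -EINVAL if unknown. */
static long sketch_vendor_size(uint32_t format)
{
	switch (format) {
	case SKETCH_FORMAT_VTD:
		return sizeof(struct sketch_bind_vtd);
	case SKETCH_FORMAT_SMMU:
		return sizeof(struct sketch_bind_smmu);
	default:
		return -EINVAL;
	}
}

/*
 * Copy the common header first, then only as many vendor bytes as
 * @format requires. Returns the number of bytes consumed or -EINVAL.
 */
static long sketch_copy_bind_data(struct sketch_bind_data *dst,
				  const void *src, size_t argsz)
{
	const size_t hdr = offsetof(struct sketch_bind_data, vendor);
	long vsz;

	if (argsz < hdr)
		return -EINVAL;
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, hdr);

	if (dst->version != SKETCH_BIND_VERSION_1)
		return -EINVAL;
	vsz = sketch_vendor_size(dst->format);
	if (vsz < 0)
		return vsz;
	assert((size_t)vsz <= sizeof(dst->vendor));
	if (argsz < hdr + (size_t)vsz)
		return -EINVAL;

	memcpy(&dst->vendor, (const char *)src + hdr, (size_t)vsz);
	return (long)(hdr + (size_t)vsz);
}
```

The point of the sketch is that growing the union for a new format does
not change the minimum size expected from userspace that still uses an
old format.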

Thoughts? Would also be happy to know more from you guys.

Regards,
Yi Liu

> From: Jean-Philippe Brucker [mailto:jean-philippe@linaro.org]
> Sent: Wednesday, November 13, 2019 6:29 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> 
> On Wed, Nov 13, 2019 at 07:43:43AM +0000, Liu, Yi L wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, November 13, 2019 1:26 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > >
> > > On Tue, 12 Nov 2019 11:21:40 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson < alex.williamson@redhat.com >
> > > > > Sent: Friday, November 8, 2019 7:21 AM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to
> host
> > > > >
> > > > > On Thu, 24 Oct 2019 08:26:23 -0400
> > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > This patch adds vfio support to bind guest translation structure
> > > > > > to host iommu. VFIO exposes iommu programming capability to user-
> > > > > > space. Guest is a user-space application in host under KVM solution.
> > > > > > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > > > > > structure. And this part should be passdown to host to enable nested
> > > > > > translation (or say two stage translation). This patch reuses the
> > > > > > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > > > > > bind type for binding guest owned translation structure to host.
> > > > > >
> > > > > > *) Add two new ioctls for VFIO containers.
> > > > > >
> > > > > >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> > > > > >                    bind a process to a pasid or bind a guest pasid
> > > > > >                    to a device, this is indicated by type
> > > > > >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> > > > > >                    unbind a process to a pasid or unbind a guest pasid
> > > > > >                    to a device, also indicated by type
> > > > > >   - Bind type:
> > > > > > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a
> process
> > > > > >                    to a device
> > > > > > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> > > > > >                    structure to host iommu. e.g. guest page table
> > > > > >
> > > > > > *) Code logic in vfio_iommu_type1_ioctl() to handle
> > > VFIO_IOMMU_BIND/UNBIND
> > > > > >
> > [...]
> > > > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > > > > +					    void __user *arg,
> > > > > > +					    struct vfio_iommu_type1_bind
> *bind)
> > > > > > +{
> > > > > > +	struct iommu_gpasid_bind_data gbind_data;
> > > > > > +	unsigned long minsz;
> > > > > > +	int ret = 0;
> > > > > > +
> > > > > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > > > > +	if (bind->argsz < minsz)
> > > > > > +		return -EINVAL;
> > > > >
> > > > > But gbind_data can change size if new vendor specific data is added to
> > > > > the union, so kernel updates break existing userspace.  Fail.
> 
> I guess we could take minsz up to the vendor-specific data, copy @format,
> and then check the size of vendor-specific data?
> 
> > > >
> > > > yes, we have a version field in struct iommu_gpasid_bind_data. How
> > > > about doing sanity check per versions? kernel knows the gbind_data
> > > > size of specific versions. Does it make sense? If yes, I'll also apply it
> > > > to the other sanity check in this series to avoid userspace fail after
> > > > kernel update.
> > >
> > > Has it already been decided that the version field will be updated for
> > > every addition to the union?
> >
> > No, just my proposal. Jacob may help to explain the purpose of version
> > field. But if we may be too  "frequent" for an uapi version number updating
> > if we inc version for each change in the union part. I may vote for the
> > second option from you below.
> >
> > > It seems there are two options, either
> > > the version definition includes the possible contents of the union,
> > > which means we need to support multiple versions concurrently in the
> > > kernel to maintain compatibility with userspace and follow deprecation
> > > protocols for removing that support, or we need to consider version to
> > > be the general form of the structure and interpret the format field to
> > > determine necessary length to copy from the user.
> >
> > As I mentioned above, may be better to let @version field only over the
> > general fields and let format to cover the possible changes in union. e.g.
> > IOMMU_PASID_FORMAT_INTEL_VTD2 may means version 2 of Intel
> > VT-d bind. But either way, I think we need to let kernel maintain multiple
> > versions to support compatible userspace. e.g. may have multiple versions
> > iommu_gpasid_bind_data_vtd struct in the union part.
> 
> I couldn't find where the @version field originated in our old
> discussions, but I believe our plan for allowing future extensions was:
> 
> * Add new vendor-specific data by introducing a new format
>   (IOMMU_PASID_FORMAT_INTEL_VTD2,
> IOMMU_PASID_FORMAT_ARM_SMMUV2...), and
>   extend the union.
> 
> * Add a new common field, if it fits in the existing padding bytes, by
>   adding a flag (IOMMU_SVA_GPASID_*).
> 
> * Add a new common field, if it doesn't fit in the current padding bytes,
>   or completely change the structure layout, by introducing a new version
>   (IOMMU_GPASID_BIND_VERSION_2). In that case the kernel has to handle
>   both new and old structure versions. It would have both
>   iommu_gpasid_bind_data and iommu_gpasid_bind_data_v2 structs.
> 
> I think iommu_cache_invalidate_info and iommu_page_response use the same
> scheme. iommu_fault is a bit more complicated because it's
> kernel->userspace and requires some negotiation:
> https://lore.kernel.org/linux-iommu/77405d39-81a4-d9a8-5d35-
> 27602199867a@arm.com/
> 
> [...]
> > > If the ioctls have similar purpose and form, then re-using a single
> > > ioctl might make sense, but BIND_PROCESS is only a place-holder in this
> > > series, which is not acceptable.  A dual purpose ioctl does not
> > > preclude that we could also use a union for the data field to make the
> > > structure well specified.
> >
> > yes, BIND_PROCESS is only a place-holder here. From kernel p.o.v., both
> > BIND_GUEST_PASID and BIND_PROCESS are bind requests from userspace.
> > So the purposes are aligned. Below is the content the @data[] field
> > supposed to convey for BIND_PROCESS. If we use union, it would leave
> > space for extending it to support BIND_PROCESS. If only data[], it is a little
> > bit confusing why we define it in such manner if BIND_PROCESS is included
> > in this series. Please feel free let me know which one suits better.
> >
> > +struct vfio_iommu_type1_bind_process {
> > +	__u32	flags;
> > +#define VFIO_IOMMU_BIND_PID		(1 << 0)
> > +	__u32	pasid;
> > +	__s32	pid;
> > +};
> > https://patchwork.kernel.org/patch/10394927/
> 
> Note that I don't plan to upstream BIND_PROCESS at the moment. It was
> useful for testing but I don't know of anyone actually needing it.
> 
> > > > > That bind data
> > > > > structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How
> > > does
> > > > > a user determine what formats are accepted from within the vfio API (or
> > > > > even outside of the vfio API)?
> > > >
> > > > The info is provided by vIOMMU emulator (e.g. virtual VT-d). The vSVA patch
> > > > from Jacob has a sanity check on it.
> > > > https://lkml.org/lkml/2019/10/28/873
> > >
> > > The vIOMMU emulator runs at a layer above vfio.  How does the vIOMMU
> > > emulator know that the vfio interface supports virtual VT-d?  IMO, it's
> > > not acceptable that the user simply assume that an Intel host platform
> > > supports VT-d.  For example, consider what happens when we need to
> > > define IOMMU_PASID_FORMAT_INTEL_VTDv2.  How would the user learn that
> > > VTDv2 is supported and the original VTD format is not supported?
> >
> > I guess this may be another info VFIO_IOMMU_GET_INFO should provide.
> > It makes sense that vfio be aware of what platform it is running on. right?
> > After vfio gets the info, may let vfio fill in the format info. Is it the correct
> > direction?
> 
> I thought you were planning to put that information in sysfs?  We last
> discussed this over a year ago so I don't remember where we left it. I
> know Alex isn't keen on putting in sysfs what can be communicated through
> VFIO, but it is a convenient way to describe IOMMU features:
> http://www.linux-arm.org/git?p=linux-
> jpb.git;a=commitdiff;h=665370d5b5e0022c24b2d2b57975ef6fe7b40870;hp=7ce780
> d838889b53f5e04ba5d444520621261eda
> 
> My problem with GET_INFO was that it could be difficult to extend, and
> to describe things like variable-size list of supported page table
> formats, but I guess the new info capabilities make this easier.
> 
> Thanks,
> Jean

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2019-11-13 19:45                 ` Jacob Pan
@ 2019-11-25  8:32                   ` Liu, Yi L
  0 siblings, 0 replies; 32+ messages in thread
From: Liu, Yi L @ 2019-11-25  8:32 UTC (permalink / raw)
  To: Jacob Pan, Alex Williamson
  Cc: eric.auger, Tian, Kevin, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe.brucker, peterx, iommu, kvm

Hi Alex,

I think the major open questions in this thread are the pasid allocation
limit for each VM and pasid lifecycle management. For lifecycle
management, we will free a pasid when userspace requests it, and reclaim
any remaining pasids when the VM crashes. For the allocation limit, we
may have a tunable quota that scales with the number of assigned
devices. If there is no apparent issue, we will prepare a new version to
see if it is workable.
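
As a rough illustration of the quota idea, based on Jacob's
nr_pasid_per_group * nr_groups proposal quoted below: the limit grows as
groups are added to the container, each allocation is charged against it,
and everything still charged is reclaimed when the container goes away.
The sketch_* names and the struct are hypothetical, and overflow of the
multiplication is ignored here; this is a userspace sketch, not the
planned implementation.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_NR_PASID_PER_GROUP_DEFAULT 1000u	/* assumed default */

struct sketch_pasid_quota {
	unsigned int per_group;		/* tunable, e.g. a module parameter */
	unsigned int nr_groups;		/* groups attached to the container */
	unsigned int nr_allocated;	/* PASIDs currently charged */
};

/* Current limit: scales with the number of attached groups. */
static unsigned int sketch_quota_limit(const struct sketch_pasid_quota *q)
{
	return q->per_group * q->nr_groups;
}

/* Charge one PASID; fails with -ENOSPC once the limit is reached. */
static int sketch_quota_charge(struct sketch_pasid_quota *q)
{
	if (q->nr_allocated >= sketch_quota_limit(q))
		return -ENOSPC;
	q->nr_allocated++;
	return 0;
}

/* Uncharge one PASID after userspace frees it. */
static void sketch_quota_uncharge(struct sketch_pasid_quota *q)
{
	assert(q->nr_allocated > 0);
	q->nr_allocated--;
}

/*
 * Reclaim everything still charged, e.g. when the container fd is
 * released after a crash. Returns the number of PASIDs reclaimed.
 */
static unsigned int sketch_quota_release_all(struct sketch_pasid_quota *q)
{
	unsigned int n = q->nr_allocated;

	q->nr_allocated = 0;
	return n;
}
```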

Thanks,
Yi Liu

> From: Jacob Pan [mailto:jacob.jun.pan@linux.intel.com]
> Sent: Thursday, November 14, 2019 3:45 AM
> To: Alex Williamson <alex.williamson@redhat.com>
> Cc: Liu, Yi L <yi.l.liu@intel.com>; eric.auger@redhat.com; Tian, Kevin
> <kevin.tian@intel.com>; joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>; Tian,
> Jun J <jun.j.tian@intel.com>; Sun, Yi Y <yi.y.sun@intel.com>; jean-
> philippe.brucker@arm.com; peterx@redhat.com; iommu@lists.linux-foundation.org;
> kvm@vger.kernel.org; jacob.jun.pan@linux.intel.com
> Subject: Re: [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Wed, 13 Nov 2019 08:29:40 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 13 Nov 2019 11:03:17 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Friday, November 8, 2019 11:15 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [RFC v2 2/3] vfio/type1:
> > > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > >
> > > > On Fri, 8 Nov 2019 12:23:41 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Friday, November 8, 2019 6:07 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Subject: Re: [RFC v2 2/3] vfio/type1:
> > > > > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > > >
> > > > > > On Wed, 6 Nov 2019 13:27:26 +0000
> > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > >
> > > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > Sent: Wednesday, November 6, 2019 7:36 AM
> > > > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > > Subject: Re: [RFC v2 2/3] vfio/type1:
> > > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > > > > >
> > > > > > > > On Thu, 24 Oct 2019 08:26:22 -0400
> > > > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > > > >
> > > > > > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which
> > > > > > > > > aims to passdown PASID allocation/free request from the
> > > > > > > > > virtual iommu. This is required to get PASID managed in
> > > > > > > > > system-wide.
> > > > > > > > >
> > > > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > > > > ---
> > > > > > > > >  drivers/vfio/vfio_iommu_type1.c | 114
> > > > > > > > ++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  include/uapi/linux/vfio.h       |  25 +++++++++
> > > > > > > > >  2 files changed, 139 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > > > b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > index cd8d3a5..3d73a7d 100644
> > > > > > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > > > @@ -2248,6 +2248,83 @@ static int
> > > > > > > > > vfio_cache_inv_fn(struct device *dev,
> > > > > > void
> > > > > > > > *data)
> > > > > > > > >  	return iommu_cache_invalidate(dc->domain, dev,
> > > > > > > > > &ustruct->info); }
> > > > > > > > >
> > > > > > > > > +static int vfio_iommu_type1_pasid_alloc(struct
> > > > > > > > > vfio_iommu *iommu,
> > > > > > > > > +					 int min_pasid,
> > > > > > > > > +					 int max_pasid)
> > > > > > > > > +{
> > > > > > > > > +	int ret;
> > > > > > > > > +	ioasid_t pasid;
> > > > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > > > +
> > > > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > +		goto out_unlock;
> > > > > > > > > +	}
> > > > > > > > > +	mm = get_task_mm(current);
> > > > > > > > > +	/* Track ioasid allocation owner by mm */
> > > > > > > > > +	pasid = ioasid_alloc((struct ioasid_set *)mm,
> > > > > > > > > min_pasid,
> > > > > > > > > +				max_pasid, NULL);
> > > > > > > >
> > > > > > > > Are we sure we want to tie this to the task mm vs perhaps
> > > > > > > > the vfio_iommu pointer?
> > > > > > >
> > > > > > > Here we want to have a kind of per-VM mark, which can be
> > > > > > > used to do ownership check on whether a pasid is held by a
> > > > > > > specific VM. This is very important to prevent across VM
> > > > > > > affect. vfio_iommu pointer is competent for vfio as vfio is
> > > > > > > both pasid alloc requester and pasid consumer. e.g. vfio
> > > > > > > requests pasid alloc from ioasid and also it will invoke
> > > > > > > bind_gpasid(). vfio can either check ownership before
> > > > > > > invoking bind_gpasid() or pass vfio_iommu pointer to iommu
> > > > > > > driver. But in future, there may be other modules which are
> > > > > > > just consumers of pasid. And they also want to do ownership
> > > > > > > check for a pasid. Then, it would be hard for them as they
> > > > > > > are not the pasid alloc requester. So here better to have a
> > > > > > > system wide structure to perform as the per-VM mark. task
> > > > > > > mm looks to be much competent.
> > > > > >
> > > > > > Ok, so it's intentional to have a VM-wide token.  Elsewhere
> > > > > > in the type1 code (vfio_dma_do_map) we record the task_struct
> > > > > > per dma mapping so that we can get the task mm as needed.
> > > > > > Would the task_struct pointer provide any advantage?
> > > > >
> > > > > I think we may use task_struct pointer to make type1 code
> > > > > consistent. How do you think?
> > > >
> > > > If it has the same utility, sure.
> > >
> > > thanks, I'll make this change.
> > >
> > > > > > Also, an overall question, this provides userspace with pasid
> > > > > > alloc and free ioctls, (1) what prevents a userspace process
> > > > > > from consuming every available pasid, and (2) if the process
> > > > > > exits or crashes without freeing pasids, how are they
> > > > > > recovered aside from a reboot?
> > > > >
> > > > > For question (1), I think we only need to take care about
> > > > > malicious userspace process. As vfio usage is under privilege
> > > > > mode, so we may be safe on it so far.
> > > >
> > > > No, where else do we ever make this assumption?  vfio requires a
> > > > privileged entity to configure the system for vfio, bind devices
> > > > for user use, and grant those devices to the user, but the usage
> > > > of the device is always assumed to be by an unprivileged user.
> > > > It is absolutely not acceptable require a privileged user.  It's
> > > > vfio's responsibility to protect the system from the user.
> > >
> > > My assumption is not precise here. sorry for it... Maybe to further
> > > check with you to better understand your point. I think the user
> > > (QEMU) of vfio needs to have a root permission. Thus it can open
> > > the vfio fds. At this point, the user is a privileged one. Also I
> > > guess that's why vfio can grant the user with the usage of
> > > VFIO_MAP/UNMAP to config mappings into iommu page tables. But I'm
> > > not quite sure when will the user be an unprivileged one.
> >
> > QEMU does NOT need to be run as root to use vfio.  This is NOT the
> > model libvirt follows.  libvirt grants a user access to a device, or
> > rather a set of one or more devices (ie. the group) via standard file
> > permission access to the group file (/dev/vfio/$GROUP).  Ownership of
> > a device allows the user permission to make use of the IOMMU.  The
> > user's ability to create DMA mappings is restricted by their process
> > locked memory limits, where libvirt elevates the user limit
> > sufficient for the size of the VM.  QEMU should never need to be run
> > as root and doing so is entirely unacceptable from a security
> > perspective.  The only mode of vfio that requires elevated privilege
> > for use is when making use of no-iommu, where we have no IOMMU
> > protection or translation.
> >
> > > > > However, we may need to introduce a kind of credit
> > > > > mechanism to protect it. I've thought it, but no good idea yet.
> > > > > Would be happy to hear from you.
> > > >
> > > > It's a limited system resource and it's unclear how many might
> > > > > reasonably be used by a user.  I don't have an easy answer.
> > >
> > > How about the below method? based on some offline chat with Jacob.
> > > a. some reasonable defaults for the initial per VM quota, e.g. 1000
> > > per process
> > > b. IOASID should be able to enforce per ioasid_set (it is kind of
> > > per VM mark) limit
> >
> > We support large numbers of assigned devices, how many IOASIDs might
> > be reasonably used per device?  Is the mm or the task still the
> > correct "set" in this scenario?  I don't have any better ideas than
> > setting a limit, but it probably needs a kernel or module tunable,
> > and it needs to match the scaling we expect to see when multiple
> > devices are involved.
> >
> I think mm/task is still the correct set in that we try to prevent
> abuse based on mm not device. Or we need to have some notion of super
> container (Ashok proposed a while ago) that maps to a VM.
> 
> I am guessing you are suggesting the per mm quota should also be
> scaled against number of devices assigned. I think that is very
> reasonable. Perhaps we can do:
> 
> 1. A tunable per iommu group PASID quota with a default of 1000, e.g.
> nr_pasid_per_group. Since we are dealing with nested translation and
> each device has its own second level so we pretty much have one device
> per group. Call it nr_pasid_per_group?
> 
> 2. Limit number of PASIDs per VM with nr_pasid_per_group * nr_groups.
> Probably update the limit when group is added to a container with the
> same mm.
> 
> I guess we could also use a cgroup controller for PASIDs.
> 
> > > > > For question (2), I think we need to reclaim the allocated
> > > > > pasids when the vfio container fd is released just like what
> > > > > vfio does to the domain mappings. I didn't add it yet. But I
> > > > > can add it in next version if you think it would make the pasid
> > > > > alloc/free be much sound.
> > > >
> > > > Consider it required, the interface is susceptible to abuse
> > > > without it.
> > >
> > > sure, let me add it in next version.
> > >
> > > > > > > > > +	if (pasid == INVALID_IOASID) {
> > > > > > > > > +		ret = -ENOSPC;
> > > > > > > > > +		goto out_unlock;
> > > > > > > > > +	}
> > > > > > > > > +	ret = pasid;
> > > > > > > > > +out_unlock:
> > > > > > > > > +	mutex_unlock(&iommu->lock);
> > > > > >
> > > > > > What does holding this lock protect?  That the vfio_iommu
> > > > > > remains backed by an iommu during this operation, even though
> > > > > > we don't do anything to release allocated pasids when that
> > > > > > iommu backing is removed?
> > > > >
> > > > > yes, it is unnecessary to hold the lock here. At least for the
> > > > > operations in this patch. will remove it. :-)
> > > > >
> > > > > > > > > +	if (mm)
> > > > > > > > > +		mmput(mm);
> > > > > > > > > +	return ret;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > > > > > > +				       unsigned int pasid)
> > > > > > > > > > +{
> > > > > > > > > +	struct mm_struct *mm = NULL;
> > > > > > > > > +	void *pdata;
> > > > > > > > > +	int ret = 0;
> > > > > > > > > +
> > > > > > > > > +	mutex_lock(&iommu->lock);
> > > > > > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > > > > > +		ret = -EINVAL;
> > > > > > > > > +		goto out_unlock;
> > > > > > > > > +	}
> > > > > > > > > +
> > > > > > > > > +	/**
> > > > > > > > > +	 * REVISIT:
> > > > > > > > > +	 * There are two cases free could fail:
> > > > > > > > > +	 * 1. free pasid by non-owner, we use
> > > > > > > > > ioasid_set to track mm, if
> > > > > > > > > +	 * the set does not match, caller is not
> > > > > > > > > permitted to free.
> > > > > > > > > +	 * 2. free before unbind all devices, we can
> > > > > > > > > check if ioasid private
> > > > > > > > > +	 * data, if data != NULL, then fail to free.
> > > > > > > > > +	 */
> > > > > > > > > +	mm = get_task_mm(current);
> > > > > > > > > +	pdata = ioasid_find((struct ioasid_set *)mm,
> > > > > > > > > pasid, NULL);
> > > > > > > > > +	if (IS_ERR(pdata)) {
> > > > > > > > > +		if (pdata == ERR_PTR(-ENOENT))
> > > > > > > > > +			pr_err("PASID %u is not
> > > > > > > > > allocated\n", pasid);
> > > > > > > > > +		else if (pdata == ERR_PTR(-EACCES))
> > > > > > > > > +			pr_err("Free PASID %u by
> > > > > > > > > non-owner, denied",
> > > > pasid);
> > > > > > > > > +		else
> > > > > > > > > +			pr_err("Error searching PASID
> > > > > > > > > %u\n", pasid);
> > > > > > > >
> > > > > > > > This should be removed, errno is sufficient for the user,
> > > > > > > > this just provides the user with a trivial DoS vector
> > > > > > > > filling logs.
> > > > > > >
> > > > > > > sure, will fix it. thanks.
> > > > > > >
> > > > > > > > > +		ret = -EPERM;
> > > > > > > >
> > > > > > > > But why not return PTR_ERR(pdata)?
> > > > > > >
> > > > > > > aha, would do it.
> > > > > > >
> > > > > > > > > +		goto out_unlock;
> > > > > > > > > +	}
> > > > > > > > > +	if (pdata) {
> > > > > > > > > > +		pr_debug("Cannot free pasid %d with private data\n", pasid);
> > > > > > > > > > +		/* Expect PASID has no private data if not bond */
> > > > > > > > > +		ret = -EBUSY;
> > > > > > > > > +		goto out_unlock;
> > > > > > > > > +	}
> > > > > > > > > +	ioasid_free(pasid);
> > > > > > > >
> > > > > > > > We only ever get here with pasid == NULL?!
> > > > > > >
> > > > > > > I guess you meant only when pdata==NULL.
> > > > > > >
> > > > > > > > Something is wrong.  Should
> > > > > > > > that be 'if (!pdata)'?  (which also makes that pr_debug
> > > > > > > > another DoS vector)
> > > > > > >
> > > > > > > Oh, yes, just do it as below:
> > > > > > >
> > > > > > > if (!pdata) {
> > > > > > > 	ioasid_free(pasid);
> > > > > > > 	ret = SUCCESS;
> > > > > > > } else
> > > > > > > 	ret = -EBUSY;
> > > > > > >
> > > > > > > Is it what you mean?
> > > > > >
> > > > > > No, I think I was just confusing pdata and pasid, but I am
> > > > > > still confused about testing pdata.  We call ioasid_alloc()
> > > > > > with private = NULL, and I don't see any of your patches
> > > > > > calling ioasid_set_data() to change the private data after
> > > > > > allocation, so how could this ever be set?  Should this just
> > > > > > be a BUG_ON(pdata) as the integrity of the system is in
> > > > > > question should this state ever occur?  Thanks,
> > > > >
> > > > > ioasid_set_data() was called  in one patch from Jacob's vSVA
> > > > > patchset. [PATCH v6 08/10] iommu/vt-d: Add bind guest PASID
> > > > > support https://lkml.org/lkml/2019/10/22/946
> > > > >
> > > > > The basic idea is to allocate pasid with private=NULL, and set
> > > > > it when the pasid is actually bind to a device (bind_gpasid()).
> > > > > Each bind_gpasid() will increase the ref_cnt in the private
> > > > > data, and each unbind_gpasid() will decrease the ref_cnt. So if
> > > > > bind/unbind_gpasid() is called in mirror, the private data
> > > > > should be null when comes to free operation. If not, vfio can
> > > > > believe that the pasid is still in use.
> > > >
> > > > So this is another opportunity to leak pasids.  What's a user
> > > > supposed to do when their attempt to free a pasid fails?  It
> > > > invites leaks to allow this path to fail.  Thanks,
> > >
> > > Agreed, may no need to fail pasid free as it may leak pasid. How
> > > about always let free successful? If the ref_cnt is non-zero,
> > > notify the remaining users to release their reference.
> >
> > > If a user frees a PASID, they've done their due diligence in
> > indicating it's no longer used.  The kernel should handle reclaiming
> > it from that point.  Thanks,
> 
> Yeah, I think we can add an atomic notifier for each PASID.
> Consumers such as the IOMMU driver and KVM get notified when an IOASID
> is freed by VFIO. The IOMMU driver can then do the unbind and tear down.
> 
> In case of all the users already did unbind() before ioasid_free(), the
> free will proceed as usual.
> 
> > Alex
> >


* Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-11-25  7:45             ` Liu, Yi L
@ 2019-12-03  0:11               ` Alex Williamson
  2019-12-05 12:19                 ` Liu, Yi L
  0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2019-12-03  0:11 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jean-Philippe Brucker, eric.auger, jacob.jun.pan, Tian, Kevin,
	Raj, Ashok, kvm, Tian, Jun J, iommu, Sun, Yi Y, Wu, Hao, Lu,
	Baolu

On Mon, 25 Nov 2019 07:45:18 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> Thanks for the review. Here I'd like to conclude the major opens in this
> thread and see if we can get some agreements to prepare a new version.
> 
> a) IOCTLs for BIND_GPASID and BIND_PROCESS, share a single IOCTL or two
>    separate IOCTLs?
>    Yi: It may be helpful to have separate IOCTLs. The bind data conveyed
>    for BIND_GPASID and BIND_PROCESS are totally different, and the struct
>    iommu_gpasid_bind_data has vendor specific data and may even have more
>    versions in future. To better maintain it, I guess separate IOCTLs for
>    the two bind types would be better. The structure for BIND_GPASID is
>    as below:
> 
>         struct vfio_iommu_type1_bind {
>                 __u32                           argsz;
>                 struct iommu_gpasid_bind_data   bind_data;
>         };


We've been rather successful at extending ioctls in vfio and I'm
generally opposed to rampant ioctl proliferation.  If we added @flags
to the above struct (as pretty much the standard for vfio ioctls), then
we could use it to describe the type of binding to perform and
therefore the type of data provided.  I think my major complaint here
was that we were defining PROCESS but not implementing it.  We can
design the ioctl to enable it, but not define it until it's implemented.
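
A minimal sketch of that layout, for illustration only. Every name here (struct sketch_bind, SKETCH_BIND_FLAG_GPASID, sketch_bind_check) is hypothetical and not part of the vfio uapi; the point is only that @flags selects how the payload is interpreted, and that no PROCESS bit exists until that bind type is implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag selecting a guest-PASID bind; not a real uapi value. */
#define SKETCH_BIND_FLAG_GPASID (1u << 0)
/* Deliberately no PROCESS flag: it is left undefined until implemented. */

/* Hypothetical layout: argsz and flags lead, the payload follows. */
struct sketch_bind {
	uint32_t argsz;  /* total size supplied by the caller */
	uint32_t flags;  /* selects how data[] is interpreted */
	uint8_t  data[]; /* payload, e.g. a guest PASID bind descriptor */
};

/* Return 0 if the header is acceptable, or a negative errno. */
static int sketch_bind_check(const struct sketch_bind *bind)
{
	size_t minsz = offsetof(struct sketch_bind, data);

	assert(bind != NULL);
	if (bind->argsz < minsz)
		return -EINVAL;
	/* Reject unknown bits so they stay available for later bind types. */
	if (bind->flags & ~SKETCH_BIND_FLAG_GPASID)
		return -EINVAL;
	/* Some bind type must be requested. */
	if (!(bind->flags & SKETCH_BIND_FLAG_GPASID))
		return -EINVAL;
	return 0;
}
```

Rejecting undefined flag bits up front is what lets a later bind type be added without a new ioctl: old kernels fail cleanly on the new bit instead of misreading the payload.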
 
> b) how kernel-space learns the number of bytes to be copied (a.k.a. the
>    usage of @version field and @format field of struct
>    iommu_gpasid_bind_data)
>    Yi: Jean has an excellent recap in prior reply on the plan of future
>    extensions regards to @version field and @format field. Based on the
>    plan, kernel space needs to parse the @version field and @format field
>    to get the length of the current BIND_GPASID request. Also kernel needs
>    to maintain the new and old structure versions. Follow specific
>    deprecation policy in future.

Yes, it seems reasonable, so from the struct above (plus @flags) we
could determine we have struct iommu_gpasid_bind_data as the payload
and read that using @version and @format as outlined.
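
As a rough sketch of that lookup, assuming a single structure version and a single vendor format: the field names, constants, and sizes below are made up for illustration and do not match struct iommu_gpasid_bind_data. @version fixes the common layout, and @format picks the vendor-specific length that follows it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* All names and sizes below are hypothetical stand-ins. */
enum { SKETCH_VERSION_1 = 1 };
enum { SKETCH_FORMAT_VTD = 1 };

struct sketch_vtd_data {
	uint64_t flags;
	uint32_t pat;
	uint32_t emt;
};

struct sketch_gpasid_bind {
	uint32_t version; /* general form of the structure */
	uint32_t format;  /* selects the member of the union below */
	uint64_t gpgd;
	uint64_t hpasid;
	uint64_t gpasid;
	union {
		struct sketch_vtd_data vtd;
	} vendor;
};

/* Return the number of bytes to copy for (version, format), or -errno. */
static long sketch_bind_data_size(uint32_t version, uint32_t format)
{
	size_t common = offsetof(struct sketch_gpasid_bind, vendor);

	if (version != SKETCH_VERSION_1)
		return -EINVAL;

	switch (format) {
	case SKETCH_FORMAT_VTD:
		/* Never report more than the kernel-side struct can hold. */
		assert(common + sizeof(struct sketch_vtd_data) <=
		       sizeof(struct sketch_gpasid_bind));
		return (long)(common + sizeof(struct sketch_vtd_data));
	default:
		return -EINVAL;
	}
}
```

With this shape, adding vendor data to the union means adding a case for a new format value, and the size reported for existing formats does not change, so existing userspace keeps working.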

> c) how can vIOMMU emulator know that the vfio interface supports to config
>    dual stage translation for vIOMMU?
>    Yi: may do it via VFIO_IOMMU_GET_INFO.

Yes please.
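
One way userspace could consume such information is by walking a capability chain returned alongside the info structure. The sketch below redefines a header with the same shape as a vfio capability header so that it builds without kernel headers; SKETCH_CAP_NESTING and sketch_find_cap are hypothetical names, and no such capability id is defined by this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as a vfio capability header, redefined for a standalone build. */
struct sketch_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next; /* offset of the next capability from buffer start, 0 ends */
};

/* Hypothetical capability id advertising dual stage translation support. */
#define SKETCH_CAP_NESTING 0x7fff

/* Return 1 and fill *out if a capability with @id is found, else 0. */
static int sketch_find_cap(const uint8_t *buf, size_t len, uint32_t first,
			   uint16_t id, struct sketch_cap_header *out)
{
	uint32_t off = first;
	size_t hops = 0;

	assert(buf != NULL && out != NULL);
	while (off != 0) {
		/* The header must lie entirely inside the buffer. */
		if (off > len || len - off < sizeof(*out))
			return 0;
		/* Bound the walk so a looping chain cannot spin forever. */
		if (hops++ > len / sizeof(*out))
			return 0;
		memcpy(out, buf + off, sizeof(*out));
		if (out->id == id)
			return 1;
		off = out->next;
	}
	return 0;
}
```

A vIOMMU emulator would call something like this on the buffer returned by the info ioctl and enable nested translation only when the capability is present.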

> d) how can vIOMMU emulator know what @version and @format should be set
>    in struct iommu_gpasid_bind_data?
>    Yi: currently, we have two ways. First one, may do it via
>    VFIO_IOMMU_GET_INFO. This is a natural idea as here @version and @format
>    are used in vfio apis. It makes sense to let vfio to provide related info
>    to vIOMMU emulator after checking with vendor specific iommu driver. Also,
>    there is idea to do it via sysfs (/sys/class/iommu/dmar#) as we have plan
>    to do IOMMU capability sync between vIOMMU and pIOMMU via sysfs. I have
>    two concern on this option. Current iommu sysfs only provides vendor
>    specific hardware infos. I'm not sure if it is good to expose infos
>    defined in IOMMU generic layer via iommu sysfs. If this concern is not
>    a big thing, I'm fine with both options.

This seems like the same issue we had with IOMMU reserved regions, I'd
prefer that a user can figure out how to interact with the vfio
interface through the vfio interface.  Forcing the user to poke around
in sysfs requires the user to have read permissions to sysfs in places
they otherwise wouldn't need.  Thanks,

Alex

> > From: Jean-Philippe Brucker [mailto:jean-philippe@linaro.org]
> > Sent: Wednesday, November 13, 2019 6:29 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > 
> > On Wed, Nov 13, 2019 at 07:43:43AM +0000, Liu, Yi L wrote:  
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, November 13, 2019 1:26 AM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > > >
> > > > On Tue, 12 Nov 2019 11:21:40 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson < alex.williamson@redhat.com >
> > > > > > Sent: Friday, November 8, 2019 7:21 AM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> > > > > >
> > > > > > On Thu, 24 Oct 2019 08:26:23 -0400
> > > > > > Liu Yi L <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > This patch adds vfio support to bind guest translation structure
> > > > > > > to host iommu. VFIO exposes iommu programming capability to user-
> > > > > > > space. Guest is a user-space application in host under KVM solution.
> > > > > > > For SVA usage in Virtual Machine, guest owns GVA->GPA translation
> > > > > > > structure. And this part should be passdown to host to enable nested
> > > > > > > translation (or say two stage translation). This patch reuses the
> > > > > > > VFIO_IOMMU_BIND proposal from Jean-Philippe Brucker, and adds new
> > > > > > > bind type for binding guest owned translation structure to host.
> > > > > > >
> > > > > > > *) Add two new ioctls for VFIO containers.
> > > > > > >
> > > > > > >   - VFIO_IOMMU_BIND: for bind request from userspace, it could be
> > > > > > >                    bind a process to a pasid or bind a guest pasid
> > > > > > >                    to a device, this is indicated by type
> > > > > > >   - VFIO_IOMMU_UNBIND: for unbind request from userspace, it could be
> > > > > > >                    unbind a process to a pasid or unbind a guest pasid
> > > > > > >                    to a device, also indicated by type
> > > > > > >   - Bind type:
> > > > > > > > 	VFIO_IOMMU_BIND_PROCESS: user-space request to bind a process
> > > > > > >                    to a device
> > > > > > > 	VFIO_IOMMU_BIND_GUEST_PASID: bind guest owned translation
> > > > > > >                    structure to host iommu. e.g. guest page table
> > > > > > >
> > > > > > > > *) Code logic in vfio_iommu_type1_ioctl() to handle VFIO_IOMMU_BIND/UNBIND
> > > > > > >  
> > > [...]  
> > > > > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > > > > > +					    void __user *arg,
> > > > > > > > +					    struct vfio_iommu_type1_bind *bind)
> > > > > > > +{
> > > > > > > +	struct iommu_gpasid_bind_data gbind_data;
> > > > > > > +	unsigned long minsz;
> > > > > > > +	int ret = 0;
> > > > > > > +
> > > > > > > +	minsz = sizeof(*bind) + sizeof(gbind_data);
> > > > > > > +	if (bind->argsz < minsz)
> > > > > > > +		return -EINVAL;  
> > > > > >
> > > > > > But gbind_data can change size if new vendor specific data is added to
> > > > > > the union, so kernel updates break existing userspace.  Fail.  
> > 
> > I guess we could take minsz up to the vendor-specific data, copy @format,
> > and then check the size of vendor-specific data?
> >   
> > > > >
> > > > > yes, we have a version field in struct iommu_gpasid_bind_data. How
> > > > > about doing sanity check per versions? kernel knows the gbind_data
> > > > > size of specific versions. Does it make sense? If yes, I'll also apply it
> > > > > to the other sanity check in this series to avoid userspace fail after
> > > > > kernel update.  
> > > >
> > > > Has it already been decided that the version field will be updated for
> > > > every addition to the union?  
> > >
> > > No, just my proposal. Jacob may help to explain the purpose of version
> > > > field. But it may be too "frequent" for a uapi version number update
> > > > if we increment the version for each change in the union part. I would
> > > > vote for the second option from you below.
> > >  
> > > > It seems there are two options, either
> > > > the version definition includes the possible contents of the union,
> > > > which means we need to support multiple versions concurrently in the
> > > > kernel to maintain compatibility with userspace and follow deprecation
> > > > protocols for removing that support, or we need to consider version to
> > > > be the general form of the structure and interpret the format field to
> > > > determine necessary length to copy from the user.  
> > >
> > > As I mentioned above, may be better to let @version field only over the
> > > general fields and let format to cover the possible changes in union. e.g.
> > > IOMMU_PASID_FORMAT_INTEL_VTD2 may means version 2 of Intel
> > > VT-d bind. But either way, I think we need to let kernel maintain multiple
> > > versions to support compatible userspace. e.g. may have multiple versions
> > > iommu_gpasid_bind_data_vtd struct in the union part.  
> > 
> > I couldn't find where the @version field originated in our old
> > discussions, but I believe our plan for allowing future extensions was:
> > 
> > * Add new vendor-specific data by introducing a new format
> > >   (IOMMU_PASID_FORMAT_INTEL_VTD2, IOMMU_PASID_FORMAT_ARM_SMMUV2...), and
> > >   extend the union.
> > 
> > * Add a new common field, if it fits in the existing padding bytes, by
> >   adding a flag (IOMMU_SVA_GPASID_*).
> > 
> > * Add a new common field, if it doesn't fit in the current padding bytes,
> >   or completely change the structure layout, by introducing a new version
> >   (IOMMU_GPASID_BIND_VERSION_2). In that case the kernel has to handle
> >   both new and old structure versions. It would have both
> >   iommu_gpasid_bind_data and iommu_gpasid_bind_data_v2 structs.
> > 
> > I think iommu_cache_invalidate_info and iommu_page_response use the same
> > scheme. iommu_fault is a bit more complicated because it's
> > kernel->userspace and requires some negotiation:
> > > https://lore.kernel.org/linux-iommu/77405d39-81a4-d9a8-5d35-27602199867a@arm.com/
> > 
> > [...]  
> > > > If the ioctls have similar purpose and form, then re-using a single
> > > > ioctl might make sense, but BIND_PROCESS is only a place-holder in this
> > > > series, which is not acceptable.  A dual purpose ioctl does not
> > > > preclude that we could also use a union for the data field to make the
> > > > structure well specified.  
> > >
> > > yes, BIND_PROCESS is only a place-holder here. From kernel p.o.v., both
> > > BIND_GUEST_PASID and BIND_PROCESS are bind requests from userspace.
> > > So the purposes are aligned. Below is the content the @data[] field
> > > supposed to convey for BIND_PROCESS. If we use union, it would leave
> > > space for extending it to support BIND_PROCESS. If only data[], it is a little
> > > bit confusing why we define it in such manner if BIND_PROCESS is included
> > > in this series. Please feel free let me know which one suits better.
> > >
> > > +struct vfio_iommu_type1_bind_process {
> > > +	__u32	flags;
> > > +#define VFIO_IOMMU_BIND_PID		(1 << 0)
> > > +	__u32	pasid;
> > > +	__s32	pid;
> > > +};
> > > https://patchwork.kernel.org/patch/10394927/  
> > 
> > Note that I don't plan to upstream BIND_PROCESS at the moment. It was
> > useful for testing but I don't know of anyone actually needing it.
> >   
> > > > > > That bind data
> > > > > > > structure expects a format (ex. IOMMU_PASID_FORMAT_INTEL_VTD).  How does
> > > > > > a user determine what formats are accepted from within the vfio API (or
> > > > > > even outside of the vfio API)?  
> > > > >
> > > > > The info is provided by vIOMMU emulator (e.g. virtual VT-d). The vSVA patch
> > > > > from Jacob has a sanity check on it.
> > > > > https://lkml.org/lkml/2019/10/28/873  
> > > >
> > > > The vIOMMU emulator runs at a layer above vfio.  How does the vIOMMU
> > > > emulator know that the vfio interface supports virtual VT-d?  IMO, it's
> > > > not acceptable that the user simply assume that an Intel host platform
> > > > supports VT-d.  For example, consider what happens when we need to
> > > > define IOMMU_PASID_FORMAT_INTEL_VTDv2.  How would the user learn that
> > > > VTDv2 is supported and the original VTD format is not supported?  
> > >
> > > I guess this may be another info VFIO_IOMMU_GET_INFO should provide.
> > > It makes sense that vfio be aware of what platform it is running on. right?
> > > After vfio gets the info, may let vfio fill in the format info. Is it the correct
> > > direction?  
> > 
> > I thought you were planning to put that information in sysfs?  We last
> > discussed this over a year ago so I don't remember where we left it. I
> > know Alex isn't keen on putting in sysfs what can be communicated through
> > VFIO, but it is a convenient way to describe IOMMU features:
> > > http://www.linux-arm.org/git?p=linux-jpb.git;a=commitdiff;h=665370d5b5e0022c24b2d2b57975ef6fe7b40870;hp=7ce780d838889b53f5e04ba5d444520621261eda
> > 
> > My problem with GET_INFO was that it could be difficult to extend, and
> > to describe things like variable-size list of supported page table
> > formats, but I guess the new info capabilities make this easier.
> > 
> > Thanks,
> > Jean  
> 



* RE: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
  2019-12-03  0:11               ` Alex Williamson
@ 2019-12-05 12:19                 ` Liu, Yi L
  0 siblings, 0 replies; 32+ messages in thread
From: Liu, Yi L @ 2019-12-05 12:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jean-Philippe Brucker, eric.auger, jacob.jun.pan, Tian, Kevin,
	Raj, Ashok, kvm, Tian, Jun J, iommu, Sun, Yi Y, Wu, Hao, Lu,
	Baolu

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, December 3, 2019 8:12 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host
> 
> On Mon, 25 Nov 2019 07:45:18 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > Thanks for the review. Here I'd like to conclude the major opens in this
> > thread and see if we can get some agreements to prepare a new version.
> >
> > a) IOCTLs for BIND_GPASID and BIND_PROCESS, share a single IOCTL or two
> >    separate IOCTLs?
> >    Yi: It may be helpful to have separate IOCTLs. The bind data conveyed
> >    for BIND_GPASID and BIND_PROCESS are totally different, and the struct
> >    iommu_gpasid_bind_data has vendor specific data and may even have more
> >    versions in future. To better maintain it, I guess separate IOCTLs for
> >    the two bind types would be better. The structure for BIND_GPASID is
> >    as below:
> >
> >         struct vfio_iommu_type1_bind {
> >                 __u32                           argsz;
> >                 struct iommu_gpasid_bind_data   bind_data;
> >         };
> 
> 
> We've been rather successful at extending ioctls in vfio and I'm
> generally opposed to rampant ioctl proliferation.  If we added @flags
> to the above struct (as pretty much the standard for vfio ioctls), then
> we could use it to describe the type of binding to perform and
> therefore the type of data provided.  I think my major complaint here
> was that we were defining PROCESS but not implementing it.  We can
> design the ioctl to enable it, but not define it until it's implemented.

Sure, I'll bring back the @flags field. By the way, regarding the payload,
which would be preferred: @data[] or a wrapper structure like the one below?

	union {
		struct iommu_gpasid_bind_data   bind_gpasid;
	} bind_data;

> > b) how kernel-space learns the number of bytes to be copied (a.k.a. the
> >    usage of @version field and @format field of struct
> >    iommu_gpasid_bind_data)
> >    Yi: Jean has an excellent recap in prior reply on the plan of future
> >    extensions regards to @version field and @format field. Based on the
> >    plan, kernel space needs to parse the @version field and @format field
> >    to get the length of the current BIND_GPASID request. Also kernel needs
> >    to maintain the new and old structure versions. Follow specific
> >    deprecation policy in future.
> 
> Yes, it seems reasonable, so from the struct above (plus @flags) we
> could determine we have struct iommu_gpasid_bind_data as the payload
> and read that using @version and @format as outlined.

sure, thanks.

> > c) how can vIOMMU emulator know that the vfio interface supports to config
> >    dual stage translation for vIOMMU?
> >    Yi: may do it via VFIO_IOMMU_GET_INFO.
> 
> Yes please.

got it.

> > d) how can vIOMMU emulator know what @version and @format should be set
> >    in struct iommu_gpasid_bind_data?
> >    Yi: currently, we have two ways. First one, may do it via
> >    VFIO_IOMMU_GET_INFO. This is a natural idea as here @version and @format
> >    are used in vfio apis. It makes sense to let vfio to provide related info
> >    to vIOMMU emulator after checking with vendor specific iommu driver. Also,
> >    there is idea to do it via sysfs (/sys/class/iommu/dmar#) as we have plan
> >    to do IOMMU capability sync between vIOMMU and pIOMMU via sysfs. I have
> >    two concern on this option. Current iommu sysfs only provides vendor
> >    specific hardware infos. I'm not sure if it is good to expose infos
> >    defined in IOMMU generic layer via iommu sysfs. If this concern is not
> >    a big thing, I'm fine with both options.
> 
> This seems like the same issue we had with IOMMU reserved regions, I'd
> prefer that a user can figure out how to interact with the vfio
> interface through the vfio interface.  Forcing the user to poke around
> in sysfs requires the user to have read permissions to sysfs in places
> they otherwise wouldn't need.  Thanks,

thanks, let me prepare a new version.

Regards,
Yi Liu

> Alex



Thread overview: 32+ messages
2019-10-24 12:26 [RFC v2 0/3] vfio: support Shared Virtual Addressing Liu Yi L
2019-10-24 12:26 ` [RFC v2 1/3] vfio: VFIO_IOMMU_CACHE_INVALIDATE Liu Yi L
2019-10-25  9:14   ` Tian, Kevin
2019-10-25 11:20     ` Liu, Yi L
2019-11-05 22:42       ` Alex Williamson
2019-11-06  1:31         ` Liu, Yi L
2019-11-13  7:50           ` Auger Eric
2019-10-24 12:26 ` [RFC v2 2/3] vfio/type1: VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu Yi L
2019-10-25 10:06   ` Tian, Kevin
2019-10-25 11:16     ` Liu, Yi L
2019-11-05 23:35   ` Alex Williamson
2019-11-06 13:27     ` Liu, Yi L
2019-11-07 22:06       ` Alex Williamson
2019-11-08 12:23         ` Liu, Yi L
2019-11-08 15:15           ` Alex Williamson
2019-11-13 11:03             ` Liu, Yi L
2019-11-13 15:29               ` Alex Williamson
2019-11-13 19:45                 ` Jacob Pan
2019-11-25  8:32                   ` Liu, Yi L
2019-11-18  4:50                 ` Liu, Yi L
2019-10-24 12:26 ` [RFC v2 3/3] vfio/type1: bind guest pasid (guest page tables) to host Liu Yi L
2019-11-07 23:20   ` Alex Williamson
2019-11-12 11:21     ` Liu, Yi L
2019-11-12 17:25       ` Alex Williamson
2019-11-13  7:43         ` Liu, Yi L
2019-11-13 10:29           ` Jean-Philippe Brucker
2019-11-13 11:30             ` Liu, Yi L
2019-11-25  7:45             ` Liu, Yi L
2019-12-03  0:11               ` Alex Williamson
2019-12-05 12:19                 ` Liu, Yi L
2019-10-25  8:59 ` [RFC v2 0/3] vfio: support Shared Virtual Addressing Tian, Kevin
2019-10-25 11:18   ` Liu, Yi L
