linux-kernel.vger.kernel.org archive mirror
* [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
@ 2020-03-22 12:31 Liu, Yi L
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
                   ` (8 more replies)
  0 siblings, 9 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:31 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

Shared Virtual Addressing (SVA), a.k.a. Shared Virtual Memory (SVM) on
Intel platforms, allows address space sharing between device DMA and
applications. SVA can reduce programming complexity and enhance security.

This VFIO series is intended to expose SVA usage to VMs, i.e. sharing
guest application address space with passthru devices. This is called
vSVA in this series. Enabling vSVA as a whole requires QEMU/VFIO/IOMMU
changes. The IOMMU and QEMU changes are in separate series (listed
in the "Related series").

The high-level architecture for SVA virtualization is shown below. The key
design of vSVA support is to utilize the dual-stage IOMMU translation
(also known as IOMMU nesting translation) capability in the host IOMMU.


    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables

There are roughly four parts in this patchset, which correspond to
the basic vSVA support for PCI device assignment:
 1. vfio support for PASID allocation and free for VMs
 2. vfio support for guest page table binding request from VMs
 3. vfio support for IOMMU cache invalidation from VMs
 4. vfio support for vSVA usage on IOMMU-backed mdevs

The complete vSVA kernel upstream patches are divided into three phases:
    1. Common APIs and PCI device direct assignment
    2. IOMMU-backed Mediated Device assignment
    3. Page Request Services (PRS) support

This patchset aims at phase 1 and phase 2, and is based on Jacob's
series below.
[PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
https://lkml.org/lkml/2020/3/20/1172

The complete set for current vSVA can be found in the branch below.
https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.6-rc6

The corresponding QEMU patch series is listed below; the complete QEMU set
can be found in the branch below.
[PATCH v1 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
https://github.com/luxis1999/qemu.git: sva_vtd_v10_v1

Regards,
Yi Liu

Changelog:
	- RFC v1 -> Patch v1:
	  a) Address comments to the PASID request(alloc/free) path
	  b) Report PASID alloc/free availability to user-space
	  c) Add a vfio_iommu_type1 parameter to support pasid quota tuning
	  d) Adjusted to latest ioasid code implementation. e.g. remove the
	     code for tracking the allocated PASIDs as latest ioasid code
	     will track it, VFIO could use ioasid_free_set() to free all
	     PASIDs.

	- RFC v2 -> v3:
	  a) Refine the whole patchset to fit the rough parts of this series
	  b) Adds complete vfio PASID management framework. e.g. pasid alloc,
	  free, reclaim in VM crash/down and per-VM PASID quota to prevent
	  PASID abuse.
	  c) Adds IOMMU uAPI version check and page table format check to ensure
	  version compatibility and hardware compatibility.
	  d) Adds vSVA vfio support for IOMMU-backed mdevs.

	- RFC v1 -> v2:
	  Dropped vfio: VFIO_IOMMU_ATTACH/DETACH_PASID_TABLE.

Liu Yi L (8):
  vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  vfio/type1: Report PASID alloc/free support to userspace
  vfio: Check nesting iommu uAPI version
  vfio/type1: Report 1st-level/stage-1 format to userspace
  vfio/type1: Bind guest page tables to host
  vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  vfio/type1: Add vSVA support for IOMMU-backed mdevs

 drivers/vfio/vfio.c             | 136 +++++++++++++
 drivers/vfio/vfio_iommu_type1.c | 419 ++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h            |  21 ++
 include/uapi/linux/vfio.h       | 127 ++++++++++++
 4 files changed, 703 insertions(+)

-- 
2.7.4



* [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
@ 2020-03-22 12:31 ` Liu, Yi L
  2020-03-22 16:21   ` kbuild test robot
                     ` (4 more replies)
  2020-03-22 12:31 ` [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning Liu, Yi L
                   ` (7 subsequent siblings)
  8 siblings, 5 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:31 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

For a long time, devices have had only one DMA address space from the
platform IOMMU's point of view. This is true for both bare metal and
direct access in virtualization environments. The reason is that the
source ID of DMA in PCIe is the BDF (bus/dev/fnc ID), which results in
DMA isolation at device granularity only. However, this is changing with
the latest advancements in I/O technology. More and more platform vendors
are utilizing the PCIe PASID TLP prefix in DMA requests, thus giving
devices multiple DMA address spaces, each identified by its own PASID.
For example, Shared Virtual Addressing (SVA, a.k.a. Shared Virtual Memory)
lets a device access multiple process virtual address spaces by binding
each virtual address space to a PASID. The PASID is allocated in software
and programmed into the device in a device-specific manner. Devices which
support the PASID capability are called PASID-capable devices. If such
devices are passed through to VMs, guest software is also able to bind
guest process virtual address spaces on such devices. Therefore, the guest
software could reuse the bare metal software programming model, which
means guest software will also allocate PASIDs and program them into the
device directly. This is a dangerous situation since it has potential
PASID conflicts and unauthorized address space access. It would be safer
to let the host intercept the guest software's PASID allocation, so that
PASIDs are managed system-wide.

This patch adds the VFIO_IOMMU_PASID_REQUEST ioctl, which passes down
PASID allocation/free requests from the virtual IOMMU. Since such
requests are intended to be invoked by QEMU or other applications
running in userspace, a mechanism is needed to prevent a single
application from abusing the PASIDs available in the system. With this
in mind, this patch tracks VFIO PASID allocation per-VM. There was a
discussion about making the quota per assigned device, e.g. if a VM
has many assigned devices, then it should have more quota. However, it
is not clear how many PASIDs an assigned device will use, e.g. it is
possible that a VM with multiple assigned devices requests fewer
PASIDs. Therefore a per-VM quota would be better.
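
To make the request format concrete, here is a minimal userspace-side
sketch. The structure and flag definitions are local copies of the ones
proposed in this patch (they are not in mainline headers), the flags
check mirrors vfio_iommu_type1_pasid_req_valid(), and
make_alloc_request() is a hypothetical helper added only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the definitions proposed in this patch. */
#define VFIO_IOMMU_PASID_ALLOC	(1u << 0)
#define VFIO_IOMMU_PASID_FREE	(1u << 1)
#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | VFIO_IOMMU_PASID_FREE)

struct vfio_iommu_type1_pasid_request {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct {
			uint32_t min;
			uint32_t max;
			uint32_t result;
		} alloc_pasid;
		uint32_t free_pasid;
	};
};

/* Same rule as the kernel side: no unknown bits, never ALLOC and FREE together. */
static bool pasid_req_flags_valid(uint32_t flags)
{
	if (flags & ~VFIO_PASID_REQUEST_MASK)
		return false;
	return !((flags & VFIO_IOMMU_PASID_ALLOC) &&
		 (flags & VFIO_IOMMU_PASID_FREE));
}

/* Hypothetical helper: build an ALLOC request for a PASID between min and max. */
static struct vfio_iommu_type1_pasid_request
make_alloc_request(uint32_t min, uint32_t max)
{
	struct vfio_iommu_type1_pasid_request req;

	assert(min <= max);
	memset(&req, 0, sizeof(req));
	req.argsz = sizeof(req);
	req.flags = VFIO_IOMMU_PASID_ALLOC;
	req.alloc_pasid.min = min;
	req.alloc_pasid.max = max;
	return req;
}
```

Userspace would then pass a pointer to the request to
ioctl(container_fd, VFIO_IOMMU_PASID_REQUEST, &req) and, on success,
read the allocated PASID from req.alloc_pasid.result.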

This patch uses the struct mm pointer as a per-VM token. We also considered
using the task structure pointer and the vfio_iommu structure pointer.
However, the task structure is per-thread, which means it cannot serve
per-VM PASID alloc tracking, while the vfio_iommu structure is visible
only within vfio. Therefore, the mm structure pointer is selected. This
patch adds a structure vfio_mm. A vfio_mm is created when the first vfio
container is opened by a VM. Conversely, the vfio_mm is freed when
the last vfio container is released. Each VM is assigned a PASID
quota, so that it is not able to request PASIDs beyond its quota. This
patch adds a default quota of 1000. This quota can be tuned by the
administrator. Making the PASID quota tunable is added in another patch
in this series.
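
The lifetime and quota rules described above can be modelled in plain
userspace C. This is only a sketch under simplifying assumptions: it has
no locking (the patch uses vfio.vfio_mm_lock and a kref), it counts
PASIDs itself instead of delegating to an ioasid set, and every name in
it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_PASID_QUOTA 1000  /* same default as VFIO_DEFAULT_PASID_QUOTA */

/* Userspace stand-in for struct vfio_mm: one entry per address-space token. */
struct mm_track {
	uintptr_t token;      /* plays the role of the mm pointer */
	int refs;             /* one reference per open container */
	int quota;            /* max PASIDs this VM may hold */
	int in_use;           /* PASIDs currently allocated */
	struct mm_track *next;
};

static struct mm_track *track_list;

/* Find the entry for @token and take a reference, creating it on first use. */
static struct mm_track *track_get(uintptr_t token)
{
	struct mm_track *t;

	for (t = track_list; t; t = t->next) {
		if (t->token == token) {
			t->refs++;
			return t;
		}
	}
	t = calloc(1, sizeof(*t));
	if (!t)
		return NULL;
	t->token = token;
	t->refs = 1;
	t->quota = DEFAULT_PASID_QUOTA;
	t->next = track_list;
	track_list = t;
	return t;
}

/* Drop a reference; the last put unlinks and frees the entry. */
static void track_put(struct mm_track *t)
{
	struct mm_track **pp;

	assert(t && t->refs > 0);
	if (--t->refs)
		return;
	for (pp = &track_list; *pp; pp = &(*pp)->next) {
		if (*pp == t) {
			*pp = t->next;
			break;
		}
	}
	free(t);
}

/* Returns 0 on success, -1 once the per-token quota is exhausted. */
static int track_charge(struct mm_track *t)
{
	if (t->in_use >= t->quota)
		return -1;
	t->in_use++;
	return 0;
}
```

Two containers opened by the same process share one entry and therefore
one quota, which is the per-VM behaviour the commit message argues for.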

Previous discussions:
https://patchwork.kernel.org/patch/11209429/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
 include/linux/vfio.h            |  20 +++++++
 include/uapi/linux/vfio.h       |  41 +++++++++++++
 4 files changed, 295 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index c848262..d13b483 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/sched/mm.h>
 
 #define DRIVER_VERSION	"0.3"
 #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
@@ -46,6 +47,8 @@ static struct vfio {
 	struct mutex			group_lock;
 	struct cdev			group_cdev;
 	dev_t				group_devt;
+	struct list_head		vfio_mm_list;
+	struct mutex			vfio_mm_lock;
 	wait_queue_head_t		release_q;
 } vfio;
 
@@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
 EXPORT_SYMBOL(vfio_unregister_notifier);
 
 /**
+ * VFIO_MM objects - create, release, get, put, search
+ * Caller of the function should have held vfio.vfio_mm_lock.
+ */
+static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
+{
+	struct vfio_mm *vmm;
+	struct vfio_mm_token *token;
+	int ret = 0;
+
+	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
+	if (!vmm)
+		return ERR_PTR(-ENOMEM);
+
+	/* Per mm IOASID set used for quota control and group operations */
+	ret = ioasid_alloc_set((struct ioasid_set *) mm,
+			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
+	if (ret) {
+		kfree(vmm);
+		return ERR_PTR(ret);
+	}
+
+	kref_init(&vmm->kref);
+	token = &vmm->token;
+	token->val = mm;
+	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
+	mutex_init(&vmm->pasid_lock);
+
+	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
+
+	return vmm;
+}
+
+static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
+{
+	/* destroy the ioasid set */
+	ioasid_free_set(vmm->ioasid_sid, true);
+	mutex_unlock(&vfio.vfio_mm_lock);
+	kfree(vmm);
+}
+
+/* called with vfio.vfio_mm_lock held */
+static void vfio_mm_release(struct kref *kref)
+{
+	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
+
+	list_del(&vmm->vfio_next);
+	vfio_mm_unlock_and_free(vmm);
+}
+
+void vfio_mm_put(struct vfio_mm *vmm)
+{
+	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_mm_put);
+
+/* Assume vfio_mm_lock or vfio_mm reference is held */
+static void vfio_mm_get(struct vfio_mm *vmm)
+{
+	kref_get(&vmm->kref);
+}
+
+struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
+{
+	struct mm_struct *mm = get_task_mm(task);
+	struct vfio_mm *vmm;
+	unsigned long long val = (unsigned long long) mm;
+
+	mutex_lock(&vfio.vfio_mm_lock);
+	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
+		if (vmm->token.val == val) {
+			vfio_mm_get(vmm);
+			goto out;
+		}
+	}
+
+	vmm = vfio_create_mm(mm);
+	if (IS_ERR(vmm))
+		vmm = NULL;
+out:
+	mutex_unlock(&vfio.vfio_mm_lock);
+	mmput(mm);
+	return vmm;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
+
+int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
+{
+	ioasid_t pasid;
+	int ret = -ENOSPC;
+
+	mutex_lock(&vmm->pasid_lock);
+
+	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
+	if (pasid == INVALID_IOASID) {
+		ret = -ENOSPC;
+		goto out_unlock;
+	}
+
+	ret = pasid;
+out_unlock:
+	mutex_unlock(&vmm->pasid_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
+
+int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
+{
+	void *pdata;
+	int ret = 0;
+
+	mutex_lock(&vmm->pasid_lock);
+	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
+	if (IS_ERR(pdata)) {
+		ret = PTR_ERR(pdata);
+		goto out_unlock;
+	}
+	ioasid_free(pasid);
+
+out_unlock:
+	mutex_unlock(&vmm->pasid_lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
+
+/**
  * Module/class support
  */
 static char *vfio_devnode(struct device *dev, umode_t *mode)
@@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
 	idr_init(&vfio.group_idr);
 	mutex_init(&vfio.group_lock);
 	mutex_init(&vfio.iommu_drivers_lock);
+	mutex_init(&vfio.vfio_mm_lock);
 	INIT_LIST_HEAD(&vfio.group_list);
 	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+	INIT_LIST_HEAD(&vfio.vfio_mm_list);
 	init_waitqueue_head(&vfio.release_q);
 
 	ret = misc_register(&vfio_dev);
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a177bf2..331ceee 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -70,6 +70,7 @@ struct vfio_iommu {
 	unsigned int		dma_avail;
 	bool			v2;
 	bool			nesting;
+	struct vfio_mm		*vmm;
 };
 
 struct vfio_domain {
@@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 static void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
+	struct vfio_mm *vmm = NULL;
 
 	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
 	if (!iommu)
@@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	iommu->dma_avail = dma_entry_limit;
 	mutex_init(&iommu->lock);
 	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
+	vmm = vfio_mm_get_from_task(current);
+	if (!vmm)
+		pr_err("Failed to get vfio_mm track\n");
+	iommu->vmm = vmm;
 
 	return iommu;
 }
@@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	}
 
 	vfio_iommu_iova_free(&iommu->iova_list);
+	if (iommu->vmm)
+		vfio_mm_put(iommu->vmm);
 
 	kfree(iommu);
 }
@@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
+{
+	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
+		 (flags & VFIO_IOMMU_PASID_ALLOC &&
+		  flags & VFIO_IOMMU_PASID_FREE));
+}
+
+static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
+					 int min,
+					 int max)
+{
+	struct vfio_mm *vmm = iommu->vmm;
+	int ret = 0;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+	if (vmm)
+		ret = vfio_mm_pasid_alloc(vmm, min, max);
+	else
+		ret = -EINVAL;
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
+				       unsigned int pasid)
+{
+	struct vfio_mm *vmm = iommu->vmm;
+	int ret = 0;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	if (vmm)
+		ret = vfio_mm_pasid_free(vmm, pasid);
+	else
+		ret = -EINVAL;
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		return copy_to_user((void __user *)arg, &unmap, minsz) ?
 			-EFAULT : 0;
+
+	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
+		struct vfio_iommu_type1_pasid_request req;
+		unsigned long offset;
+
+		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
+				    flags);
+
+		if (copy_from_user(&req, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (req.argsz < minsz ||
+		    !vfio_iommu_type1_pasid_req_valid(req.flags))
+			return -EINVAL;
+
+		if (copy_from_user((void *)&req + minsz,
+				   (void __user *)arg + minsz,
+				   sizeof(req) - minsz))
+			return -EFAULT;
+
+		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
+		case VFIO_IOMMU_PASID_ALLOC:
+		{
+			int ret = 0, result;
+
+			result = vfio_iommu_type1_pasid_alloc(iommu,
+							req.alloc_pasid.min,
+							req.alloc_pasid.max);
+			if (result > 0) {
+				offset = offsetof(
+					struct vfio_iommu_type1_pasid_request,
+					alloc_pasid.result);
+				ret = copy_to_user(
+					      (void __user *) (arg + offset),
+					      &result, sizeof(result));
+			} else {
+				pr_debug("%s: PASID alloc failed\n", __func__);
+				ret = -EFAULT;
+			}
+			return ret;
+		}
+		case VFIO_IOMMU_PASID_FREE:
+			return vfio_iommu_type1_pasid_free(iommu,
+							   req.free_pasid);
+		default:
+			return -EINVAL;
+		}
 	}
 
 	return -ENOTTY;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e42a711..75f9f7f1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
 extern void vfio_unregister_iommu_driver(
 				const struct vfio_iommu_driver_ops *ops);
 
+#define VFIO_DEFAULT_PASID_QUOTA	1000
+struct vfio_mm_token {
+	unsigned long long val;
+};
+
+struct vfio_mm {
+	struct kref			kref;
+	struct vfio_mm_token		token;
+	int				ioasid_sid;
+	/* protect @pasid_quota field and pasid allocation/free */
+	struct mutex			pasid_lock;
+	int				pasid_quota;
+	struct list_head		vfio_next;
+};
+
+extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
+extern void vfio_mm_put(struct vfio_mm *vmm);
+extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
+extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
+
 /*
  * External user API
  */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a1..298ac80 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
 #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
 #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
 
+/*
+ * PASID (Process Address Space ID) is a PCIe concept which
+ * has been extended to support DMA isolation in fine-grain.
+ * With device assigned to user space (e.g. VMs), PASID alloc
+ * and free need to be system wide. This structure defines
+ * the info for pasid alloc/free between user space and kernel
+ * space.
+ *
+ * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
+ * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
+ */
+struct vfio_iommu_type1_pasid_request {
+	__u32	argsz;
+#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
+#define VFIO_IOMMU_PASID_FREE	(1 << 1)
+	__u32	flags;
+	union {
+		struct {
+			__u32 min;
+			__u32 max;
+			__u32 result;
+		} alloc_pasid;
+		__u32 free_pasid;
+	};
+};
+
+#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
+					 VFIO_IOMMU_PASID_FREE)
+
+/**
+ * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
+ *				struct vfio_iommu_type1_pasid_request)
+ *
+ * Availability of this feature depends on PASID support in the device,
+ * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
+ * is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



* [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
@ 2020-03-22 12:31 ` Liu, Yi L
  2020-03-22 17:20   ` kbuild test robot
  2020-03-30  8:40   ` Tian, Kevin
  2020-03-22 12:32 ` [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace Liu, Yi L
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:31 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch adds a module option to make the PASID quota tunable by the
administrator.

TODO: need to think more about how to make the tuning per-process.

Previous discussions:
https://patchwork.kernel.org/patch/11209429/

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio.c             | 8 +++++++-
 drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
 include/linux/vfio.h            | 3 ++-
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d13b483..020a792 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
 }
 EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
 
-int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
+int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
 {
 	ioasid_t pasid;
 	int ret = -ENOSPC;
 
 	mutex_lock(&vmm->pasid_lock);
 
+	/* update quota as it is tunable by admin */
+	if (vmm->pasid_quota != quota) {
+		vmm->pasid_quota = quota;
+		ioasid_adjust_set(vmm->ioasid_sid, quota);
+	}
+
 	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
 	if (pasid == INVALID_IOASID) {
 		ret = -ENOSPC;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 331ceee..e40afc0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -60,6 +60,11 @@ module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
 MODULE_PARM_DESC(dma_entry_limit,
 		 "Maximum number of user DMA mappings per container (65535).");
 
+static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
+module_param_named(pasid_quota, pasid_quota, uint, 0644);
+MODULE_PARM_DESC(pasid_quota,
+		 "Quota of user owned PASIDs per vfio-based application (1000).");
+
 struct vfio_iommu {
 	struct list_head	domain_list;
 	struct list_head	iova_list;
@@ -2200,7 +2205,7 @@ static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
 		goto out_unlock;
 	}
 	if (vmm)
-		ret = vfio_mm_pasid_alloc(vmm, min, max);
+		ret = vfio_mm_pasid_alloc(vmm, pasid_quota, min, max);
 	else
 		ret = -EINVAL;
 out_unlock:
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 75f9f7f1..af2ef78 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -106,7 +106,8 @@ struct vfio_mm {
 
 extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
 extern void vfio_mm_put(struct vfio_mm *vmm);
-extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
+extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm,
+				int quota, int min, int max);
 extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
 
 /*
-- 
2.7.4



* [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
  2020-03-22 12:31 ` [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning Liu, Yi L
@ 2020-03-22 12:32 ` Liu, Yi L
  2020-03-30  9:43   ` Tian, Kevin
                     ` (2 more replies)
  2020-03-22 12:32 ` [PATCH v1 4/8] vfio: Check nesting iommu uAPI version Liu, Yi L
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:32 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
so that userspace can do a pre-check before utilizing this feature.
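
A userspace pre-check could look roughly like the sketch below, which
walks the capability chain in the buffer returned by
VFIO_IOMMU_GET_INFO. The header layout matches struct
vfio_info_cap_header in the VFIO uAPI; the nesting capability ID and the
VFIO_IOMMU_PASID_REQS bit are local copies of the definitions proposed
in this patch, and pasid_reqs_supported() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct vfio_info_cap_header in the VFIO uAPI. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;   /* offset of next capability from buffer start, 0 = end */
};

/* Local copies of the definitions proposed in this patch. */
#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING 2
#define VFIO_IOMMU_PASID_REQS             (1u << 0)

struct cap_nesting {
	struct cap_header header;
	uint32_t nesting_capabilities;
};

/*
 * Walk the capability chain in @buf starting at @first_off.
 * Returns 1 if PASID alloc/free is advertised, 0 otherwise
 * (including when the chain is absent or malformed).
 */
static int pasid_reqs_supported(const uint8_t *buf, size_t len, uint32_t first_off)
{
	uint32_t off = first_off;
	size_t hops = 0;

	assert(buf != NULL);
	/* The hop counter bounds the walk even if the chain loops. */
	while (off != 0 && hops++ < len) {
		struct cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return 0;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == VFIO_IOMMU_TYPE1_INFO_CAP_NESTING) {
			struct cap_nesting cap;

			if (len - off < sizeof(cap))
				return 0;
			memcpy(&cap, buf + off, sizeof(cap));
			return !!(cap.nesting_capabilities & VFIO_IOMMU_PASID_REQS);
		}
		off = hdr.next;
	}
	return 0;
}
```

Note that with this patch the nesting capability is always present in
the chain, so userspace has to test the VFIO_IOMMU_PASID_REQS bit rather
than the mere presence of the capability.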

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  8 ++++++++
 2 files changed, 36 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e40afc0..ddd1ffe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
+					 struct vfio_info_cap *caps)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
+
+	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
+				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	nesting_cap = container_of(header,
+				struct vfio_iommu_type1_info_cap_nesting,
+				header);
+
+	nesting_cap->nesting_capabilities = 0;
+	if (iommu->nesting) {
+		/* nesting iommu type supports PASID requests (alloc/free) */
+		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
+	}
+
+	return 0;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		if (ret)
 			return ret;
 
+		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
+		if (ret)
+			return ret;
+
 		if (caps.size) {
 			info.flags |= VFIO_IOMMU_INFO_CAPS;
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 298ac80..8837219 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
 	struct	vfio_iova_range iova_ranges[];
 };
 
+#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
+
+struct vfio_iommu_type1_info_cap_nesting {
+	struct	vfio_info_cap_header header;
+#define VFIO_IOMMU_PASID_REQS	(1 << 0)
+	__u32	nesting_capabilities;
+};
+
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
 
 /**
-- 
2.7.4



* [PATCH v1 4/8] vfio: Check nesting iommu uAPI version
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
                   ` (2 preceding siblings ...)
  2020-03-22 12:32 ` [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace Liu, Yi L
@ 2020-03-22 12:32 ` Liu, Yi L
  2020-03-22 18:30   ` kbuild test robot
  2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:32 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

In the Linux kernel, the IOMMU nesting translation (a.k.a. dual-stage address
translation) capability is abstracted in uapi/iommu.h, in which uAPIs
like bind_gpasid/iommu_cache_invalidate/fault_report/pgreq_resp are defined.

VFIO_TYPE1_NESTING_IOMMU stands for the vfio iommu type which is backed by
a hardware IOMMU with dual-stage translation capability. For this vfio iommu
type, userspace is able to set up dual-stage DMA translation on the host side
via VFIO's ABI. However, such VFIO ABIs rely on the uAPIs defined in
uapi/iommu.h. So VFIO needs to provide an API to userspace for the
uapi/iommu.h version check to ensure iommu uAPI compatibility.

This patch reports the iommu uAPI version to userspace via the
VFIO_CHECK_EXTENSION ioctl. Applications can do a version check before
setting up dual-stage translation in the host IOMMU.
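
From the application side, the check could be sketched as follows.
VFIO_CHECK_EXTENSION is the existing VFIO container ioctl; the extension
number 9 is the value proposed in this patch and is not in mainline
headers. What counts as a "compatible" version is not defined here, so
the exact-match policy in nesting_uapi_ok() is purely an assumption.

```c
#include <assert.h>
#include <sys/ioctl.h>

/* Value from the VFIO uAPI header (VFIO_TYPE is ';', VFIO_BASE is 100). */
#ifndef VFIO_CHECK_EXTENSION
#define VFIO_CHECK_EXTENSION _IO(';', 100 + 1)
#endif
/* Extension number proposed in this patch; not in mainline headers. */
#define VFIO_NESTING_IOMMU_UAPI 9

/*
 * Ask the kernel which IOMMU uAPI version backs the nesting ABI.
 * Returns the reported version (0 when the extension is not known),
 * or -1 if the ioctl itself fails.
 */
static int nesting_uapi_version(int container_fd)
{
	return ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_NESTING_IOMMU_UAPI);
}

/* Hypothetical policy: require the exact version userspace was built for. */
static int nesting_uapi_ok(int container_fd, int built_for)
{
	int v = nesting_uapi_version(container_fd);

	assert(built_for > 0);
	return v > 0 && v == built_for;
}
```

An application would call nesting_uapi_ok() on its container fd before
issuing any of the nesting-related ioctls added later in the series.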

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 2 ++
 include/uapi/linux/vfio.h       | 9 +++++++++
 2 files changed, 11 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ddd1ffe..9aa2a67 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2274,6 +2274,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 			if (!iommu)
 				return 0;
 			return vfio_domains_have_iommu_cache(iommu);
+		case VFIO_NESTING_IOMMU_UAPI:
+			return iommu_get_uapi_version();
 		default:
 			return 0;
 		}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8837219..ed9881d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -47,6 +47,15 @@
 #define VFIO_NOIOMMU_IOMMU		8
 
 /*
+ * Hardware IOMMUs with two-stage translation capability give userspace
+ * the ownership of stage-1 translation structures (e.g. page tables).
+ * VFIO exposes the two-stage IOMMU programming capability to userspace
+ * based on the IOMMU UAPIs. Therefore user of VFIO_TYPE1_NESTING should
+ * check the IOMMU UAPI version compatibility.
+ */
+#define VFIO_NESTING_IOMMU_UAPI		9
+
+/*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
  * kernel and userspace.  We therefore use the _IO() macro for these
-- 
2.7.4



* [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
                   ` (3 preceding siblings ...)
  2020-03-22 12:32 ` [PATCH v1 4/8] vfio: Check nesting iommu uAPI version Liu, Yi L
@ 2020-03-22 12:32 ` Liu, Yi L
  2020-03-22 16:44   ` kbuild test robot
                     ` (3 more replies)
  2020-03-22 12:32 ` [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Liu, Yi L
                   ` (3 subsequent siblings)
  8 siblings, 4 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:32 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

VFIO exposes the IOMMU nesting translation (a.k.a. dual-stage translation)
capability to userspace, so applications like QEMU can support a vIOMMU
using the hardware's nesting translation capability for pass-through
devices. Before setting up nesting translation for pass-through devices,
QEMU and other applications need to learn the supported 1st-level/stage-1
translation structure format, such as the page table format.

Take vSVA (virtual Shared Virtual Addressing) as an example: to support
vSVA for pass-through devices, QEMU sets up nesting translation for those
devices. The guest page tables are configured on the host as the
1st-level/stage-1 page tables, so the guest format must be compatible
with the host side.

This patch reports the 1st-level/stage-1 page table format supported on
the current platform to userspace. QEMU and similar applications should
use this format info when setting up IOMMU nesting translation on the
host IOMMU.
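
As an illustration of the rule this patch enforces (every domain in the
container must report the same non-zero stage-1 format), here is a
minimal standalone sketch in plain C. It is not the kernel code:
sketch_common_stage1_format() and SKETCH_FORMAT_INTEL_VTD are
hypothetical names, and the formats are passed in as a plain array
instead of being queried from each domain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a non-zero format ID such as the VT-d one. */
#define SKETCH_FORMAT_INTEL_VTD 1u

/*
 * Store the common format in *out and return 0 when every entry of
 * formats[] is non-zero and identical. Return -1 when the list is
 * empty, contains an unknown (zero) format, or mixes formats. *out is
 * left untouched on failure.
 */
int sketch_common_stage1_format(const uint32_t *formats, size_t n,
				uint32_t *out)
{
	size_t i;

	assert(out != NULL);
	if (formats == NULL || n == 0)
		return -1;	/* no domain attached yet */

	for (i = 0; i < n; i++) {
		if (formats[i] == 0 || formats[i] != formats[0])
			return -1;	/* unknown or mixed formats */
	}

	*out = formats[0];
	return 0;
}
```

A caller would pass one format value per attached domain and refuse to
set up nesting translation when the function returns -1.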

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  1 +
 2 files changed, 57 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 9aa2a67..82a9e0b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
+					 u32 *stage1_format)
+{
+	struct vfio_domain *domain;
+	u32 format = 0, tmp_format = 0;
+	int ret;
+
+	mutex_lock(&iommu->lock);
+	if (list_empty(&iommu->domain_list)) {
+		mutex_unlock(&iommu->lock);
+		return -EINVAL;
+	}
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (iommu_domain_get_attr(domain->domain,
+			DOMAIN_ATTR_PASID_FORMAT, &format)) {
+			ret = -EINVAL;
+			format = 0;
+			goto out_unlock;
+		}
+		/*
+		 * format is always non-zero (the first format is
+		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
+		 * the reason of potential different backed IOMMU
+		 * formats, here we expect to have identical formats
+		 * in the domain list, no mixed formats support.
+		 * return -EINVAL to fail the attempt of setup
+		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
+		 * are detected.
+		 */
+		if (tmp_format && tmp_format != format) {
+			ret = -EINVAL;
+			format = 0;
+			goto out_unlock;
+		}
+
+		tmp_format = format;
+	}
+	ret = 0;
+
+out_unlock:
+	if (format)
+		*stage1_format = format;
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
 					 struct vfio_info_cap *caps)
 {
 	struct vfio_info_cap_header *header;
 	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
+	u32 formats = 0;
+	int ret;
+
+	ret = vfio_iommu_get_stage1_format(iommu, &formats);
+	if (ret) {
+		pr_warn("Failed to get stage-1 format\n");
+		return ret;
+	}
 
 	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
 				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
@@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
 		/* nesting iommu type supports PASID requests (alloc/free) */
 		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
 	}
+	nesting_cap->stage1_formats = formats;
 
 	return 0;
 }
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ed9881d..ebeaf3e 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
 	struct	vfio_info_cap_header header;
 #define VFIO_IOMMU_PASID_REQS	(1 << 0)
 	__u32	nesting_capabilities;
+	__u32	stage1_formats;
 };
 
 #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
-- 
2.7.4



* [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
                   ` (4 preceding siblings ...)
  2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
@ 2020-03-22 12:32 ` Liu, Yi L
  2020-03-22 18:10   ` kbuild test robot
                     ` (2 more replies)
  2020-03-22 12:32 ` [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE Liu, Yi L
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:32 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

VFIO_TYPE1_NESTING_IOMMU is an IOMMU type backed by hardware IOMMUs that
support nesting DMA translation (a.k.a. dual-stage address translation).
Such hardware IOMMUs have two stages/levels of address translation, and
software may let userspace or a VM own the first-level/stage-1
translation structures. An example of such usage is vSVA (virtual Shared
Virtual Addressing): the VM owns the first-level/stage-1 translation
structures and binds them to the host, and the hardware IOMMU then uses
nesting translation when doing DMA translation for the devices behind
it.

This patch adds VFIO support for binding the guest translation (a.k.a.
stage-1) structures to the host IOMMU. For VFIO_TYPE1_NESTING_IOMMU,
binding the guest page table is not enough: an interface must also be
exposed to the guest for IOMMU cache invalidation, because the hardware
needs to be notified to flush stale IOTLB entries when the guest modifies
the first-level/stage-1 translation structures. That interface is
introduced in the next patch.

In this patch, guest page table bind and unbind are done with the flags
VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL under the
VFIO_IOMMU_BIND ioctl; the bind/unbind data are conveyed by
struct iommu_gpasid_bind_data. Before binding a guest page table to the
host, the VM should have obtained a PASID allocated by the host via
VFIO_IOMMU_PASID_REQUEST.

Binding the guest translation structures (here, the guest page table) to
the host is the first step in setting up vSVA (Virtual Shared Virtual
Addressing).
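
To make the argument layout concrete, the sketch below shows one way a
caller could pack the ioctl buffer: a fixed header (argsz, flags)
followed directly by a payload whose first word is a version. All
sketch_* names are hypothetical, and struct sketch_bind_data_v1 is only
a placeholder that does not mirror the real struct
iommu_gpasid_bind_data layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag values, modelled on the bind/unbind flags. */
#define SKETCH_BIND_GUEST_PGTBL		(1u << 0)
#define SKETCH_UNBIND_GUEST_PGTBL	(1u << 1)

/* Fixed header that precedes the variable-size payload. */
struct sketch_bind_hdr {
	uint32_t argsz;	/* total size: header plus payload */
	uint32_t flags;
};

/* Placeholder payload; only "version comes first" matters here. */
struct sketch_bind_data_v1 {
	uint32_t version;
	uint32_t format;
	uint64_t gpgd;
	uint64_t hpasid;
	uint64_t gpasid;
};

/* Pack header and payload into buf. Returns bytes written, 0 if too small. */
size_t sketch_pack_bind(void *buf, size_t len, uint32_t flags,
			const struct sketch_bind_data_v1 *data)
{
	struct sketch_bind_hdr hdr;
	size_t total = sizeof(hdr) + sizeof(*data);

	assert(buf != NULL && data != NULL);
	if (len < total)
		return 0;

	hdr.argsz = (uint32_t)total;
	hdr.flags = flags;
	memcpy(buf, &hdr, sizeof(hdr));
	memcpy((unsigned char *)buf + sizeof(hdr), data, sizeof(*data));
	return total;
}

/* Read the version word stored right after the header. 0 on success. */
int sketch_peek_version(const void *buf, size_t len, uint32_t *version)
{
	struct sketch_bind_hdr hdr;
	size_t need = sizeof(hdr) + sizeof(*version);

	assert(buf != NULL && version != NULL);
	if (len < need)
		return -1;

	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.argsz < need || hdr.argsz > len)
		return -1;	/* argsz does not cover the version word */

	memcpy(version, (const unsigned char *)buf + sizeof(hdr),
	       sizeof(*version));
	return 0;
}
```

Unlike the ioctl handler in this patch, the sketch also checks argsz
against the space the version word needs; that check is an assumption
about what a defensive parser might do, not a description of the patch.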

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       |  46 ++++++++++++
 2 files changed, 204 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 82a9e0b..a877747 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -130,6 +130,33 @@ struct vfio_regions {
 #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
 					(!list_empty(&iommu->domain_list))
 
+struct domain_capsule {
+	struct iommu_domain *domain;
+	void *data;
+};
+
+/* iommu->lock must be held */
+static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
+		      int (*fn)(struct device *dev, void *data),
+		      void *data)
+{
+	struct domain_capsule dc = {.data = data};
+	struct vfio_domain *d;
+	struct vfio_group *g;
+	int ret = 0;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		dc.domain = d->domain;
+		list_for_each_entry(g, &d->group_list, next) {
+			ret = iommu_group_for_each_dev(g->iommu_group,
+						       &dc, fn);
+			if (ret)
+				return ret;
+		}
+	}
+	return ret;
+}
+
 static int put_pfn(unsigned long pfn, int prot);
 
 /*
@@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
 	return 0;
 }
 
+static int vfio_bind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct iommu_gpasid_bind_data *gbind_data =
+		(struct iommu_gpasid_bind_data *) dc->data;
+
+	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
+}
+
+static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct iommu_gpasid_bind_data *gbind_data =
+		(struct iommu_gpasid_bind_data *) dc->data;
+
+	return iommu_sva_unbind_gpasid(dc->domain, dev,
+					gbind_data->hpasid);
+}
+
+/**
+ * Unbind a specific gpasid. The caller of this function must hold
+ * vfio_iommu->lock.
+ */
+static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
+				struct iommu_gpasid_bind_data *gbind_data)
+{
+	return vfio_iommu_for_each_dev(iommu,
+				vfio_unbind_gpasid_fn, gbind_data);
+}
+
+static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
+				struct iommu_gpasid_bind_data *gbind_data)
+{
+	int ret = 0;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = vfio_iommu_for_each_dev(iommu,
+			vfio_bind_gpasid_fn, gbind_data);
+	/*
+	 * If bind failed, it may not be a total failure. Some devices
+	 * within the iommu group may have been bound successfully.
+	 * Although we don't enable the pasid capability for non-singleton
+	 * iommu groups, an unbind operation helps to ensure there is no
+	 * partial binding for an iommu group.
+	 */
+	if (ret)
+		/*
+		 * Undo all binds that already succeeded. There is no
+		 * need to check the return value here, since some
+		 * devices within the group have no successful bind by
+		 * the time we get here.
+		 */
+		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
+				struct iommu_gpasid_bind_data *gbind_data)
+{
+	int ret = 0;
+
+	mutex_lock(&iommu->lock);
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
+
+out_unlock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		default:
 			return -EINVAL;
 		}
+
+	} else if (cmd == VFIO_IOMMU_BIND) {
+		struct vfio_iommu_type1_bind bind;
+		u32 version;
+		int data_size;
+		void *gbind_data;
+		int ret;
+
+		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
+
+		if (copy_from_user(&bind, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (bind.argsz < minsz)
+			return -EINVAL;
+
+		/* Get the version of struct iommu_gpasid_bind_data */
+		if (copy_from_user(&version,
+			(void __user *) (arg + minsz),
+					sizeof(version)))
+			return -EFAULT;
+
+		data_size = iommu_uapi_get_data_size(
+				IOMMU_UAPI_BIND_GPASID, version);
+		gbind_data = kzalloc(data_size, GFP_KERNEL);
+		if (!gbind_data)
+			return -ENOMEM;
+
+		if (copy_from_user(gbind_data,
+			 (void __user *) (arg + minsz), data_size)) {
+			kfree(gbind_data);
+			return -EFAULT;
+		}
+
+		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
+		case VFIO_IOMMU_BIND_GUEST_PGTBL:
+			ret = vfio_iommu_type1_bind_gpasid(iommu,
+							   gbind_data);
+			break;
+		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
+			ret = vfio_iommu_type1_unbind_gpasid(iommu,
+							     gbind_data);
+			break;
+		default:
+			ret = -EINVAL;
+			break;
+		}
+		kfree(gbind_data);
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ebeaf3e..2235bc6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@
 
 #include <linux/types.h>
 #include <linux/ioctl.h>
+#include <linux/iommu.h>
 
 #define VFIO_API_VERSION	0
 
@@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
  */
 #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
 
+/**
+ * Supported flags:
+ *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
+ *			nesting type IOMMUs. The @data field takes a struct
+ *			iommu_gpasid_bind_data.
+ *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
+ *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
+ *
+ */
+struct vfio_iommu_type1_bind {
+	__u32		argsz;
+	__u32		flags;
+#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
+#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
+	__u8		data[];
+};
+
+#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
+					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
+
+/**
+ * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
+ *				struct vfio_iommu_type1_bind)
+ *
+ * Manage address spaces of devices in this container. Initially a TYPE1
+ * container can only have one address space, managed with
+ * VFIO_IOMMU_MAP/UNMAP_DMA.
+ *
+ * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
+ * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
+ * tables, and BIND manages the stage-1 (guest) page tables. Other types of
+ * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
+ * the traffic that requires only single-stage translation while BIND controls
+ * the traffic that requires nesting translation. But this depends on the
+ * underlying IOMMU architecture and isn't guaranteed. An example is guest SVA
+ * traffic, which needs nesting translation to perform gVA->gPA and then
+ * gPA->hPA translation.
+ *
+ * Availability of this feature depends on the device, its bus, the underlying
+ * IOMMU and the CPU architecture.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



* [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
                   ` (5 preceding siblings ...)
  2020-03-22 12:32 ` [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Liu, Yi L
@ 2020-03-22 12:32 ` Liu, Yi L
  2020-03-30 12:58   ` Tian, Kevin
                     ` (2 more replies)
  2020-03-22 12:32 ` [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu, Yi L
  2020-03-26 12:56 ` [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
  8 siblings, 3 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:32 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@linux.intel.com>

For VFIO IOMMUs of type VFIO_TYPE1_NESTING_IOMMU, the guest "owns" the
first-level/stage-1 translation structures, so the host IOMMU driver has
no knowledge of first-level/stage-1 structure cache updates unless the
guest invalidation requests are trapped and propagated to the host.

This patch adds a new ioctl, VFIO_IOMMU_CACHE_INVALIDATE, to propagate
guest first-level/stage-1 IOMMU cache invalidations to the host and so
ensure IOMMU cache correctness.

With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
safely because the correctness of the host IOMMU IOTLB is ensured.
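
The invalidation payload is versioned, so its size is only known once
the version word has been read. The sketch below shows one way such a
version-to-size lookup and the matching argsz check could look. The
names, the size table and the error handling are all assumptions made
for illustration; they are not the behaviour of
iommu_uapi_get_data_size(), which comes from a separate series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_VERSION 2u

/* Hypothetical payload sizes in bytes, indexed by (version - 1). */
static const size_t sketch_sizes[SKETCH_MAX_VERSION] = { 32, 48 };

/* Return the payload size for a version, or 0 if the version is unknown. */
size_t sketch_data_size(uint32_t version)
{
	if (version == 0 || version > SKETCH_MAX_VERSION)
		return 0;
	return sketch_sizes[version - 1];
}

/*
 * Return 0 when argsz covers the header plus the payload implied by
 * version, and -1 when the version is unknown or argsz is too small.
 */
int sketch_check_argsz(uint32_t argsz, size_t hdr_size, uint32_t version)
{
	size_t need = sketch_data_size(version);

	assert(hdr_size > 0);
	if (need == 0)
		return -1;	/* reject before allocating anything */
	if (argsz < hdr_size + need)
		return -1;	/* caller's buffer is shorter than claimed */
	return 0;
}
```

Rejecting an unknown version before allocating or copying keeps a bad
version word from turning into a bogus allocation size.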

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a877747..937ec3f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -2423,6 +2423,15 @@ static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
 	return ret;
 }
 
+static int vfio_cache_inv_fn(struct device *dev, void *data)
+{
+	struct domain_capsule *dc = (struct domain_capsule *)data;
+	struct iommu_cache_invalidate_info *cache_inv_info =
+		(struct iommu_cache_invalidate_info *) dc->data;
+
+	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		}
 		kfree(gbind_data);
 		return ret;
+	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
+		struct vfio_iommu_type1_cache_invalidate cache_inv;
+		u32 version;
+		int info_size;
+		void *cache_info;
+		int ret;
+
+		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
+				    flags);
+
+		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (cache_inv.argsz < minsz || cache_inv.flags)
+			return -EINVAL;
+
+		/* Get the version of struct iommu_cache_invalidate_info */
+		if (copy_from_user(&version,
+			(void __user *) (arg + minsz), sizeof(version)))
+			return -EFAULT;
+
+		info_size = iommu_uapi_get_data_size(
+					IOMMU_UAPI_CACHE_INVAL, version);
+
+		cache_info = kzalloc(info_size, GFP_KERNEL);
+		if (!cache_info)
+			return -ENOMEM;
+
+		if (copy_from_user(cache_info,
+			(void __user *) (arg + minsz), info_size)) {
+			kfree(cache_info);
+			return -EFAULT;
+		}
+
+		mutex_lock(&iommu->lock);
+		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
+					    cache_info);
+		mutex_unlock(&iommu->lock);
+		kfree(cache_info);
+		return ret;
 	}
 
 	return -ENOTTY;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2235bc6..62ca791 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
  */
 #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
 
+/**
+ * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
+ *			struct vfio_iommu_type1_cache_invalidate)
+ *
+ * Propagate guest IOMMU cache invalidation to the host. The cache
+ * invalidation information is conveyed by @cache_info, whose content
+ * format is one of the structures defined in uapi/linux/iommu.h. Users
+ * should be aware that struct iommu_cache_invalidate_info has a
+ * @version field, which vfio needs to parse before getting the
+ * data from userspace.
+ *
+ * This IOCTL is available after VFIO_SET_IOMMU.
+ *
+ * returns: 0 on success, -errno on failure.
+ */
+struct vfio_iommu_type1_cache_invalidate {
+	__u32   argsz;
+	__u32   flags;
+	struct	iommu_cache_invalidate_info cache_info;
+};
+#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
+
 /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
 
 /*
-- 
2.7.4



* [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
                   ` (6 preceding siblings ...)
  2020-03-22 12:32 ` [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE Liu, Yi L
@ 2020-03-22 12:32 ` Liu, Yi L
  2020-03-30 13:18   ` Tian, Kevin
  2020-04-02 20:33   ` Alex Williamson
  2020-03-26 12:56 ` [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
  8 siblings, 2 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-22 12:32 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian,
	yi.y.sun, jean-philippe, peterx, iommu, kvm, linux-kernel,
	hao.wu

From: Liu Yi L <yi.l.liu@intel.com>

In recent years, mediated device pass-through frameworks (e.g. vfio-mdev)
have been used to achieve flexible device sharing across domains (e.g.
VMs). There are also hardware-assisted mediated pass-through solutions
from platform vendors, e.g. Intel VT-d scalable mode, which supports
Intel Scalable I/O Virtualization technology. Such mdevs are called
IOMMU-backed mdevs because the IOMMU enforces DMA isolation for them.
In the kernel, IOMMU-backed mdevs are exposed to the IOMMU layer through
the aux-domain concept, which means an mdev is protected by an IOMMU
domain that is an aux-domain of its physical device. Details can be found
in the KVM presentation from Kevin Tian. "IOMMU-backed" is equivalent to
"IOMMU-capable".

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf

This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
out the physical device of an IOMMU-backed mdev and then issuing IOMMU
requests to the IOMMU layer with the physical device and the mdev's aux
domain info.

With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
on IOMMU-backed mdevs.
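
The device selection described above can be sketched in isolation as
follows. The types and the sketch_iommu_target() helper are
hypothetical stand-ins, not kernel structures; in the patch itself the
lookup is done with vfio_mdev_get_iommu_device().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device: an mdev points at its physical parent. */
struct sketch_device {
	const char *name;
	struct sketch_device *iommu_device;	/* physical parent of an mdev */
};

/* Hypothetical group: only records whether it is an mdev group. */
struct sketch_group {
	bool mdev_group;
};

/*
 * Pick the device that IOMMU requests should target: the physical
 * parent for a device in an mdev group, the device itself otherwise.
 */
struct sketch_device *sketch_iommu_target(const struct sketch_group *group,
					  struct sketch_device *dev)
{
	assert(group != NULL && dev != NULL);
	return group->mdev_group ? dev->iommu_device : dev;
}
```

Resolving the target once in a helper like this would let the bind,
unbind and cache-invalidate callbacks share a single call site each
instead of repeating the if/else in every callback.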

Cc: Kevin Tian <kevin.tian@intel.com>
CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
CC: Jun Tian <jun.j.tian@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 937ec3f..d473665 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -132,6 +132,7 @@ struct vfio_regions {
 
 struct domain_capsule {
 	struct iommu_domain *domain;
+	struct vfio_group *group;
 	void *data;
 };
 
@@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
 	list_for_each_entry(d, &iommu->domain_list, next) {
 		dc.domain = d->domain;
 		list_for_each_entry(g, &d->group_list, next) {
+			dc.group = g;
 			ret = iommu_group_for_each_dev(g->iommu_group,
 						       &dc, fn);
 			if (ret)
@@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev, void *data)
 	struct iommu_gpasid_bind_data *gbind_data =
 		(struct iommu_gpasid_bind_data *) dc->data;
 
-	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
+	if (dc->group->mdev_group)
+		return iommu_sva_bind_gpasid(dc->domain,
+			vfio_mdev_get_iommu_device(dev), gbind_data);
+	else
+		return iommu_sva_bind_gpasid(dc->domain,
+						dev, gbind_data);
 }
 
 static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
@@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
 	struct iommu_gpasid_bind_data *gbind_data =
 		(struct iommu_gpasid_bind_data *) dc->data;
 
-	return iommu_sva_unbind_gpasid(dc->domain, dev,
+	if (dc->group->mdev_group)
+		return iommu_sva_unbind_gpasid(dc->domain,
+					vfio_mdev_get_iommu_device(dev),
 					gbind_data->hpasid);
+	else
+		return iommu_sva_unbind_gpasid(dc->domain, dev,
+						gbind_data->hpasid);
 }
 
 /**
@@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
 	struct iommu_cache_invalidate_info *cache_inv_info =
 		(struct iommu_cache_invalidate_info *) dc->data;
 
-	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
+	if (dc->group->mdev_group)
+		return iommu_cache_invalidate(dc->domain,
+			vfio_mdev_get_iommu_device(dev), cache_inv_info);
+	else
+		return iommu_cache_invalidate(dc->domain,
+						dev, cache_inv_info);
 }
 
 static long vfio_iommu_type1_ioctl(void *iommu_data,
-- 
2.7.4



* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
@ 2020-03-22 16:21   ` kbuild test robot
  2020-03-30  8:32   ` Tian, Kevin
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: kbuild test robot @ 2020-03-22 16:21 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kbuild-all, alex.williamson, eric.auger, kevin.tian,
	jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

[-- Attachment #1: Type: text/plain, Size: 4623 bytes --]

Hi Yi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.6-rc6 next-20200320]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio-expose-virtual-Shared-Virtual-Addressing-to-VMs/20200322-213259
base:   https://github.com/awilliam/linux-vfio.git next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/vfio/vfio.c: In function 'vfio_create_mm':
   drivers/vfio/vfio.c:2149:8: error: implicit declaration of function 'ioasid_alloc_set'; did you mean 'ioasid_alloc'? [-Werror=implicit-function-declaration]
    2149 |  ret = ioasid_alloc_set((struct ioasid_set *) mm,
         |        ^~~~~~~~~~~~~~~~
         |        ioasid_alloc
>> drivers/vfio/vfio.c:2158:13: warning: assignment to 'long long unsigned int' from 'struct mm_struct *' makes integer from pointer without a cast [-Wint-conversion]
    2158 |  token->val = mm;
         |             ^
   drivers/vfio/vfio.c: In function 'vfio_mm_unlock_and_free':
   drivers/vfio/vfio.c:2170:2: error: implicit declaration of function 'ioasid_free_set'; did you mean 'ioasid_free'? [-Werror=implicit-function-declaration]
    2170 |  ioasid_free_set(vmm->ioasid_sid, true);
         |  ^~~~~~~~~~~~~~~
         |  ioasid_free
   drivers/vfio/vfio.c: In function 'vfio_mm_pasid_alloc':
   drivers/vfio/vfio.c:2227:26: warning: passing argument 1 of 'ioasid_alloc' makes pointer from integer without a cast [-Wint-conversion]
    2227 |  pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
         |                       ~~~^~~~~~~~~~~~
         |                          |
         |                          int
   In file included from include/linux/iommu.h:16,
                    from drivers/vfio/vfio.c:20:
   include/linux/ioasid.h:45:56: note: expected 'struct ioasid_set *' but argument is of type 'int'
      45 | static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
         |                                     ~~~~~~~~~~~~~~~~~~~^~~
   drivers/vfio/vfio.c: In function 'vfio_mm_pasid_free':
   drivers/vfio/vfio.c:2246:25: warning: passing argument 1 of 'ioasid_find' makes pointer from integer without a cast [-Wint-conversion]
    2246 |  pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
         |                      ~~~^~~~~~~~~~~~
         |                         |
         |                         int
   In file included from include/linux/iommu.h:16,
                    from drivers/vfio/vfio.c:20:
   include/linux/ioasid.h:55:52: note: expected 'struct ioasid_set *' but argument is of type 'int'
      55 | static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
         |                                 ~~~~~~~~~~~~~~~~~~~^~~
   cc1: some warnings being treated as errors

vim +2158 drivers/vfio/vfio.c

  2133	
  2134	/**
  2135	 * VFIO_MM objects - create, release, get, put, search
  2136	 * Caller of the function should have held vfio.vfio_mm_lock.
  2137	 */
  2138	static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
  2139	{
  2140		struct vfio_mm *vmm;
  2141		struct vfio_mm_token *token;
  2142		int ret = 0;
  2143	
  2144		vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
  2145		if (!vmm)
  2146			return ERR_PTR(-ENOMEM);
  2147	
  2148		/* Per mm IOASID set used for quota control and group operations */
  2149		ret = ioasid_alloc_set((struct ioasid_set *) mm,
  2150				       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
  2151		if (ret) {
  2152			kfree(vmm);
  2153			return ERR_PTR(ret);
  2154		}
  2155	
  2156		kref_init(&vmm->kref);
  2157		token = &vmm->token;
> 2158		token->val = mm;
  2159		vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
  2160		mutex_init(&vmm->pasid_lock);
  2161	
  2162		list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
  2163	
  2164		return vmm;
  2165	}
  2166	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46471 bytes --]


* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
@ 2020-03-22 16:44   ` kbuild test robot
  2020-03-30 11:48   ` Tian, Kevin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 110+ messages in thread
From: kbuild test robot @ 2020-03-22 16:44 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kbuild-all, alex.williamson, eric.auger, kevin.tian,
	jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

[-- Attachment #1: Type: text/plain, Size: 3463 bytes --]

Hi Yi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on v5.6-rc6 next-20200320]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio-expose-virtual-Shared-Virtual-Addressing-to-VMs/20200322-213259
base:   https://github.com/awilliam/linux-vfio.git next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/vfio/vfio_iommu_type1.c: In function 'vfio_iommu_get_stage1_format':
>> drivers/vfio/vfio_iommu_type1.c:2273:4: error: 'DOMAIN_ATTR_PASID_FORMAT' undeclared (first use in this function)
    2273 |    DOMAIN_ATTR_PASID_FORMAT, &format)) {
         |    ^~~~~~~~~~~~~~~~~~~~~~~~
   drivers/vfio/vfio_iommu_type1.c:2273:4: note: each undeclared identifier is reported only once for each function it appears in
   drivers/vfio/vfio_iommu_type1.c: In function 'vfio_iommu_type1_ioctl':
   drivers/vfio/vfio_iommu_type1.c:2355:11: error: implicit declaration of function 'iommu_get_uapi_version' [-Werror=implicit-function-declaration]
    2355 |    return iommu_get_uapi_version();
         |           ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/DOMAIN_ATTR_PASID_FORMAT +2273 drivers/vfio/vfio_iommu_type1.c

  2257	
  2258	static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
  2259						 u32 *stage1_format)
  2260	{
  2261		struct vfio_domain *domain;
  2262		u32 format = 0, tmp_format = 0;
  2263		int ret;
  2264	
  2265		mutex_lock(&iommu->lock);
  2266		if (list_empty(&iommu->domain_list)) {
  2267			mutex_unlock(&iommu->lock);
  2268			return -EINVAL;
  2269		}
  2270	
  2271		list_for_each_entry(domain, &iommu->domain_list, next) {
  2272			if (iommu_domain_get_attr(domain->domain,
> 2273				DOMAIN_ATTR_PASID_FORMAT, &format)) {
  2274				ret = -EINVAL;
  2275				format = 0;
  2276				goto out_unlock;
  2277			}
  2278			/*
  2279			 * format is always non-zero (the first format is
  2280			 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
  2281			 * the reason of potential different backed IOMMU
  2282			 * formats, here we expect to have identical formats
  2283			 * in the domain list, no mixed formats support.
  2284			 * return -EINVAL to fail the attempt of setup
  2285			 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
  2286			 * are detected.
  2287			 */
  2288			if (tmp_format && tmp_format != format) {
  2289				ret = -EINVAL;
  2290				format = 0;
  2291				goto out_unlock;
  2292			}
  2293	
  2294			tmp_format = format;
  2295		}
  2296		ret = 0;
  2297	
  2298	out_unlock:
  2299		if (format)
  2300			*stage1_format = format;
  2301		mutex_unlock(&iommu->lock);
  2302		return ret;
  2303	}
  2304	
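
The loop in the listing above enforces one rule: every domain in the list must report the same non-zero stage-1 format, otherwise the setup fails with -EINVAL. A minimal userspace sketch of that rule follows. It is an illustration only: `get_common_format()` and its array argument are hypothetical stand-ins for the domain-list walk and for the value returned by iommu_domain_get_attr(), and nothing here is taken from the series itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the domain-list walk: formats[] plays the role
 * of the per-domain format values. A format of 0 is treated as invalid,
 * following the comment in the listing ("format is always non-zero").
 */
int get_common_format(const uint32_t *formats, size_t n, uint32_t *out)
{
	uint32_t first;
	size_t i;

	assert(out != NULL);

	if (formats == NULL || n == 0)
		return -EINVAL;		/* empty domain list */

	first = formats[0];
	if (first == 0)
		return -EINVAL;		/* no valid format reported */

	for (i = 1; i < n; i++) {
		if (formats[i] != first)
			return -EINVAL;	/* mixed formats are not supported */
	}

	*out = first;			/* written only on success */
	return 0;
}
```

As in the listing, the output is left untouched on every error path, so a caller can tell "no format" from "format found" by the return value alone.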

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46471 bytes --]

* Re: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-22 12:31 ` [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning Liu, Yi L
@ 2020-03-22 17:20   ` kbuild test robot
  2020-03-30  8:40   ` Tian, Kevin
  1 sibling, 0 replies; 110+ messages in thread
From: kbuild test robot @ 2020-03-22 17:20 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kbuild-all, alex.williamson, eric.auger, kevin.tian,
	jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

[-- Attachment #1: Type: text/plain, Size: 7135 bytes --]

Hi Yi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on v5.6-rc6 next-20200320]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio-expose-virtual-Shared-Virtual-Addressing-to-VMs/20200322-213259
base:   https://github.com/awilliam/linux-vfio.git next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/vfio/vfio.c: In function 'vfio_create_mm':
   drivers/vfio/vfio.c:2149:8: error: implicit declaration of function 'ioasid_alloc_set'; did you mean 'ioasid_alloc'? [-Werror=implicit-function-declaration]
    2149 |  ret = ioasid_alloc_set((struct ioasid_set *) mm,
         |        ^~~~~~~~~~~~~~~~
         |        ioasid_alloc
   drivers/vfio/vfio.c:2158:13: warning: assignment to 'long long unsigned int' from 'struct mm_struct *' makes integer from pointer without a cast [-Wint-conversion]
    2158 |  token->val = mm;
         |             ^
   drivers/vfio/vfio.c: In function 'vfio_mm_unlock_and_free':
   drivers/vfio/vfio.c:2170:2: error: implicit declaration of function 'ioasid_free_set'; did you mean 'ioasid_free'? [-Werror=implicit-function-declaration]
    2170 |  ioasid_free_set(vmm->ioasid_sid, true);
         |  ^~~~~~~~~~~~~~~
         |  ioasid_free
   drivers/vfio/vfio.c: In function 'vfio_mm_pasid_alloc':
>> drivers/vfio/vfio.c:2230:3: error: implicit declaration of function 'ioasid_adjust_set' [-Werror=implicit-function-declaration]
    2230 |   ioasid_adjust_set(vmm->ioasid_sid, quota);
         |   ^~~~~~~~~~~~~~~~~
   drivers/vfio/vfio.c:2233:26: warning: passing argument 1 of 'ioasid_alloc' makes pointer from integer without a cast [-Wint-conversion]
    2233 |  pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
         |                       ~~~^~~~~~~~~~~~
         |                          |
         |                          int
   In file included from include/linux/iommu.h:16,
                    from drivers/vfio/vfio.c:20:
   include/linux/ioasid.h:45:56: note: expected 'struct ioasid_set *' but argument is of type 'int'
      45 | static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
         |                                     ~~~~~~~~~~~~~~~~~~~^~~
   drivers/vfio/vfio.c: In function 'vfio_mm_pasid_free':
   drivers/vfio/vfio.c:2252:25: warning: passing argument 1 of 'ioasid_find' makes pointer from integer without a cast [-Wint-conversion]
    2252 |  pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
         |                      ~~~^~~~~~~~~~~~
         |                         |
         |                         int
   In file included from include/linux/iommu.h:16,
                    from drivers/vfio/vfio.c:20:
   include/linux/ioasid.h:55:52: note: expected 'struct ioasid_set *' but argument is of type 'int'
      55 | static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
         |                                 ~~~~~~~~~~~~~~~~~~~^~~
   cc1: some warnings being treated as errors

vim +/ioasid_adjust_set +2230 drivers/vfio/vfio.c

  2133	
  2134	/**
  2135	 * VFIO_MM objects - create, release, get, put, search
  2136	 * Caller of the function should have held vfio.vfio_mm_lock.
  2137	 */
  2138	static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
  2139	{
  2140		struct vfio_mm *vmm;
  2141		struct vfio_mm_token *token;
  2142		int ret = 0;
  2143	
  2144		vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
  2145		if (!vmm)
  2146			return ERR_PTR(-ENOMEM);
  2147	
  2148		/* Per mm IOASID set used for quota control and group operations */
  2149		ret = ioasid_alloc_set((struct ioasid_set *) mm,
  2150				       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
  2151		if (ret) {
  2152			kfree(vmm);
  2153			return ERR_PTR(ret);
  2154		}
  2155	
  2156		kref_init(&vmm->kref);
  2157		token = &vmm->token;
> 2158		token->val = mm;
  2159		vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
  2160		mutex_init(&vmm->pasid_lock);
  2161	
  2162		list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
  2163	
  2164		return vmm;
  2165	}
  2166	
  2167	static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
  2168	{
  2169		/* destroy the ioasid set */
  2170		ioasid_free_set(vmm->ioasid_sid, true);
  2171		mutex_unlock(&vfio.vfio_mm_lock);
  2172		kfree(vmm);
  2173	}
  2174	
  2175	/* called with vfio.vfio_mm_lock held */
  2176	static void vfio_mm_release(struct kref *kref)
  2177	{
  2178		struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
  2179	
  2180		list_del(&vmm->vfio_next);
  2181		vfio_mm_unlock_and_free(vmm);
  2182	}
  2183	
  2184	void vfio_mm_put(struct vfio_mm *vmm)
  2185	{
  2186		kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
  2187	}
  2188	EXPORT_SYMBOL_GPL(vfio_mm_put);
  2189	
  2190	/* Assume vfio_mm_lock or vfio_mm reference is held */
  2191	static void vfio_mm_get(struct vfio_mm *vmm)
  2192	{
  2193		kref_get(&vmm->kref);
  2194	}
  2195	
  2196	struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
  2197	{
  2198		struct mm_struct *mm = get_task_mm(task);
  2199		struct vfio_mm *vmm;
  2200		unsigned long long val = (unsigned long long) mm;
  2201	
  2202		mutex_lock(&vfio.vfio_mm_lock);
  2203		list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
  2204			if (vmm->token.val == val) {
  2205				vfio_mm_get(vmm);
  2206				goto out;
  2207			}
  2208		}
  2209	
  2210		vmm = vfio_create_mm(mm);
  2211		if (IS_ERR(vmm))
  2212			vmm = NULL;
  2213	out:
  2214		mutex_unlock(&vfio.vfio_mm_lock);
  2215		mmput(mm);
  2216		return vmm;
  2217	}
  2218	EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
  2219	
  2220	int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
  2221	{
  2222		ioasid_t pasid;
  2223		int ret = -ENOSPC;
  2224	
  2225		mutex_lock(&vmm->pasid_lock);
  2226	
  2227		/* update quota as it is tunable by admin */
  2228		if (vmm->pasid_quota != quota) {
  2229			vmm->pasid_quota = quota;
> 2230			ioasid_adjust_set(vmm->ioasid_sid, quota);
  2231		}
  2232	
  2233		pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
  2234		if (pasid == INVALID_IOASID) {
  2235			ret = -ENOSPC;
  2236			goto out_unlock;
  2237		}
  2238	
  2239		ret = pasid;
  2240	out_unlock:
  2241		mutex_unlock(&vmm->pasid_lock);
  2242		return ret;
  2243	}
  2244	EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
  2245	
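
The -Wint-conversion warning at line 2158 comes from assigning a pointer to an integer field without a cast, while line 2200 of the same listing already converts explicitly before comparing. A minimal userspace sketch of doing the conversion consistently on both sides is shown below. `struct mm_token`, `token_set()` and `token_matches()` are hypothetical names for illustration, not the series' definitions; the sketch goes through uintptr_t, whereas kernel code would more likely cast to unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the token that stores the mm pointer value. */
struct mm_token {
	unsigned long long val;
};

/* Store a pointer as an integer token with the conversion spelled out. */
void token_set(struct mm_token *token, const void *mm)
{
	assert(token != NULL);
	token->val = (unsigned long long)(uintptr_t)mm;
}

/* Compare using the same conversion that token_set() applied. */
int token_matches(const struct mm_token *token, const void *mm)
{
	assert(token != NULL);
	return token->val == (unsigned long long)(uintptr_t)mm;
}
```

Keeping the store and the lookup behind the same pair of helpers means the two conversions cannot drift apart.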

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46471 bytes --]

* Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-03-22 12:32 ` [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Liu, Yi L
@ 2020-03-22 18:10   ` kbuild test robot
  2020-03-30 12:46   ` Tian, Kevin
  2020-04-02 19:57   ` Alex Williamson
  2 siblings, 0 replies; 110+ messages in thread
From: kbuild test robot @ 2020-03-22 18:10 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kbuild-all, alex.williamson, eric.auger, kevin.tian,
	jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

[-- Attachment #1: Type: text/plain, Size: 8974 bytes --]

Hi Yi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on v5.6-rc6 next-20200320]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio-expose-virtual-Shared-Virtual-Addressing-to-VMs/20200322-213259
base:   https://github.com/awilliam/linux-vfio.git next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/vfio/vfio_iommu_type1.c: In function 'vfio_iommu_get_stage1_format':
   drivers/vfio/vfio_iommu_type1.c:2300:4: error: 'DOMAIN_ATTR_PASID_FORMAT' undeclared (first use in this function)
    2300 |    DOMAIN_ATTR_PASID_FORMAT, &format)) {
         |    ^~~~~~~~~~~~~~~~~~~~~~~~
   drivers/vfio/vfio_iommu_type1.c:2300:4: note: each undeclared identifier is reported only once for each function it appears in
   drivers/vfio/vfio_iommu_type1.c: In function 'vfio_iommu_type1_ioctl':
   drivers/vfio/vfio_iommu_type1.c:2464:11: error: implicit declaration of function 'iommu_get_uapi_version' [-Werror=implicit-function-declaration]
    2464 |    return iommu_get_uapi_version();
         |           ^~~~~~~~~~~~~~~~~~~~~~
>> drivers/vfio/vfio_iommu_type1.c:2626:15: error: implicit declaration of function 'iommu_uapi_get_data_size' [-Werror=implicit-function-declaration]
    2626 |   data_size = iommu_uapi_get_data_size(
         |               ^~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/vfio/vfio_iommu_type1.c:2627:5: error: 'IOMMU_UAPI_BIND_GPASID' undeclared (first use in this function)
    2627 |     IOMMU_UAPI_BIND_GPASID, version);
         |     ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/iommu_uapi_get_data_size +2626 drivers/vfio/vfio_iommu_type1.c

  2446	
  2447	static long vfio_iommu_type1_ioctl(void *iommu_data,
  2448					   unsigned int cmd, unsigned long arg)
  2449	{
  2450		struct vfio_iommu *iommu = iommu_data;
  2451		unsigned long minsz;
  2452	
  2453		if (cmd == VFIO_CHECK_EXTENSION) {
  2454			switch (arg) {
  2455			case VFIO_TYPE1_IOMMU:
  2456			case VFIO_TYPE1v2_IOMMU:
  2457			case VFIO_TYPE1_NESTING_IOMMU:
  2458				return 1;
  2459			case VFIO_DMA_CC_IOMMU:
  2460				if (!iommu)
  2461					return 0;
  2462				return vfio_domains_have_iommu_cache(iommu);
  2463			case VFIO_NESTING_IOMMU_UAPI:
  2464				return iommu_get_uapi_version();
  2465			default:
  2466				return 0;
  2467			}
  2468		} else if (cmd == VFIO_IOMMU_GET_INFO) {
  2469			struct vfio_iommu_type1_info info;
  2470			struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
  2471			unsigned long capsz;
  2472			int ret;
  2473	
  2474			minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
  2475	
  2476			/* For backward compatibility, cannot require this */
  2477			capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
  2478	
  2479			if (copy_from_user(&info, (void __user *)arg, minsz))
  2480				return -EFAULT;
  2481	
  2482			if (info.argsz < minsz)
  2483				return -EINVAL;
  2484	
  2485			if (info.argsz >= capsz) {
  2486				minsz = capsz;
  2487				info.cap_offset = 0; /* output, no-recopy necessary */
  2488			}
  2489	
  2490			info.flags = VFIO_IOMMU_INFO_PGSIZES;
  2491	
  2492			info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
  2493	
  2494			ret = vfio_iommu_iova_build_caps(iommu, &caps);
  2495			if (ret)
  2496				return ret;
  2497	
  2498			ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
  2499			if (ret)
  2500				return ret;
  2501	
  2502			if (caps.size) {
  2503				info.flags |= VFIO_IOMMU_INFO_CAPS;
  2504	
  2505				if (info.argsz < sizeof(info) + caps.size) {
  2506					info.argsz = sizeof(info) + caps.size;
  2507				} else {
  2508					vfio_info_cap_shift(&caps, sizeof(info));
  2509					if (copy_to_user((void __user *)arg +
  2510							sizeof(info), caps.buf,
  2511							caps.size)) {
  2512						kfree(caps.buf);
  2513						return -EFAULT;
  2514					}
  2515					info.cap_offset = sizeof(info);
  2516				}
  2517	
  2518				kfree(caps.buf);
  2519			}
  2520	
  2521			return copy_to_user((void __user *)arg, &info, minsz) ?
  2522				-EFAULT : 0;
  2523	
  2524		} else if (cmd == VFIO_IOMMU_MAP_DMA) {
  2525			struct vfio_iommu_type1_dma_map map;
  2526			uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
  2527					VFIO_DMA_MAP_FLAG_WRITE;
  2528	
  2529			minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
  2530	
  2531			if (copy_from_user(&map, (void __user *)arg, minsz))
  2532				return -EFAULT;
  2533	
  2534			if (map.argsz < minsz || map.flags & ~mask)
  2535				return -EINVAL;
  2536	
  2537			return vfio_dma_do_map(iommu, &map);
  2538	
  2539		} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
  2540			struct vfio_iommu_type1_dma_unmap unmap;
  2541			long ret;
  2542	
  2543			minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
  2544	
  2545			if (copy_from_user(&unmap, (void __user *)arg, minsz))
  2546				return -EFAULT;
  2547	
  2548			if (unmap.argsz < minsz || unmap.flags)
  2549				return -EINVAL;
  2550	
  2551			ret = vfio_dma_do_unmap(iommu, &unmap);
  2552			if (ret)
  2553				return ret;
  2554	
  2555			return copy_to_user((void __user *)arg, &unmap, minsz) ?
  2556				-EFAULT : 0;
  2557	
  2558		} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
  2559			struct vfio_iommu_type1_pasid_request req;
  2560			unsigned long offset;
  2561	
  2562			minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
  2563					    flags);
  2564	
  2565			if (copy_from_user(&req, (void __user *)arg, minsz))
  2566				return -EFAULT;
  2567	
  2568			if (req.argsz < minsz ||
  2569			    !vfio_iommu_type1_pasid_req_valid(req.flags))
  2570				return -EINVAL;
  2571	
  2572			if (copy_from_user((void *)&req + minsz,
  2573					   (void __user *)arg + minsz,
  2574					   sizeof(req) - minsz))
  2575				return -EFAULT;
  2576	
  2577			switch (req.flags & VFIO_PASID_REQUEST_MASK) {
  2578			case VFIO_IOMMU_PASID_ALLOC:
  2579			{
  2580				int ret = 0, result;
  2581	
  2582				result = vfio_iommu_type1_pasid_alloc(iommu,
  2583								req.alloc_pasid.min,
  2584								req.alloc_pasid.max);
  2585				if (result > 0) {
  2586					offset = offsetof(
  2587						struct vfio_iommu_type1_pasid_request,
  2588						alloc_pasid.result);
  2589					ret = copy_to_user(
  2590						      (void __user *) (arg + offset),
  2591						      &result, sizeof(result));
  2592				} else {
  2593					pr_debug("%s: PASID alloc failed\n", __func__);
  2594					ret = -EFAULT;
  2595				}
  2596				return ret;
  2597			}
  2598			case VFIO_IOMMU_PASID_FREE:
  2599				return vfio_iommu_type1_pasid_free(iommu,
  2600								   req.free_pasid);
  2601			default:
  2602				return -EINVAL;
  2603			}
  2604	
  2605		} else if (cmd == VFIO_IOMMU_BIND) {
  2606			struct vfio_iommu_type1_bind bind;
  2607			u32 version;
  2608			int data_size;
  2609			void *gbind_data;
  2610			int ret;
  2611	
  2612			minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
  2613	
  2614			if (copy_from_user(&bind, (void __user *)arg, minsz))
  2615				return -EFAULT;
  2616	
  2617			if (bind.argsz < minsz)
  2618				return -EINVAL;
  2619	
  2620			/* Get the version of struct iommu_gpasid_bind_data */
  2621			if (copy_from_user(&version,
  2622				(void __user *) (arg + minsz),
  2623						sizeof(version)))
  2624				return -EFAULT;
  2625	
> 2626			data_size = iommu_uapi_get_data_size(
> 2627					IOMMU_UAPI_BIND_GPASID, version);
  2628			gbind_data = kzalloc(data_size, GFP_KERNEL);
  2629			if (!gbind_data)
  2630				return -ENOMEM;
  2631	
  2632			if (copy_from_user(gbind_data,
  2633				 (void __user *) (arg + minsz), data_size)) {
  2634				kfree(gbind_data);
  2635				return -EFAULT;
  2636			}
  2637	
  2638			switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
  2639			case VFIO_IOMMU_BIND_GUEST_PGTBL:
  2640				ret = vfio_iommu_type1_bind_gpasid(iommu,
  2641								   gbind_data);
  2642				break;
  2643			case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
  2644				ret = vfio_iommu_type1_unbind_gpasid(iommu,
  2645								     gbind_data);
  2646				break;
  2647			default:
  2648				ret = -EINVAL;
  2649				break;
  2650			}
  2651			kfree(gbind_data);
  2652			return ret;
  2653		}
  2654	
  2655		return -ENOTTY;
  2656	}
  2657	
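
Several branches of the ioctl above share one pattern: copy the fixed header up to minsz, validate argsz, then copy the remainder. A minimal userspace sketch of that pattern follows, with memcpy() standing in for copy_from_user(). `struct demo_req`, `DEMO_OFFSETOFEND` and `demo_fetch_req()` are hypothetical and do not reflect the series' uAPI layout. The sketch also checks that argsz covers the payload before the second copy, which the PASID request branch in the listing does not do; whether the series needs that check is a review question, not something this report establishes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request layout, for illustration only. */
struct demo_req {
	uint32_t argsz;	/* total size the caller claims to have provided */
	uint32_t flags;
	uint32_t min;
	uint32_t max;
};

#define DEMO_OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * src/src_len model the user buffer; a short buffer models a faulting
 * copy_from_user() and is reported as -EFAULT.
 */
int demo_fetch_req(struct demo_req *req, const void *src, size_t src_len)
{
	size_t minsz = DEMO_OFFSETOFEND(struct demo_req, flags);

	assert(req != NULL && src != NULL);

	/* Stage 1: fixed header only. */
	if (src_len < minsz)
		return -EFAULT;
	memset(req, 0, sizeof(*req));
	memcpy(req, src, minsz);

	if (req->argsz < minsz)
		return -EINVAL;

	/* Extra check (assumption): argsz must also cover the payload. */
	if (req->argsz < sizeof(*req))
		return -EINVAL;

	/* Stage 2: remainder of the structure. */
	if (src_len < sizeof(*req))
		return -EFAULT;
	memcpy((char *)req + minsz, (const char *)src + minsz,
	       sizeof(*req) - minsz);
	return 0;
}
```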

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46471 bytes --]

* Re: [PATCH v1 4/8] vfio: Check nesting iommu uAPI version
  2020-03-22 12:32 ` [PATCH v1 4/8] vfio: Check nesting iommu uAPI version Liu, Yi L
@ 2020-03-22 18:30   ` kbuild test robot
  0 siblings, 0 replies; 110+ messages in thread
From: kbuild test robot @ 2020-03-22 18:30 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kbuild-all, alex.williamson, eric.auger, kevin.tian,
	jacob.jun.pan, joro, ashok.raj, yi.l.liu, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

[-- Attachment #1: Type: text/plain, Size: 6618 bytes --]

Hi Yi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on vfio/next]
[also build test ERROR on v5.6-rc6 next-20200320]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Liu-Yi-L/vfio-expose-virtual-Shared-Virtual-Addressing-to-VMs/20200322-213259
base:   https://github.com/awilliam/linux-vfio.git next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=9.2.0 make.cross ARCH=arm64 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/vfio/vfio_iommu_type1.c: In function 'vfio_iommu_type1_ioctl':
>> drivers/vfio/vfio_iommu_type1.c:2299:11: error: implicit declaration of function 'iommu_get_uapi_version' [-Werror=implicit-function-declaration]
    2299 |    return iommu_get_uapi_version();
         |           ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/iommu_get_uapi_version +2299 drivers/vfio/vfio_iommu_type1.c

  2281	
  2282	static long vfio_iommu_type1_ioctl(void *iommu_data,
  2283					   unsigned int cmd, unsigned long arg)
  2284	{
  2285		struct vfio_iommu *iommu = iommu_data;
  2286		unsigned long minsz;
  2287	
  2288		if (cmd == VFIO_CHECK_EXTENSION) {
  2289			switch (arg) {
  2290			case VFIO_TYPE1_IOMMU:
  2291			case VFIO_TYPE1v2_IOMMU:
  2292			case VFIO_TYPE1_NESTING_IOMMU:
  2293				return 1;
  2294			case VFIO_DMA_CC_IOMMU:
  2295				if (!iommu)
  2296					return 0;
  2297				return vfio_domains_have_iommu_cache(iommu);
  2298			case VFIO_NESTING_IOMMU_UAPI:
> 2299				return iommu_get_uapi_version();
  2300			default:
  2301				return 0;
  2302			}
  2303		} else if (cmd == VFIO_IOMMU_GET_INFO) {
  2304			struct vfio_iommu_type1_info info;
  2305			struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
  2306			unsigned long capsz;
  2307			int ret;
  2308	
  2309			minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
  2310	
  2311			/* For backward compatibility, cannot require this */
  2312			capsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
  2313	
  2314			if (copy_from_user(&info, (void __user *)arg, minsz))
  2315				return -EFAULT;
  2316	
  2317			if (info.argsz < minsz)
  2318				return -EINVAL;
  2319	
  2320			if (info.argsz >= capsz) {
  2321				minsz = capsz;
  2322				info.cap_offset = 0; /* output, no-recopy necessary */
  2323			}
  2324	
  2325			info.flags = VFIO_IOMMU_INFO_PGSIZES;
  2326	
  2327			info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
  2328	
  2329			ret = vfio_iommu_iova_build_caps(iommu, &caps);
  2330			if (ret)
  2331				return ret;
  2332	
  2333			ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
  2334			if (ret)
  2335				return ret;
  2336	
  2337			if (caps.size) {
  2338				info.flags |= VFIO_IOMMU_INFO_CAPS;
  2339	
  2340				if (info.argsz < sizeof(info) + caps.size) {
  2341					info.argsz = sizeof(info) + caps.size;
  2342				} else {
  2343					vfio_info_cap_shift(&caps, sizeof(info));
  2344					if (copy_to_user((void __user *)arg +
  2345							sizeof(info), caps.buf,
  2346							caps.size)) {
  2347						kfree(caps.buf);
  2348						return -EFAULT;
  2349					}
  2350					info.cap_offset = sizeof(info);
  2351				}
  2352	
  2353				kfree(caps.buf);
  2354			}
  2355	
  2356			return copy_to_user((void __user *)arg, &info, minsz) ?
  2357				-EFAULT : 0;
  2358	
  2359		} else if (cmd == VFIO_IOMMU_MAP_DMA) {
  2360			struct vfio_iommu_type1_dma_map map;
  2361			uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
  2362					VFIO_DMA_MAP_FLAG_WRITE;
  2363	
  2364			minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
  2365	
  2366			if (copy_from_user(&map, (void __user *)arg, minsz))
  2367				return -EFAULT;
  2368	
  2369			if (map.argsz < minsz || map.flags & ~mask)
  2370				return -EINVAL;
  2371	
  2372			return vfio_dma_do_map(iommu, &map);
  2373	
  2374		} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
  2375			struct vfio_iommu_type1_dma_unmap unmap;
  2376			long ret;
  2377	
  2378			minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
  2379	
  2380			if (copy_from_user(&unmap, (void __user *)arg, minsz))
  2381				return -EFAULT;
  2382	
  2383			if (unmap.argsz < minsz || unmap.flags)
  2384				return -EINVAL;
  2385	
  2386			ret = vfio_dma_do_unmap(iommu, &unmap);
  2387			if (ret)
  2388				return ret;
  2389	
  2390			return copy_to_user((void __user *)arg, &unmap, minsz) ?
  2391				-EFAULT : 0;
  2392	
  2393		} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
  2394			struct vfio_iommu_type1_pasid_request req;
  2395			unsigned long offset;
  2396	
  2397			minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
  2398					    flags);
  2399	
  2400			if (copy_from_user(&req, (void __user *)arg, minsz))
  2401				return -EFAULT;
  2402	
  2403			if (req.argsz < minsz ||
  2404			    !vfio_iommu_type1_pasid_req_valid(req.flags))
  2405				return -EINVAL;
  2406	
  2407			if (copy_from_user((void *)&req + minsz,
  2408					   (void __user *)arg + minsz,
  2409					   sizeof(req) - minsz))
  2410				return -EFAULT;
  2411	
  2412			switch (req.flags & VFIO_PASID_REQUEST_MASK) {
  2413			case VFIO_IOMMU_PASID_ALLOC:
  2414			{
  2415				int ret = 0, result;
  2416	
  2417				result = vfio_iommu_type1_pasid_alloc(iommu,
  2418								req.alloc_pasid.min,
  2419								req.alloc_pasid.max);
  2420				if (result > 0) {
  2421					offset = offsetof(
  2422						struct vfio_iommu_type1_pasid_request,
  2423						alloc_pasid.result);
  2424					ret = copy_to_user(
  2425						      (void __user *) (arg + offset),
  2426						      &result, sizeof(result));
  2427				} else {
  2428					pr_debug("%s: PASID alloc failed\n", __func__);
  2429					ret = -EFAULT;
  2430				}
  2431				return ret;
  2432			}
  2433			case VFIO_IOMMU_PASID_FREE:
  2434				return vfio_iommu_type1_pasid_free(iommu,
  2435								   req.free_pasid);
  2436			default:
  2437				return -EINVAL;
  2438			}
  2439		}
  2440	
  2441		return -ENOTTY;
  2442	}
  2443	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 46471 bytes --]

* RE: [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
                   ` (7 preceding siblings ...)
  2020-03-22 12:32 ` [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu, Yi L
@ 2020-03-26 12:56 ` Liu, Yi L
  8 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-26 12:56 UTC (permalink / raw)
  To: alex.williamson, eric.auger
  Cc: Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> To: alex.williamson@redhat.com; eric.auger@redhat.com
> Subject: [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> 
> This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> guest application address space with passthru devices. This is called
> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> changes. For IOMMU and QEMU changes, they are in separate series (listed
> in the "Related series").
> 
> The high-level architecture for SVA virtualization is as below, the key
> design of vSVA support is to utilize the dual-stage IOMMU translation (
> also known as IOMMU nesting translation) capability in host IOMMU.
> 
> 
>     .-------------.  .---------------------------.
>     |   vIOMMU    |  | Guest process CR3, FL only|
>     |             |  '---------------------------'
>     .----------------/
>     | PASID Entry |--- PASID cache flush -
>     '-------------'                       |
>     |             |                       V
>     |             |                CR3 in GPA
>     '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>       v        v                          v
> Host
>     .-------------.  .----------------------.
>     |   pIOMMU    |  | Bind FL for GVA-GPA  |
>     |             |  '----------------------'
>     .----------------/  |
>     | PASID Entry |     V (Nested xlate)
>     '----------------\.------------------------------.
>     |             |   |SL for GPA-HPA, default domain|
>     |             |   '------------------------------'
>     '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables
> 
> There are roughly four parts in this patchset which are
> corresponding to the basic vSVA support for PCI device
> assignment
>  1. vfio support for PASID allocation and free for VMs
>  2. vfio support for guest page table binding request from VMs
>  3. vfio support for IOMMU cache invalidation from VMs
>  4. vfio support for vSVA usage on IOMMU-backed mdevs
> 
> The complete vSVA kernel upstream patches are divided into three phases:
>     1. Common APIs and PCI device direct assignment
>     2. IOMMU-backed Mediated Device assignment
>     3. Page Request Services (PRS) support
> 
> This patchset is aiming for the phase 1 and phase 2, and based on Jacob's
> below series.
> [PATCH V10 00/11] Nested Shared Virtual Address (SVA) VT-d support:
> https://lkml.org/lkml/2020/3/20/1172
> 
> Complete set for current vSVA can be found in below branch.
> https://github.com/luxis1999/linux-vsva.git: vsva-linux-5.6-rc6
> 
> The corresponding QEMU patch series is as below, complete QEMU set can be
> found in below branch.
> [PATCH v1 00/22] intel_iommu: expose Shared Virtual Addressing to VMs
> complete QEMU set can be found in below link:
> https://github.com/luxis1999/qemu.git: sva_vtd_v10_v1

The ioasid extension that this series depends on is available at the link below:

https://lkml.org/lkml/2020/3/25/874

Regards,
Yi Liu

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
  2020-03-22 16:21   ` kbuild test robot
@ 2020-03-30  8:32   ` Tian, Kevin
  2020-03-30 14:36     ` Liu, Yi L
  2020-03-31  7:53   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30  8:32 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> For a long time, devices have only one DMA address space from platform
> IOMMU's point of view. This is true for both bare metal and directed-
> access in virtualization environment. Reason is the source ID of DMA in
> PCIe are BDF (bus/dev/fnc ID), which results in only device granularity

are->is

> DMA isolation. However, this is changing with the latest advancement in
> I/O technology area. More and more platform vendors are utilizing the PCIe
> PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> address spaces as identified by their individual PASIDs. For example,
> Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> let device access multiple process virtual address space by binding the

"address space" -> "address spaces"

"binding the" -> "binding each"

> virtual address space with a PASID. Wherein the PASID is allocated in
> software and programmed to device per device specific manner. Devices
> which support PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software are also able to bind
> guest process virtual address space on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASID and program it to device
> directly. This is a dangerous situation since it has potential PASID
> conflicts and unauthorized address space access. It would be safer to
> let host intercept in the guest software's PASID allocation. Thus PASID
> are managed system-wide.
> 
> This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> passdown
> PASID allocation/free request from the virtual IOMMU. Additionally, such

"Additionally, because such"

> requests are intended to be invoked by QEMU or other applications which

simplify to "intended to be invoked from userspace"

> are running in userspace, it is necessary to have a mechanism to prevent
> single application from abusing available PASIDs in system. With such
> consideration, this patch tracks the VFIO PASID allocation per-VM. There
> was a discussion to make quota to be per assigned devices. e.g. if a VM
> has many assigned devices, then it should have more quota. However, it
> is not sure how many PASIDs an assigned devices will use. e.g. it is

devices -> device

> possible that a VM with multiples assigned devices but requests less
> PASIDs. Therefore per-VM quota would be better.
> 
> This patch uses struct mm pointer as a per-VM token. We also considered
> using task structure pointer and vfio_iommu structure pointer. However,
> task structure is per-thread, which means it cannot achieve per-VM PASID
> alloc tracking purpose. While for vfio_iommu structure, it is visible
> only within vfio. Therefore, structure mm pointer is selected. This patch
> adds a structure vfio_mm. A vfio_mm is created when the first vfio
> container is opened by a VM. On the reverse order, vfio_mm is free when
> the last vfio container is released. Each VM is assigned with a PASID
> quota, so that it is not able to request PASID beyond its quota. This
> patch adds a default quota of 1000. This quota could be tuned by
> administrator. Making PASID quota tunable will be added in another patch
> in this series.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio.c             | 130
> ++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 104
> ++++++++++++++++++++++++++++++++
>  include/linux/vfio.h            |  20 +++++++
>  include/uapi/linux/vfio.h       |  41 +++++++++++++
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/sched/mm.h>
> 
>  #define DRIVER_VERSION	"0.3"
>  #define DRIVER_AUTHOR	"Alex Williamson
> <alex.williamson@redhat.com>"
> @@ -46,6 +47,8 @@ static struct vfio {
>  	struct mutex			group_lock;
>  	struct cdev			group_cdev;
>  	dev_t				group_devt;
> +	struct list_head		vfio_mm_list;
> +	struct mutex			vfio_mm_lock;
>  	wait_queue_head_t		release_q;
>  } vfio;
> 
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> enum vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
> 
>  /**
> + * VFIO_MM objects - create, release, get, put, search

Why capitalize vfio_mm?

> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> +	struct vfio_mm *vmm;
> +	struct vfio_mm_token *token;
> +	int ret = 0;
> +
> +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> +	if (!vmm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Per mm IOASID set used for quota control and group operations
> */
> +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> +			       VFIO_DEFAULT_PASID_QUOTA, &vmm-
> >ioasid_sid);
> +	if (ret) {
> +		kfree(vmm);
> +		return ERR_PTR(ret);
> +	}
> +
> +	kref_init(&vmm->kref);
> +	token = &vmm->token;
> +	token->val = mm;
> +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +	mutex_init(&vmm->pasid_lock);
> +
> +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> +
> +	return vmm;
> +}
> +
> +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> +{
> +	/* destroy the ioasid set */
> +	ioasid_free_set(vmm->ioasid_sid, true);

Do we need to hold the pasid lock here, since this attempts to destroy
a set which might still be referenced by vfio_mm_pasid_free? Or is
there a guarantee that such a race won't happen?

> +	mutex_unlock(&vfio.vfio_mm_lock);
> +	kfree(vmm);
> +}
> +
> +/* called with vfio.vfio_mm_lock held */
> +static void vfio_mm_release(struct kref *kref)
> +{
> +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> +
> +	list_del(&vmm->vfio_next);
> +	vfio_mm_unlock_and_free(vmm);
> +}
> +
> +void vfio_mm_put(struct vfio_mm *vmm)
> +{
> +	kref_put_mutex(&vmm->kref, vfio_mm_release,
> &vfio.vfio_mm_lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_put);
> +
> +/* Assume vfio_mm_lock or vfio_mm reference is held */
> +static void vfio_mm_get(struct vfio_mm *vmm)
> +{
> +	kref_get(&vmm->kref);
> +}
> +
> +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> +{
> +	struct mm_struct *mm = get_task_mm(task);
> +	struct vfio_mm *vmm;
> +	unsigned long long val = (unsigned long long) mm;
> +
> +	mutex_lock(&vfio.vfio_mm_lock);
> +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> +		if (vmm->token.val == val) {
> +			vfio_mm_get(vmm);
> +			goto out;
> +		}
> +	}
> +
> +	vmm = vfio_create_mm(mm);
> +	if (IS_ERR(vmm))
> +		vmm = NULL;
> +out:
> +	mutex_unlock(&vfio.vfio_mm_lock);
> +	mmput(mm);

I assume this has been discussed before, but from a readability point
of view it might be good to add a comment to this function explaining
how the mm recorded in vfio_mm is correctly removed when the mm is
being destroyed, since we don't hold a reference to the mm here.

> +	return vmm;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> +
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +{
> +	ioasid_t pasid;
> +	int ret = -ENOSPC;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +
> +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> +	if (pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto out_unlock;
> +	}
> +
> +	ret = pasid;
> +out_unlock:
> +	mutex_unlock(&vmm->pasid_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> +
> +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> +{
> +	void *pdata;
> +	int ret = 0;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> +	if (IS_ERR(pdata)) {
> +		ret = PTR_ERR(pdata);
> +		goto out_unlock;
> +	}
> +	ioasid_free(pasid);
> +
> +out_unlock:
> +	mutex_unlock(&vmm->pasid_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> +
> +/**
>   * Module/class support
>   */
>  static char *vfio_devnode(struct device *dev, umode_t *mode)
> @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
>  	idr_init(&vfio.group_idr);
>  	mutex_init(&vfio.group_lock);
>  	mutex_init(&vfio.iommu_drivers_lock);
> +	mutex_init(&vfio.vfio_mm_lock);
>  	INIT_LIST_HEAD(&vfio.group_list);
>  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
>  	init_waitqueue_head(&vfio.release_q);
> 
>  	ret = misc_register(&vfio_dev);
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index a177bf2..331ceee 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	struct vfio_mm		*vmm;
>  };
> 
>  struct vfio_domain {
> @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void
> *iommu_data,
>  static void *vfio_iommu_type1_open(unsigned long arg)
>  {
>  	struct vfio_iommu *iommu;
> +	struct vfio_mm *vmm = NULL;
> 
>  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
>  	if (!iommu)
> @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned
> long arg)
>  	iommu->dma_avail = dma_entry_limit;
>  	mutex_init(&iommu->lock);
>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> +	vmm = vfio_mm_get_from_task(current);
> +	if (!vmm)
> +		pr_err("Failed to get vfio_mm track\n");

I assume error should be returned when pr_err is used...

> +	iommu->vmm = vmm;
> 
>  	return iommu;
>  }
> @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> *iommu_data)
>  	}
> 
>  	vfio_iommu_iova_free(&iommu->iova_list);
> +	if (iommu->vmm)
> +		vfio_mm_put(iommu->vmm);
> 
>  	kfree(iommu);
>  }
> @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> vfio_iommu *iommu,
>  	return ret;
>  }
> 
> +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)

I don't think you need the "vfio_iommu_type1" prefix for every new
function here, especially for a leaf internal function such as this one.

> +{
> +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> +		  flags & VFIO_IOMMU_PASID_FREE));
> +}
> +
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> +					 int min,
> +					 int max)
> +{
> +	struct vfio_mm *vmm = iommu->vmm;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EFAULT;

why -EFAULT?

> +		goto out_unlock;
> +	}
> +	if (vmm)
> +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> +	else
> +		ret = -EINVAL;
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> +				       unsigned int pasid)
> +{
> +	struct vfio_mm *vmm = iommu->vmm;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EFAULT;

ditto

> +		goto out_unlock;
> +	}
> +
> +	if (vmm)
> +		ret = vfio_mm_pasid_free(vmm, pasid);
> +	else
> +		ret = -EINVAL;
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> 
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +
> +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> +		struct vfio_iommu_type1_pasid_request req;
> +		unsigned long offset;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> +				    flags);
> +
> +		if (copy_from_user(&req, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (req.argsz < minsz ||
> +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> +			return -EINVAL;
> +
> +		if (copy_from_user((void *)&req + minsz,
> +				   (void __user *)arg + minsz,
> +				   sizeof(req) - minsz))
> +			return -EFAULT;

Why copy in two steps instead of copying everything at once?

> +
> +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> +		case VFIO_IOMMU_PASID_ALLOC:
> +		{
> +			int ret = 0, result;
> +
> +			result = vfio_iommu_type1_pasid_alloc(iommu,
> +							req.alloc_pasid.min,
> +							req.alloc_pasid.max);
> +			if (result > 0) {
> +				offset = offsetof(
> +					struct
> vfio_iommu_type1_pasid_request,
> +					alloc_pasid.result);
> +				ret = copy_to_user(
> +					      (void __user *) (arg + offset),
> +					      &result, sizeof(result));
> +			} else {
> +				pr_debug("%s: PASID alloc failed\n",
> __func__);
> +				ret = -EFAULT;

No, this branch is not for a copy_to_user error; it is about PASID
alloc failure. You should handle both.
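
For illustration only, a minimal userspace sketch of keeping the two
failure causes apart. fake_pasid_alloc() and fake_copy_out() are
hypothetical stand-ins for the allocator and for copy_to_user(); they
are not kernel APIs, and the "exhausted"/"fault" flags exist only to
simulate each failure.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for the PASID allocator: returns a PASID in
 * [min, max] or a negative errno. "exhausted" simulates a used-up quota. */
static int fake_pasid_alloc(int min, int max, int exhausted)
{
	if (min < 0 || min > max)
		return -EINVAL;
	return exhausted ? -ENOSPC : min;
}

/* Hypothetical stand-in for copy_to_user(): returns the number of bytes
 * left uncopied, so 0 means success. "fault" simulates a bad user pointer. */
static unsigned long fake_copy_out(void *dst, const void *src,
				   unsigned long n, int fault)
{
	if (fault)
		return n;
	memcpy(dst, src, n);
	return 0;
}

/* Returns 0 on success; the two failure causes keep distinct errnos. */
static int pasid_alloc_request(int min, int max, int exhausted, int fault,
			       int *uresult)
{
	int result;

	assert(uresult != NULL);

	result = fake_pasid_alloc(min, max, exhausted);
	if (result < 0)
		return result;		/* allocation failed: pass its errno on */

	if (fake_copy_out(uresult, &result, sizeof(result), fault))
		return -EFAULT;		/* only a failed copy-out is -EFAULT */

	return 0;
}
```

With this shape a caller sees -ENOSPC when the quota is exhausted and
-EFAULT only when the result could not be written back.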

> +			}
> +			return ret;
> +		}
> +		case VFIO_IOMMU_PASID_FREE:
> +			return vfio_iommu_type1_pasid_free(iommu,
> +							   req.free_pasid);
> +		default:
> +			return -EINVAL;
> +		}
>  	}
> 
>  	return -ENOTTY;
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711..75f9f7f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> vfio_iommu_driver_ops *ops);
>  extern void vfio_unregister_iommu_driver(
>  				const struct vfio_iommu_driver_ops *ops);
> 
> +#define VFIO_DEFAULT_PASID_QUOTA	1000
> +struct vfio_mm_token {
> +	unsigned long long val;
> +};
> +
> +struct vfio_mm {
> +	struct kref			kref;
> +	struct vfio_mm_token		token;
> +	int				ioasid_sid;
> +	/* protect @pasid_quota field and pasid allocation/free */
> +	struct mutex			pasid_lock;
> +	int				pasid_quota;
> +	struct list_head		vfio_next;
> +};
> +
> +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> +extern void vfio_mm_put(struct vfio_mm *vmm);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> +
>  /*
>   * External user API
>   */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a1..298ac80 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> 
> +/*
> + * PASID (Process Address Space ID) is a PCIe concept which
> + * has been extended to support DMA isolation in fine-grain.
> + * With device assigned to user space (e.g. VMs), PASID alloc
> + * and free need to be system wide. This structure defines
> + * the info for pasid alloc/free between user space and kernel
> + * space.
> + *
> + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> + */
> +struct vfio_iommu_type1_pasid_request {
> +	__u32	argsz;
> +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> +	__u32	flags;
> +	union {
> +		struct {
> +			__u32 min;
> +			__u32 max;
> +			__u32 result;

result->pasid?

> +		} alloc_pasid;
> +		__u32 free_pasid;

what about putting a common pasid field after flags?
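
One possible reading of that layout, as a sketch only. The struct,
macro and helper names below are hypothetical and not part of the
patch or the VFIO uAPI; the assumption is that a single pasid field
carries the result for ALLOC and the input for FREE, with the range
following it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PASID_ALLOC	(1u << 0)
#define SKETCH_PASID_FREE	(1u << 1)

/* Hypothetical request layout with a common pasid field after flags. */
struct sketch_pasid_request {
	uint32_t argsz;
	uint32_t flags;
	uint32_t pasid;		/* ALLOC: result out; FREE: PASID to release */
	struct {
		uint32_t min;
		uint32_t max;
	} range;		/* only meaningful for ALLOC */
};

/* Smallest argsz that covers every field the given operation uses. */
static size_t sketch_min_argsz(uint32_t flags)
{
	/* the caller is expected to have rejected other flag combinations */
	assert(flags == SKETCH_PASID_ALLOC || flags == SKETCH_PASID_FREE);

	if (flags == SKETCH_PASID_FREE)
		return offsetof(struct sketch_pasid_request, pasid) +
		       sizeof(uint32_t);
	return sizeof(struct sketch_pasid_request);
}
```

In this arrangement a FREE request only needs the first three fields,
while ALLOC needs the whole structure.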

> +	};
> +};
> +
> +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
> +					 VFIO_IOMMU_PASID_FREE)
> +
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + *				struct vfio_iommu_type1_pasid_request)
> + *
> + * Availability of this feature depends on PASID support in the device,
> + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> + * is available after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> 22)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> 
>  /*
> --
> 2.7.4



* RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-22 12:31 ` [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning Liu, Yi L
  2020-03-22 17:20   ` kbuild test robot
@ 2020-03-30  8:40   ` Tian, Kevin
  2020-03-30  8:52     ` Liu, Yi L
  1 sibling, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30  8:40 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch adds a module option to make the PASID quota tunable by
> administrator.
> 
> TODO: needs to think more on how to  make the tuning to be per-process.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio.c             | 8 +++++++-
>  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
>  include/linux/vfio.h            | 3 ++-
>  3 files changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index d13b483..020a792 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct
> task_struct *task)
>  }
>  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> 
> -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
>  {
>  	ioasid_t pasid;
>  	int ret = -ENOSPC;
> 
>  	mutex_lock(&vmm->pasid_lock);
> 
> +	/* update quota as it is tunable by admin */
> +	if (vmm->pasid_quota != quota) {
> +		vmm->pasid_quota = quota;
> +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> +	}
> +

It's a bit weird to have the quota adjusted in the alloc path, since the
latter might be initiated by non-privileged users. Why not do the simple
math in vfio_create_mm to set the quota when the ioasid set is created?
Even if in the future you allow per-process quota setting, that should
come from a separate privileged path instead of through alloc...
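
A minimal userspace model of that alternative, for illustration only.
struct sketch_vmm and both helpers are hypothetical names; the point
is just that the quota is sampled once at creation and the alloc path
never writes it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-mm tracking object, reduced to the quota bookkeeping. */
struct sketch_vmm {
	int quota;	/* fixed when the object is created */
	int used;	/* PASIDs currently allocated */
};

/* The quota is read once here, e.g. from a module parameter. */
static void sketch_vmm_create(struct sketch_vmm *vmm, int quota)
{
	assert(vmm != NULL);
	vmm->quota = quota;
	vmm->used = 0;
}

/* The alloc path only checks the quota; it never modifies it. */
static int sketch_pasid_alloc(struct sketch_vmm *vmm)
{
	if (vmm->used >= vmm->quota)
		return -ENOSPC;
	vmm->used++;
	return 0;
}
```

A later change to the parameter would then affect only objects created
afterwards, which is the trade-off against runtime tuning.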

>  	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
>  	if (pasid == INVALID_IOASID) {
>  		ret = -ENOSPC;
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 331ceee..e40afc0 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -60,6 +60,11 @@ module_param_named(dma_entry_limit,
> dma_entry_limit, uint, 0644);
>  MODULE_PARM_DESC(dma_entry_limit,
>  		 "Maximum number of user DMA mappings per container
> (65535).");
> 
> +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +module_param_named(pasid_quota, pasid_quota, uint, 0644);
> +MODULE_PARM_DESC(pasid_quota,
> +		 "Quota of user owned PASIDs per vfio-based application
> (1000).");
> +
>  struct vfio_iommu {
>  	struct list_head	domain_list;
>  	struct list_head	iova_list;
> @@ -2200,7 +2205,7 @@ static int vfio_iommu_type1_pasid_alloc(struct
> vfio_iommu *iommu,
>  		goto out_unlock;
>  	}
>  	if (vmm)
> -		ret = vfio_mm_pasid_alloc(vmm, min, max);
> +		ret = vfio_mm_pasid_alloc(vmm, pasid_quota, min, max);
>  	else
>  		ret = -EINVAL;
>  out_unlock:
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 75f9f7f1..af2ef78 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -106,7 +106,8 @@ struct vfio_mm {
> 
>  extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
>  extern void vfio_mm_put(struct vfio_mm *vmm);
> -extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm,
> +				int quota, int min, int max);
>  extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> 
>  /*
> --
> 2.7.4



* RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-30  8:40   ` Tian, Kevin
@ 2020-03-30  8:52     ` Liu, Yi L
  2020-03-30  9:19       ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-30  8:52 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 4:41 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota
> tuning
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch adds a module option to make the PASID quota tunable by
> > administrator.
> >
> > TODO: needs to think more on how to  make the tuning to be per-process.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio.c             | 8 +++++++-
> >  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
> >  include/linux/vfio.h            | 3 ++-
> >  3 files changed, 15 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index d13b483..020a792 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct
> > task_struct *task)
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> >
> > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
> >  {
> >  	ioasid_t pasid;
> >  	int ret = -ENOSPC;
> >
> >  	mutex_lock(&vmm->pasid_lock);
> >
> > +	/* update quota as it is tunable by admin */
> > +	if (vmm->pasid_quota != quota) {
> > +		vmm->pasid_quota = quota;
> > +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> > +	}
> > +
> 
> It's a bit weird to have quota adjusted in the alloc path, since the latter might
> be initiated by non-privileged users. Why not doing the simple math in vfio_
> create_mm to set the quota when the ioasid set is created? even in the future
> you may allow per-process quota setting, that should come from separate
> privileged path instead of thru alloc..

The reason is that modifying the kernel parameter generates no event
that can be used to adjust the quota, so I chose to adjust it in the
pasid_alloc path. If that's not good, how about adding one more ioctl
to let userspace trigger a quota adjustment event? Then, even though a
non-privileged user could trigger the adjustment, the quota value
itself is still controlled by the privileged user. What is your opinion?

Regards,
Yi Liu


* RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-30  8:52     ` Liu, Yi L
@ 2020-03-30  9:19       ` Tian, Kevin
  2020-03-30  9:26         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30  9:19 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, March 30, 2020 4:53 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Monday, March 30, 2020 4:41 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter
> for quota
> > tuning
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > This patch adds a module option to make the PASID quota tunable by
> > > administrator.
> > >
> > > TODO: needs to think more on how to  make the tuning to be per-process.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > >  drivers/vfio/vfio.c             | 8 +++++++-
> > >  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
> > >  include/linux/vfio.h            | 3 ++-
> > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index d13b483..020a792 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> *vfio_mm_get_from_task(struct
> > > task_struct *task)
> > >  }
> > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > >
> > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int
> max)
> > >  {
> > >  	ioasid_t pasid;
> > >  	int ret = -ENOSPC;
> > >
> > >  	mutex_lock(&vmm->pasid_lock);
> > >
> > > +	/* update quota as it is tunable by admin */
> > > +	if (vmm->pasid_quota != quota) {
> > > +		vmm->pasid_quota = quota;
> > > +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > +	}
> > > +
> >
> > It's a bit weird to have quota adjusted in the alloc path, since the latter
> might
> > be initiated by non-privileged users. Why not doing the simple math in
> vfio_
> > create_mm to set the quota when the ioasid set is created? even in the
> future
> > you may allow per-process quota setting, that should come from separate
> > privileged path instead of thru alloc..
> 
> The reason is the kernel parameter modification has no event which
> can be used to adjust the quota. So I chose to adjust it in pasid_alloc
> path. If it's not good, how about adding one more IOCTL to let user-
> space trigger a quota adjustment event? Then even non-privileged
> user could trigger quota adjustment, the quota is actually controlled
> by privileged user. How about your opinion?
> 

Why do you need an event to adjust? As I said, you can set the quota
when the set is created in vfio_create_mm...

Thanks
Kevin


* RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-30  9:19       ` Tian, Kevin
@ 2020-03-30  9:26         ` Liu, Yi L
  2020-03-30 11:44           ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-30  9:26 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 5:20 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota
> tuning
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, March 30, 2020 4:53 PM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Monday, March 30, 2020 4:41 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1
> > > parameter
> > for quota
> > > tuning
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > This patch adds a module option to make the PASID quota tunable by
> > > > administrator.
> > > >
> > > > TODO: needs to think more on how to  make the tuning to be per-process.
> > > >
> > > > Previous discussions:
> > > > https://patchwork.kernel.org/patch/11209429/
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > >  drivers/vfio/vfio.c             | 8 +++++++-
> > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
> > > >  include/linux/vfio.h            | 3 ++-
> > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > d13b483..020a792 100644
> > > > --- a/drivers/vfio/vfio.c
> > > > +++ b/drivers/vfio/vfio.c
> > > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> > *vfio_mm_get_from_task(struct
> > > > task_struct *task)
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > >
> > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min,
> > > > +int
> > max)
> > > >  {
> > > >  	ioasid_t pasid;
> > > >  	int ret = -ENOSPC;
> > > >
> > > >  	mutex_lock(&vmm->pasid_lock);
> > > >
> > > > +	/* update quota as it is tunable by admin */
> > > > +	if (vmm->pasid_quota != quota) {
> > > > +		vmm->pasid_quota = quota;
> > > > +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > +	}
> > > > +
> > >
> > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > latter
> > might
> > > be initiated by non-privileged users. Why not doing the simple math
> > > in
> > vfio_
> > > create_mm to set the quota when the ioasid set is created? even in
> > > the
> > future
> > > you may allow per-process quota setting, that should come from
> > > separate privileged path instead of thru alloc..
> >
> > The reason is the kernel parameter modification has no event which can
> > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > path. If it's not good, how about adding one more IOCTL to let user-
> > space trigger a quota adjustment event? Then even non-privileged user
> > could trigger quota adjustment, the quota is actually controlled by
> > privileged user. How about your opinion?
> >
> 
> why do you need an event to adjust? As I said, you can set the quota when the set is
> created in vfio_create_mm...

Oh, it's to support runtime adjustment. I guess it may be helpful to
keep the per-VM quota tunable even while the VM is running. If the quota
is only set in vfio_create_mm(), it cannot be adjusted at runtime.

Regards,
Yi Liu


* RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-03-22 12:32 ` [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace Liu, Yi L
@ 2020-03-30  9:43   ` Tian, Kevin
  2020-04-01  7:46     ` Liu, Yi L
  2020-04-01  9:41   ` Auger Eric
  2020-04-02 18:01   ` Alex Williamson
  2 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30  9:43 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
> thus userspace could do a pre-check before utilizing this feature.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  8 ++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index e40afc0..ddd1ffe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
>  	return ret;
>  }
> 
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> +					 struct vfio_info_cap *caps)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +
> +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	nesting_cap = container_of(header,
> +				struct vfio_iommu_type1_info_cap_nesting,
> +				header);
> +
> +	nesting_cap->nesting_capabilities = 0;
> +	if (iommu->nesting) {

Is it good to report a nesting cap when iommu->nesting is disabled? I suppose
the check should move before vfio_info_cap_add...
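
One way to read the suggestion above, as a standalone userspace sketch
rather than the actual vfio code: all types and names below are
hypothetical stand-ins, and the only point is the ordering (test the
nesting flag first, append the capability second).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the vfio structures. */
struct sketch_iommu {
	bool nesting;
};

struct sketch_caps {
	size_t count;          /* number of capabilities appended so far */
	uint32_t nesting_bits; /* payload of the nesting capability, if any */
};

#define SKETCH_CAP_PASID_REQ (1u << 0)

/*
 * Append the nesting capability only when nesting is enabled, so that
 * userspace never sees an empty nesting capability in the chain.
 */
static int sketch_add_nesting_cap(const struct sketch_iommu *iommu,
				  struct sketch_caps *caps)
{
	if (!iommu->nesting)
		return 0; /* nothing to report, and not an error */

	caps->count++;
	caps->nesting_bits = SKETCH_CAP_PASID_REQ;
	return 0;
}
```

With this ordering, the absence of the capability itself tells userspace
that nesting is off, instead of a capability whose bits are all zero.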

> +		/* nesting iommu type supports PASID requests (alloc/free)
> */
> +		nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;

VFIO_IOMMU_CAP_PASID_REQ? to avoid confusion with ioctl cmd
VFIO_IOMMU_PASID_REQUEST...

> +	}
> +
> +	return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>  		if (ret)
>  			return ret;
> 
> +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> +		if (ret)
> +			return ret;
> +
>  		if (caps.size) {
>  			info.flags |= VFIO_IOMMU_INFO_CAPS;
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 298ac80..8837219 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
>  	struct	vfio_iova_range iova_ranges[];
>  };
> 
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> +	struct	vfio_info_cap_header header;
> +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
> +	__u32	nesting_capabilities;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
>  /**
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-30  9:26         ` Liu, Yi L
@ 2020-03-30 11:44           ` Tian, Kevin
  2020-04-02 17:58             ` Alex Williamson
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30 11:44 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, March 30, 2020 5:27 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Monday, March 30, 2020 5:20 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter
> for quota
> > tuning
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, March 30, 2020 4:53 PM
> > >
> > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Monday, March 30, 2020 4:41 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1
> > > > parameter
> > > for quota
> > > > tuning
> > > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > >
> > > > > This patch adds a module option to make the PASID quota tunable by
> > > > > administrator.
> > > > >
> > > > > TODO: needs to think more on how to  make the tuning to be per-
> process.
> > > > >
> > > > > Previous discussions:
> > > > > https://patchwork.kernel.org/patch/11209429/
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > ---
> > > > >  drivers/vfio/vfio.c             | 8 +++++++-
> > > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
> > > > >  include/linux/vfio.h            | 3 ++-
> > > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > > d13b483..020a792 100644
> > > > > --- a/drivers/vfio/vfio.c
> > > > > +++ b/drivers/vfio/vfio.c
> > > > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> > > *vfio_mm_get_from_task(struct
> > > > > task_struct *task)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > >
> > > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min,
> > > > > +int
> > > max)
> > > > >  {
> > > > >  	ioasid_t pasid;
> > > > >  	int ret = -ENOSPC;
> > > > >
> > > > >  	mutex_lock(&vmm->pasid_lock);
> > > > >
> > > > > +	/* update quota as it is tunable by admin */
> > > > > +	if (vmm->pasid_quota != quota) {
> > > > > +		vmm->pasid_quota = quota;
> > > > > +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > > +	}
> > > > > +
> > > >
> > > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > > latter
> > > might
> > > > be initiated by non-privileged users. Why not doing the simple math
> > > > in
> > > vfio_
> > > > create_mm to set the quota when the ioasid set is created? even in
> > > > the
> > > future
> > > > you may allow per-process quota setting, that should come from
> > > > separate privileged path instead of thru alloc..
> > >
> > > The reason is the kernel parameter modification has no event which can
> > > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > > path. If it's not good, how about adding one more IOCTL to let user-
> > > space trigger a quota adjustment event? Then even non-privileged user
> > > could trigger quota adjustment, the quota is actually controlled by
> > > privileged user. How about your opinion?
> > >
> >
> > why do you need an event to adjust? As I said, you can set the quota when
> the set is
> > created in vfio_create_mm...
> 
> oh, it's to support runtime adjustments. I guess it may be helpful to let
> per-VM quota tunable even the VM is running. If just set the quota in
> vfio_create_mm(), it is not able to adjust at runtime.
> 

OK, I didn't notice that the module parameter was given write permission.
However, there is a further problem: we cannot support PASID reclaim now.
What if the admin sets a quota smaller than the previous value while some
IOASID sets already exceed the new quota? I'm not sure how to fail a runtime
module parameter change in that situation. Possibly a normal sysfs node
better suits the runtime change requirement...
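
To make the concern concrete, here is a minimal userspace model of a
quota setter that refuses to shrink below current usage. The struct, the
function name, and the choice of -EBUSY are assumptions for illustration
only; the real IOASID code may look quite different.

```c
#include <errno.h>

/* Hypothetical stand-in for a per-mm IOASID set; not the kernel's type. */
struct sketch_ioasid_set {
	unsigned int quota; /* maximum number of PASIDs the set may hold */
	unsigned int used;  /* PASIDs currently allocated from the set */
};

/*
 * Model of a store handler for a writable quota knob. Because allocated
 * PASIDs cannot be reclaimed, a new quota below current usage is
 * rejected and the old quota is left in place.
 */
static int sketch_set_quota(struct sketch_ioasid_set *set,
			    unsigned int new_quota)
{
	if (new_quota < set->used)
		return -EBUSY;

	set->quota = new_quota;
	return 0;
}
```

A sysfs store callback can hand such an error straight back to the
writer, which is the behavior asked for here.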

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
  2020-03-22 16:44   ` kbuild test robot
@ 2020-03-30 11:48   ` Tian, Kevin
  2020-04-01  7:38     ` Liu, Yi L
  2020-04-01  8:51   ` Auger Eric
  2020-04-02 19:20   ` Alex Williamson
  3 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30 11:48 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> capability to userspace. Thus applications like QEMU could support
> vIOMMU with hardware's nesting translation capability for pass-through
> devices. Before setting up nesting translation for pass-through devices,
> QEMU and other applications need to learn the supported 1st-lvl/stage-1
> translation structure format like page table format.
> 
> Take vSVA (virtual Shared Virtual Addressing) as an example, to support
> vSVA for pass-through devices, QEMU setup nesting translation for pass-
> through devices. The guest page table are configured to host as 1st-lvl/
> stage-1 page table. Therefore, guest format should be compatible with
> host side.
> 
> This patch reports the supported 1st-lvl/stage-1 page table format on the
> current platform to userspace. QEMU and other alike applications should
> use this format info when trying to setup IOMMU nesting translation on
> host IOMMU.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56
> +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  1 +
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 9aa2a67..82a9e0b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
>  	return ret;
>  }
> 
> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> +					 u32 *stage1_format)
> +{
> +	struct vfio_domain *domain;
> +	u32 format = 0, tmp_format = 0;
> +	int ret;
> +
> +	mutex_lock(&iommu->lock);
> +	if (list_empty(&iommu->domain_list)) {
> +		mutex_unlock(&iommu->lock);
> +		return -EINVAL;
> +	}
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (iommu_domain_get_attr(domain->domain,
> +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> +			ret = -EINVAL;
> +			format = 0;
> +			goto out_unlock;
> +		}
> +		/*
> +		 * format is always non-zero (the first format is
> +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> +		 * the reason of potential different backed IOMMU
> +		 * formats, here we expect to have identical formats
> +		 * in the domain list, no mixed formats support.
> +		 * return -EINVAL to fail the attempt of setup
> +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> +		 * are detected.
> +		 */
> +		if (tmp_format && tmp_format != format) {
> +			ret = -EINVAL;
> +			format = 0;
> +			goto out_unlock;
> +		}
> +
> +		tmp_format = format;
> +	}

This function is invoked only in the VFIO_IOMMU_GET_INFO path. If we don't
want to assume the status quo that one container holds only one
device w/ vIOMMU (the prerequisite for vSVA), it looks like we also need
to check format compatibility when attaching a new group to this
container?
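
A rough sketch of such an attach-time check, written as standalone C
with a hypothetical container struct (none of these names exist in
vfio). It keeps the patch's convention that a valid format is non-zero.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical container state: 0 means no stage-1 format recorded yet. */
struct sketch_container {
	uint32_t stage1_format;
};

/*
 * Called when a group is attached. The first group fixes the
 * container's format; later groups must match it or the attach fails.
 */
static int sketch_attach_group(struct sketch_container *c,
			       uint32_t group_format)
{
	if (group_format == 0)
		return -EINVAL; /* a valid format is always non-zero */

	if (c->stage1_format && c->stage1_format != group_format)
		return -EINVAL; /* mixed formats are not supported */

	c->stage1_format = group_format;
	return 0;
}
```

Checking at attach time means a mismatch is reported to the caller that
introduced it, not to a later VFIO_IOMMU_GET_INFO query.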

> +	ret = 0;
> +
> +out_unlock:
> +	if (format)
> +		*stage1_format = format;
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  					 struct vfio_info_cap *caps)
>  {
>  	struct vfio_info_cap_header *header;
>  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +	u32 formats = 0;
> +	int ret;
> +
> +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> +	if (ret) {
> +		pr_warn("Failed to get stage-1 format\n");
> +		return ret;
> +	}
> 
>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> 1);
> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct
> vfio_iommu *iommu,
>  		/* nesting iommu type supports PASID requests (alloc/free)
> */
>  		nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;
>  	}
> +	nesting_cap->stage1_formats = formats;
> 
>  	return 0;
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ed9881d..ebeaf3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
>  	struct	vfio_info_cap_header header;
>  #define VFIO_IOMMU_PASID_REQS	(1 << 0)
>  	__u32	nesting_capabilities;
> +	__u32	stage1_formats;

do you plan to support multiple formats? If not, use singular name.

>  };
> 
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-03-22 12:32 ` [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Liu, Yi L
  2020-03-22 18:10   ` kbuild test robot
@ 2020-03-30 12:46   ` Tian, Kevin
  2020-04-01  9:13     ` Liu, Yi L
  2020-04-02 19:57   ` Alex Williamson
  2 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30 12:46 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> hardware
> IOMMUs that have nesting DMA translation (a.k.a dual stage address
> translation). For such hardware IOMMUs, there are two stages/levels of
> address translation, and software may let userspace/VM to own the first-
> level/stage-1 translation structures. Example of such usage is vSVA (
> virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> translation structures and bind the structures to host, then hardware
> IOMMU would utilize nesting translation when doing DMA translation fo
> the devices behind such hardware IOMMU.
> 
> This patch adds vfio support for binding guest translation (a.k.a stage 1)
> structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only
> bind
> guest page table is needed, it also requires to expose interface to guest
> for iommu cache invalidation when guest modified the first-level/stage-1
> translation structures since hardware needs to be notified to flush stale
> iotlbs. This would be introduced in next patch.
> 
> In this patch, guest page table bind and unbind are done by using flags
> VFIO_IOMMU_BIND_GUEST_PGTBL and
> VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> struct iommu_gpasid_bind_data. Before binding guest page table to host,
> VM should have got a PASID allocated by host via
> VFIO_IOMMU_PASID_REQUEST.
> 
> Bind guest translation structures (here is guest page table) to host

Bind -> Binding

> are the first step to setup vSVA (Virtual Shared Virtual Addressing).

are -> is. and you already explained vSVA earlier.

> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 158
> ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  46 ++++++++++++
>  2 files changed, 204 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 82a9e0b..a877747 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -130,6 +130,33 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>  					(!list_empty(&iommu->domain_list))
> 
> +struct domain_capsule {
> +	struct iommu_domain *domain;
> +	void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> +		      int (*fn)(struct device *dev, void *data),
> +		      void *data)
> +{
> +	struct domain_capsule dc = {.data = data};
> +	struct vfio_domain *d;
> +	struct vfio_group *g;
> +	int ret = 0;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		dc.domain = d->domain;
> +		list_for_each_entry(g, &d->group_list, next) {
> +			ret = iommu_group_for_each_dev(g->iommu_group,
> +						       &dc, fn);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
> 
>  /*
> @@ -2314,6 +2341,88 @@ static int
> vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  	return 0;
>  }
> 
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *gbind_data =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +

In Jacob's vSVA iommu series, [PATCH 06/11]:

+		/* REVISIT: upper layer/VFIO can track host process that bind the PASID.
+		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
+		 * ownership.
+		 */

I asked him who exactly should be responsible for tracking the pasid
ownership. Although there is no response yet, I expect vfio/iommu to have
a clear policy, documented here as well, so that the message is
consistent.

> +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *gbind_data =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> +					gbind_data->hpasid);

Curious why we have to share the same bind_data structure
between bind and unbind, especially when unbind requires
only one field? I don't see a clear reason; the earlier
ALLOC/FREE don't share a structure either. The current way
simply wastes space for the unbind operation...
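
For illustration only, a split could look like the following. Both
structs are hypothetical; the bind layout is a simplified guess at the
kind of fields a bind needs, not the real struct iommu_gpasid_bind_data.

```c
#include <stdint.h>

/* Hypothetical bind payload with the fields a bind plausibly needs. */
struct sketch_bind_data {
	uint32_t version;
	uint32_t format;
	uint64_t flags;
	uint64_t gpgd;   /* guest page-table root */
	uint64_t hpasid; /* host PASID */
	uint64_t gpasid; /* guest PASID */
};

/* Hypothetical unbind payload: only the host PASID is needed. */
struct sketch_unbind_data {
	uint64_t hpasid;
};

/* Stand-in for the real unbind call: records which PASID was unbound. */
static int sketch_unbind(const struct sketch_unbind_data *data,
			 uint64_t *unbound)
{
	*unbound = data->hpasid;
	return 0;
}
```

The unbind path then copies a single field from userspace instead of
the whole bind structure.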

> +}
> +
> +/**
> + * Unbind specific gpasid, caller of this function requires hold
> + * vfio_iommu->lock
> + */
> +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	return vfio_iommu_for_each_dev(iommu,
> +				vfio_unbind_gpasid_fn, gbind_data);
> +}
> +
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_for_each_dev(iommu,
> +			vfio_bind_gpasid_fn, gbind_data);
> +	/*
> +	 * If bind failed, it may not be a total failure. Some devices
> +	 * within the iommu group may have bind successfully. Although
> +	 * we don't enable pasid capability for non-singletion iommu
> +	 * groups, a unbind operation would be helpful to ensure no
> +	 * partial binding for an iommu group.
> +	 */
> +	if (ret)
> +		/*
> +		 * Undo all binds that already succeeded, no need to

binds -> bindings

> +		 * check the return value here since some device within
> +		 * the group has no successful bind when coming to this
> +		 * place switch.
> +		 */

remove 'switch'

> +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>  		default:
>  			return -EINVAL;
>  		}
> +
> +	} else if (cmd == VFIO_IOMMU_BIND) {

BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.

> +		struct vfio_iommu_type1_bind bind;
> +		u32 version;
> +		int data_size;
> +		void *gbind_data;
> +		int ret;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> +
> +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* Get the version of struct iommu_gpasid_bind_data */
> +		if (copy_from_user(&version,
> +			(void __user *) (arg + minsz),
> +					sizeof(version)))
> +			return -EFAULT;
> +
> +		data_size = iommu_uapi_get_data_size(
> +				IOMMU_UAPI_BIND_GPASID, version);
> +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> +		if (!gbind_data)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(gbind_data,
> +			 (void __user *) (arg + minsz), data_size)) {
> +			kfree(gbind_data);
> +			return -EFAULT;
> +		}
> +
> +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> +							   gbind_data);
> +			break;
> +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> +							     gbind_data);
> +			break;
> +		default:
> +			ret = -EINVAL;
> +			break;
> +		}
> +		kfree(gbind_data);
> +		return ret;
>  	}
> 
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ebeaf3e..2235bc6 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
> 
>  #include <linux/types.h>
>  #include <linux/ioctl.h>
> +#include <linux/iommu.h>
> 
>  #define VFIO_API_VERSION	0
> 
> @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
>   */
>  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> 22)
> 
> +/**
> + * Supported flags:
> + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> for
> + *			nesting type IOMMUs. In @data field It takes struct
> + *			iommu_gpasid_bind_data.
> + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> table operation
> + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> + *
> + */
> +struct vfio_iommu_type1_bind {
> +	__u32		argsz;
> +	__u32		flags;
> +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> +	__u8		data[];
> +};
> +
> +#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL
> | \
> +
> 	VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> +
> +/**
> + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> + *				struct vfio_iommu_type1_bind)
> + *
> + * Manage address spaces of devices in this container. Initially a TYPE1
> + * container can only have one address space, managed with
> + * VFIO_IOMMU_MAP/UNMAP_DMA.

the last sentence seems irrelevant and more suitable in commit msg.

> + *
> + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by
> both MAP/UNMAP
> + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host)
> page
> + * tables, and BIND manages the stage-1 (guest) page tables. Other types of

Are "other types" the counterpart to VFIO_TYPE1_NESTING_IOMMU?
What are those types? I thought only NESTING_IOMMU allows two
stage translation...

> + * IOMMU may allow MAP/UNMAP and BIND to coexist, where

The first sentence said the same thing. Then what is the exact difference?

> MAP/UNMAP controls
> + * the traffics only require single stage translation while BIND controls the
> + * traffics require nesting translation. But this depends on the underlying
> + * IOMMU architecture and isn't guaranteed. Example of this is the guest
> SVA
> + * traffics, such traffics need nesting translation to gain gVA->gPA and then
> + * gPA->hPA translation.

I'm a bit confused about the content from "Other types of" onward. Is it
trying to state some exceptions/corner cases that this API cannot
resolve, or to explain the desired behavior of the API? Especially the
last example: it is worded as if it were an example of "isn't guaranteed",
but isn't guest SVA the main purpose of this API?

> + *
> + * Availability of this feature depends on the device, its bus, the underlying
> + * IOMMU and the CPU architecture.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> 
>  /*
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-03-22 12:32 ` [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE Liu, Yi L
@ 2020-03-30 12:58   ` Tian, Kevin
  2020-04-01  7:49     ` Liu, Yi L
  2020-03-31  7:56   ` Christoph Hellwig
  2020-04-02 20:24   ` Alex Williamson
  2 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30 12:58 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@linux.intel.com>
> 
> For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest
> "owns" the
> first-level/stage-1 translation structures, the host IOMMU driver has no
> knowledge of first-level/stage-1 structure cache updates unless the guest
> invalidation requests are trapped and propagated to the host.
> 
> This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to
> propagate guest
> first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU
> cache
> correctness.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> as the host IOMMU iotlb correctness are ensured.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49
> +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index a877747..937ec3f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2423,6 +2423,15 @@ static long
> vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
>  	return ret;
>  }
> 
> +static int vfio_cache_inv_fn(struct device *dev, void *data)

vfio_iommu_cache_inv_fn

> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_cache_invalidate_info *cache_inv_info =
> +		(struct iommu_cache_invalidate_info *) dc->data;
> +
> +	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
>  		}
>  		kfree(gbind_data);
>  		return ret;
> +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> +		struct vfio_iommu_type1_cache_invalidate cache_inv;
> +		u32 version;
> +		int info_size;
> +		void *cache_info;
> +		int ret;
> +
> +		minsz = offsetofend(struct
> vfio_iommu_type1_cache_invalidate,
> +				    flags);
> +
> +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (cache_inv.argsz < minsz || cache_inv.flags)
> +			return -EINVAL;
> +
> +		/* Get the version of struct iommu_cache_invalidate_info */
> +		if (copy_from_user(&version,
> +			(void __user *) (arg + minsz), sizeof(version)))
> +			return -EFAULT;
> +
> +		info_size = iommu_uapi_get_data_size(
> +					IOMMU_UAPI_CACHE_INVAL,
> version);
> +
> +		cache_info = kzalloc(info_size, GFP_KERNEL);
> +		if (!cache_info)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(cache_info,
> +			(void __user *) (arg + minsz), info_size)) {
> +			kfree(cache_info);
> +			return -EFAULT;
> +		}
> +
> +		mutex_lock(&iommu->lock);
> +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> +					    cache_info);
> +		mutex_unlock(&iommu->lock);
> +		kfree(cache_info);
> +		return ret;
>  	}
> 
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2235bc6..62ca791 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
>   */
>  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> 
> +/**
> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> + *			struct vfio_iommu_type1_cache_invalidate)
> + *
> + * Propagate guest IOMMU cache invalidation to the host. The cache
> + * invalidation information is conveyed by @cache_info, the content
> + * format would be structures defined in uapi/linux/iommu.h. User
> + * should be aware of that the struct  iommu_cache_invalidate_info
> + * has a @version field, vfio needs to parse this field before getting
> + * data from userspace.
> + *
> + * Availability of this IOCTL is after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +struct vfio_iommu_type1_cache_invalidate {
> +	__u32   argsz;
> +	__u32   flags;
> +	struct	iommu_cache_invalidate_info cache_info;
> +};
> +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE +
> 24)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> 
>  /*
> --
> 2.7.4

This patch looks good to me in general. But since there is still
a major open issue about version compatibility, I'll hold my r-b until
that issue is closed. 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs
  2020-03-22 12:32 ` [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu, Yi L
@ 2020-03-30 13:18   ` Tian, Kevin
  2020-04-01  7:51     ` Liu, Yi L
  2020-04-02 20:33   ` Alex Williamson
  1 sibling, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-30 13:18 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Sunday, March 22, 2020 8:32 PM
> 
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> are used to achieve flexible device sharing across domains (e.g. VMs).

are->is

> Also there are hardware assisted mediated pass-through solutions from
> platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-
> domain
> concept, which means mdevs are protected by an iommu domain which is
> aux-domain of its physical device. Details can be found in the KVM

"by an iommu domain which is auxiliary to the domain that the kernel
driver primarily uses for DMA API"

> presentation from Kevin Tian. IOMMU-backed equals to IOMMU-capable.
> 
> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> 
> This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> out the physical device of an IOMMU-backed mdev and then invoking
> IOMMU
> requests to IOMMU layer with the physical device and the mdev's aux
> domain info.

"and then calling into the IOMMU layer to complete the vSVA operations
on the aux domain associated with that mdev"

> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> on IOMMU-backed mdevs.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> CC: Jun Tian <jun.j.tian@intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c
> index 937ec3f..d473665 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -132,6 +132,7 @@ struct vfio_regions {
> 
>  struct domain_capsule {
>  	struct iommu_domain *domain;
> +	struct vfio_group *group;
>  	void *data;
>  };
> 
> @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct
> vfio_iommu *iommu,
>  	list_for_each_entry(d, &iommu->domain_list, next) {
>  		dc.domain = d->domain;
>  		list_for_each_entry(g, &d->group_list, next) {
> +			dc.group = g;
>  			ret = iommu_group_for_each_dev(g->iommu_group,
>  						       &dc, fn);
>  			if (ret)
> @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev,
> void *data)
>  	struct iommu_gpasid_bind_data *gbind_data =
>  		(struct iommu_gpasid_bind_data *) dc->data;
> 
> -	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +	if (dc->group->mdev_group)
> +		return iommu_sva_bind_gpasid(dc->domain,
> +			vfio_mdev_get_iommu_device(dev), gbind_data);
> +	else
> +		return iommu_sva_bind_gpasid(dc->domain,
> +						dev, gbind_data);
>  }
> 
>  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device
> *dev, void *data)
>  	struct iommu_gpasid_bind_data *gbind_data =
>  		(struct iommu_gpasid_bind_data *) dc->data;
> 
> -	return iommu_sva_unbind_gpasid(dc->domain, dev,
> +	if (dc->group->mdev_group)
> +		return iommu_sva_unbind_gpasid(dc->domain,
> +					vfio_mdev_get_iommu_device(dev),
>  					gbind_data->hpasid);
> +	else
> +		return iommu_sva_unbind_gpasid(dc->domain, dev,
> +						gbind_data->hpasid);
>  }
> 
>  /**
> @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev,
> void *data)
>  	struct iommu_cache_invalidate_info *cache_inv_info =
>  		(struct iommu_cache_invalidate_info *) dc->data;
> 
> -	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +	if (dc->group->mdev_group)
> +		return iommu_cache_invalidate(dc->domain,
> +			vfio_mdev_get_iommu_device(dev), cache_inv_info);
> +	else
> +		return iommu_cache_invalidate(dc->domain,
> +						dev, cache_inv_info);
>  }

Possibly the above could be simplified, e.g.:

static struct device *vfio_get_iommu_device(struct vfio_group *group,
					    struct device *dev)
{
	if (group->mdev_group)
		return vfio_mdev_get_iommu_device(dev);
	else
		return dev;
}

Then use it to replace the plain 'dev' in all three places.
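
For illustration only, here is a userspace sketch of how a callback
collapses once the device is resolved up front. The struct layouts and
the mock_* functions are hypothetical stand-ins so that the snippet
compiles outside the kernel; only vfio_get_iommu_device() mirrors the
helper suggested above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct device {
	struct device *iommu_device;	/* device backing an mdev, if any */
};

struct vfio_group {
	bool mdev_group;
};

/* Stand-in for vfio_mdev_get_iommu_device(): returns the backing device. */
struct device *mock_mdev_get_iommu_device(struct device *dev)
{
	return dev->iommu_device;
}

/* The suggested helper, written against the stand-in types. */
struct device *vfio_get_iommu_device(struct vfio_group *group,
				     struct device *dev)
{
	if (group->mdev_group)
		return mock_mdev_get_iommu_device(dev);
	return dev;
}

/* Stand-in for an IOMMU-layer call; it only reports the device it was given. */
struct device *mock_iommu_op(struct device *dev)
{
	return dev;
}

/* A bind/unbind/invalidate callback then needs no mdev branch of its own. */
struct device *bind_fn(struct vfio_group *group, struct device *dev)
{
	return mock_iommu_op(vfio_get_iommu_device(group, dev));
}
```

With an mdev group the call reaches the backing device; with a regular
group it reaches the device itself, so the three callbacks would differ
only in which IOMMU function they invoke.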

> 
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
> --
> 2.7.4


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-30  8:32   ` Tian, Kevin
@ 2020-03-30 14:36     ` Liu, Yi L
  2020-03-31  5:40       ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-30 14:36 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 4:32 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> 
> are->is

thanks.

> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> 
> "address space" -> "address spaces"
> 
> "binding the" -> "binding each"

will correct both.

> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> >
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> 
> "Additionally, because such"
> 
> > requests are intended to be invoked by QEMU or other applications which
> 
> simplify to "intended to be invoked from userspace"

got it.

> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> 
> devices -> device

got it.

> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio.c             | 130
> > ++++++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_type1.c | 104
> > ++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h            |  20 +++++++
> >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/vfio.h>
> >  #include <linux/wait.h>
> >  #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> >  #define DRIVER_VERSION	"0.3"
> >  #define DRIVER_AUTHOR	"Alex Williamson
> > <alex.williamson@redhat.com>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> >  	struct mutex			group_lock;
> >  	struct cdev			group_cdev;
> >  	dev_t				group_devt;
> > +	struct list_head		vfio_mm_list;
> > +	struct mutex			vfio_mm_lock;
> >  	wait_queue_head_t		release_q;
> >  } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev,
> > enum vfio_notify_type type,
> >  EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> >  /**
> > + * VFIO_MM objects - create, release, get, put, search
> 
> why capitalizing vfio_mm?

oops, it's not intended, will fix it.

> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > +	struct vfio_mm *vmm;
> > +	struct vfio_mm_token *token;
> > +	int ret = 0;
> > +
> > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > +	if (!vmm)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* Per mm IOASID set used for quota control and group operations
> > */
> > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > >ioasid_sid);
> > +	if (ret) {
> > +		kfree(vmm);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	kref_init(&vmm->kref);
> > +	token = &vmm->token;
> > +	token->val = mm;
> > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > +	mutex_init(&vmm->pasid_lock);
> > +
> > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > +
> > +	return vmm;
> > +}
> > +
> > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > +{
> > +	/* destroy the ioasid set */
> > +	ioasid_free_set(vmm->ioasid_sid, true);
> 
> do we need hold pasid lock here, since it attempts to destroy a
> set which might be referenced by vfio_mm_pasid_free? or is
> there guarantee that such race won't happen?

Hmm, regarding the race between ioasid_free_set() and
vfio_mm_pasid_free(): I would expect the ioasid core to serialize
the two operations with its internal lock, right?

> > +	mutex_unlock(&vfio.vfio_mm_lock);
> > +	kfree(vmm);
> > +}
> > +
> > +/* called with vfio.vfio_mm_lock held */
> > +static void vfio_mm_release(struct kref *kref)
> > +{
> > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > +	list_del(&vmm->vfio_next);
> > +	vfio_mm_unlock_and_free(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm)
> > +{
> > +	kref_put_mutex(&vmm->kref, vfio_mm_release,
> > &vfio.vfio_mm_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > +
> > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > +static void vfio_mm_get(struct vfio_mm *vmm)
> > +{
> > +	kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = get_task_mm(task);
> > +	struct vfio_mm *vmm;
> > +	unsigned long long val = (unsigned long long) mm;
> > +
> > +	mutex_lock(&vfio.vfio_mm_lock);
> > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > +		if (vmm->token.val == val) {
> > +			vfio_mm_get(vmm);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	vmm = vfio_create_mm(mm);
> > +	if (IS_ERR(vmm))
> > +		vmm = NULL;
> > +out:
> > +	mutex_unlock(&vfio.vfio_mm_lock);
> > +	mmput(mm);
> 
> I assume this has been discussed before, but from readability p.o.v
> it might be good to add a comment for this function to explain
> how the recording of mm in vfio_mm can be correctly removed
> when the mm is being destroyed, since we don't hold a reference
> of mm here.

yeah, I'll add it.

> > +	return vmm;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +{
> > +	ioasid_t pasid;
> > +	int ret = -ENOSPC;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +
> > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > +	if (pasid == INVALID_IOASID) {
> > +		ret = -ENOSPC;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = pasid;
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > +
> > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > +{
> > +	void *pdata;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > +	if (IS_ERR(pdata)) {
> > +		ret = PTR_ERR(pdata);
> > +		goto out_unlock;
> > +	}
> > +	ioasid_free(pasid);
> > +
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > +
> > +/**
> >   * Module/class support
> >   */
> >  static char *vfio_devnode(struct device *dev, umode_t *mode)
> > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> >  	idr_init(&vfio.group_idr);
> >  	mutex_init(&vfio.group_lock);
> >  	mutex_init(&vfio.iommu_drivers_lock);
> > +	mutex_init(&vfio.vfio_mm_lock);
> >  	INIT_LIST_HEAD(&vfio.group_list);
> >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> >  	init_waitqueue_head(&vfio.release_q);
> >
> >  	ret = misc_register(&vfio_dev);
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index a177bf2..331ceee 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -70,6 +70,7 @@ struct vfio_iommu {
> >  	unsigned int		dma_avail;
> >  	bool			v2;
> >  	bool			nesting;
> > +	struct vfio_mm		*vmm;
> >  };
> >
> >  struct vfio_domain {
> > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void
> > *iommu_data,
> >  static void *vfio_iommu_type1_open(unsigned long arg)
> >  {
> >  	struct vfio_iommu *iommu;
> > +	struct vfio_mm *vmm = NULL;
> >
> >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >  	if (!iommu)
> > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned
> > long arg)
> >  	iommu->dma_avail = dma_entry_limit;
> >  	mutex_init(&iommu->lock);
> >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > +	vmm = vfio_mm_get_from_task(current);
> > +	if (!vmm)
> > +		pr_err("Failed to get vfio_mm track\n");
> 
> I assume error should be returned when pr_err is used...

Got it. I didn't return an error because I don't think vfio_mm is
necessary for every iommu open; it is only needed for the nesting
type iommu. I'll fetch the vmm only when opening a nesting type
iommu and return an error if that fails.

> > +	iommu->vmm = vmm;
> >
> >  	return iommu;
> >  }
> > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > *iommu_data)
> >  	}
> >
> >  	vfio_iommu_iova_free(&iommu->iova_list);
> > +	if (iommu->vmm)
> > +		vfio_mm_put(iommu->vmm);
> >
> >  	kfree(iommu);
> >  }
> > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> 
> I don't think you need prefix "vfio_iommu_type1" for every new
> function here, especially for leaf internal function as this one.

got it. thanks.

> > +{
> > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > +		  flags & VFIO_IOMMU_PASID_FREE));
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > +					 int min,
> > +					 int max)
> > +{
> > +	struct vfio_mm *vmm = iommu->vmm;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EFAULT;
> 
> why -EFAULT?

well, it's from a prior comment as below:
  vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
  differentiate the errors. We could use EFAULT for the no IOMMU case
  and EINVAL here?
http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html

> 
> > +		goto out_unlock;
> > +	}
> > +	if (vmm)
> > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > +	else
> > +		ret = -EINVAL;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > +				       unsigned int pasid)
> > +{
> > +	struct vfio_mm *vmm = iommu->vmm;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EFAULT;
> 
> ditto
> 
> > +		goto out_unlock;
> > +	}
> > +
> > +	if (vmm)
> > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > +	else
> > +		ret = -EINVAL;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > *iommu_data,
> >
> >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >  			-EFAULT : 0;
> > +
> > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > +		struct vfio_iommu_type1_pasid_request req;
> > +		unsigned long offset;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > +				    flags);
> > +
> > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (req.argsz < minsz ||
> > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > +			return -EINVAL;
> > +
> > +		if (copy_from_user((void *)&req + minsz,
> > +				   (void __user *)arg + minsz,
> > +				   sizeof(req) - minsz))
> > +			return -EFAULT;
> 
> why copying in two steps instead of copying them together?

I just wanted to do the sanity check before copying all the data. I
can merge it into one copy if that's better. :-)
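
A rough userspace sketch of the one-copy variant, for comparison. The
struct and flag names are shortened copies of the ones in the patch,
and mock_copy_from_user() is a hypothetical stand-in that treats a NULL
source as a faulting user pointer. Note the trade-off it assumes: the
whole struct is read before argsz is checked, so userspace must always
pass a full-sized buffer.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PASID_ALLOC		(1u << 0)
#define PASID_FREE		(1u << 1)
#define PASID_REQUEST_MASK	(PASID_ALLOC | PASID_FREE)

struct pasid_request {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct {
			uint32_t min;
			uint32_t max;
			uint32_t result;
		} alloc_pasid;
		uint32_t free_pasid;
	};
};

/* Stand-in for copy_from_user(): returns the number of bytes NOT copied. */
unsigned long mock_copy_from_user(void *dst, const void *src, size_t n)
{
	if (!src)
		return n;
	memcpy(dst, src, n);
	return 0;
}

/* Same rule as in the patch: no unknown bits, and not ALLOC and FREE together. */
int pasid_req_valid(uint32_t flags)
{
	return !((flags & ~PASID_REQUEST_MASK) ||
		 ((flags & PASID_ALLOC) && (flags & PASID_FREE)));
}

/* Copy once, then validate argsz and flags on the kernel-side copy. */
int fetch_pasid_request(struct pasid_request *req, const void *uarg)
{
	size_t minsz = offsetof(struct pasid_request, flags) +
		       sizeof(req->flags);

	if (mock_copy_from_user(req, uarg, sizeof(*req)))
		return -EFAULT;
	if (req->argsz < minsz || !pasid_req_valid(req->flags))
		return -EINVAL;
	return 0;
}
```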

> > +
> > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > +		case VFIO_IOMMU_PASID_ALLOC:
> > +		{
> > +			int ret = 0, result;
> > +
> > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > +							req.alloc_pasid.min,
> > +							req.alloc_pasid.max);
> > +			if (result > 0) {
> > +				offset = offsetof(
> > +					struct
> > vfio_iommu_type1_pasid_request,
> > +					alloc_pasid.result);
> > +				ret = copy_to_user(
> > +					      (void __user *) (arg + offset),
> > +					      &result, sizeof(result));
> > +			} else {
> > +				pr_debug("%s: PASID alloc failed\n",
> > __func__);
> > +				ret = -EFAULT;
> 
> no, this branch is not for copy_to_user error. it is about pasid alloc
> failure. you should handle both.

Hmm, I just want to fail the ioctl in such a case, so the @result field
is meaningless on the user side. How about using another return value
(e.g. -ENOSPC) to indicate the ioctl failure?
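
To make the two failure cases concrete, a small userspace sketch of
what handling both could look like. report_pasid_alloc() and
mock_copy_to_user() are hypothetical names; alloc_ret models the return
value of vfio_iommu_type1_pasid_alloc(). Mapping a result of 0 to
-ENOSPC is an assumption here (the patch only treats result > 0 as
success).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for copy_to_user(): returns the number of bytes NOT copied. */
unsigned long mock_copy_to_user(void *dst, const void *src, size_t n)
{
	if (!dst)
		return n;
	memcpy(dst, src, n);
	return 0;
}

/*
 * alloc_ret is a positive PASID on success or a negative errno on
 * failure. Allocation errors are passed through unchanged; -EFAULT is
 * reserved for a failed copy back to userspace.
 */
int report_pasid_alloc(int alloc_ret, uint32_t *user_result)
{
	uint32_t pasid;

	if (alloc_ret <= 0)
		return alloc_ret < 0 ? alloc_ret : -ENOSPC;

	pasid = (uint32_t)alloc_ret;
	if (mock_copy_to_user(user_result, &pasid, sizeof(pasid)))
		return -EFAULT;

	return 0;
}
```

This way userspace can tell "no PASID available" (-ENOSPC) and "bad
container state" (-EINVAL and friends) apart from a bad result pointer.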

> > +			}
> > +			return ret;
> > +		}
> > +		case VFIO_IOMMU_PASID_FREE:
> > +			return vfio_iommu_type1_pasid_free(iommu,
> > +							   req.free_pasid);
> > +		default:
> > +			return -EINVAL;
> > +		}
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711..75f9f7f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > vfio_iommu_driver_ops *ops);
> >  extern void vfio_unregister_iommu_driver(
> >  				const struct vfio_iommu_driver_ops *ops);
> >
> > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > +struct vfio_mm_token {
> > +	unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > +	struct kref			kref;
> > +	struct vfio_mm_token		token;
> > +	int				ioasid_sid;
> > +	/* protect @pasid_quota field and pasid allocation/free */
> > +	struct mutex			pasid_lock;
> > +	int				pasid_quota;
> > +	struct list_head		vfio_next;
> > +};
> > +
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > +
> >  /*
> >   * External user API
> >   */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..298ac80 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/*
> > + * PASID (Process Address Space ID) is a PCIe concept which
> > + * has been extended to support DMA isolation in fine-grain.
> > + * With device assigned to user space (e.g. VMs), PASID alloc
> > + * and free need to be system wide. This structure defines
> > + * the info for pasid alloc/free between user space and kernel
> > + * space.
> > + *
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > +	__u32	argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > +	__u32	flags;
> > +	union {
> > +		struct {
> > +			__u32 min;
> > +			__u32 max;
> > +			__u32 result;
> 
> result->pasid?

yes, the pasid allocated.

> 
> > +		} alloc_pasid;
> > +		__u32 free_pasid;
> 
> what about putting a common pasid field after flags?

Looks good to me. But it would make the union part meaningful only
for alloc pasid. If so, maybe turn the union part into a data field
that only alloc pasid uses. For free pasid, it is not necessary
to read it from userspace. Does that look good?
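
One possible reading of that layout, as a sketch only. The names
pasid_request_alt, range and pasid_request_minsz() are hypothetical,
and plain uint32_t stands in for __u32 so it compiles in userspace.

```c
#include <stddef.h>
#include <stdint.h>

#define PASID_ALLOC	(1u << 0)
#define PASID_FREE	(1u << 1)

struct pasid_request_alt {
	uint32_t argsz;
	uint32_t flags;
	uint32_t pasid;		/* ALLOC: PASID handed back; FREE: PASID to release */
	struct {
		uint32_t min;
		uint32_t max;
	} range;		/* only read for ALLOC */
};

/* Number of bytes that have to be read from userspace for a given request. */
size_t pasid_request_minsz(uint32_t flags)
{
	if (flags & PASID_ALLOC)
		return sizeof(struct pasid_request_alt);
	/* FREE stops after the common pasid field. */
	return offsetof(struct pasid_request_alt, range);
}
```

With five 32-bit fields there is no padding, so a free request needs
12 bytes and an alloc request needs 20.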

Regards,
Yi Liu

> > +	};
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
> > +					 VFIO_IOMMU_PASID_FREE)
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + *				struct vfio_iommu_type1_pasid_request)
> > + *
> > + * Availability of this feature depends on PASID support in the device,
> > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > + * is available after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> > 22)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*
> > --
> > 2.7.4


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-30 14:36     ` Liu, Yi L
@ 2020-03-31  5:40       ` Tian, Kevin
  2020-03-31 13:22         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-03-31  5:40 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, March 30, 2020 10:37 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Monday, March 30, 2020 4:32 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> >
> > are->is
> 
> thanks.
> 
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the
> PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> >
> > "address space" -> "address spaces"
> >
> > "binding the" -> "binding each"
> 
> will correct both.
> 
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model,
> which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.
> > >
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > passdown
> > > PASID allocation/free request from the virtual IOMMU. Additionally, such
> >
> > "Additionally, because such"
> >
> > > requests are intended to be invoked by QEMU or other applications
> which
> >
> > simplify to "intended to be invoked from userspace"
> 
> got it.
> 
> > > are running in userspace, it is necessary to have a mechanism to prevent
> > > single application from abusing available PASIDs in system. With such
> > > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > > has many assigned devices, then it should have more quota. However, it
> > > is not sure how many PASIDs an assigned devices will use. e.g. it is
> >
> > devices -> device
> 
> got it.
> 
> > > possible that a VM with multiples assigned devices but requests less
> > > PASIDs. Therefore per-VM quota would be better.
> > >
> > > This patch uses struct mm pointer as a per-VM token. We also considered
> > > using task structure pointer and vfio_iommu structure pointer. However,
> > > task structure is per-thread, which means it cannot achieve per-VM PASID
> > > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > > only within vfio. Therefore, structure mm pointer is selected. This patch
> > > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > > container is opened by a VM. On the reverse order, vfio_mm is free when
> > > the last vfio container is released. Each VM is assigned with a PASID
> > > quota, so that it is not able to request PASID beyond its quota. This
> > > patch adds a default quota of 1000. This quota could be tuned by
> > > administrator. Making PASID quota tunable will be added in another
> patch
> > > in this series.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio.c             | 130
> > > ++++++++++++++++++++++++++++++++++++++++
> > >  drivers/vfio/vfio_iommu_type1.c | 104
> > > ++++++++++++++++++++++++++++++++
> > >  include/linux/vfio.h            |  20 +++++++
> > >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> > >  4 files changed, 295 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index c848262..d13b483 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -32,6 +32,7 @@
> > >  #include <linux/vfio.h>
> > >  #include <linux/wait.h>
> > >  #include <linux/sched/signal.h>
> > > +#include <linux/sched/mm.h>
> > >
> > >  #define DRIVER_VERSION	"0.3"
> > >  #define DRIVER_AUTHOR	"Alex Williamson
> > > <alex.williamson@redhat.com>"
> > > @@ -46,6 +47,8 @@ static struct vfio {
> > >  	struct mutex			group_lock;
> > >  	struct cdev			group_cdev;
> > >  	dev_t				group_devt;
> > > +	struct list_head		vfio_mm_list;
> > > +	struct mutex			vfio_mm_lock;
> > >  	wait_queue_head_t		release_q;
> > >  } vfio;
> > >
> > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> *dev,
> > > enum vfio_notify_type type,
> > >  EXPORT_SYMBOL(vfio_unregister_notifier);
> > >
> > >  /**
> > > + * VFIO_MM objects - create, release, get, put, search
> >
> > why capitalizing vfio_mm?
> 
> oops, it's not intended, will fix it.
> 
> > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > + */
> > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > +{
> > > +	struct vfio_mm *vmm;
> > > +	struct vfio_mm_token *token;
> > > +	int ret = 0;
> > > +
> > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > +	if (!vmm)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	/* Per mm IOASID set used for quota control and group operations
> > > */
> > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > >ioasid_sid);
> > > +	if (ret) {
> > > +		kfree(vmm);
> > > +		return ERR_PTR(ret);
> > > +	}
> > > +
> > > +	kref_init(&vmm->kref);
> > > +	token = &vmm->token;
> > > +	token->val = mm;
> > > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > +	mutex_init(&vmm->pasid_lock);
> > > +
> > > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > +
> > > +	return vmm;
> > > +}
> > > +
> > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > > +{
> > > +	/* destroy the ioasid set */
> > > +	ioasid_free_set(vmm->ioasid_sid, true);
> >
> > do we need hold pasid lock here, since it attempts to destroy a
> > set which might be referenced by vfio_mm_pasid_free? or is
> > there guarantee that such race won't happen?
> 
> Hmm, regarding the race between ioasid_free_set() and
> vfio_mm_pasid_free(): I would expect the ioasid core to serialize
> the two operations with its internal lock, right?

I looked at the code below in the free path:

+	mutex_lock(&vmm->pasid_lock);
+	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
+	if (IS_ERR(pdata)) {
+		ret = PTR_ERR(pdata);
+		goto out_unlock;
+	}
+	ioasid_free(pasid);
+
+out_unlock:
+	mutex_unlock(&vmm->pasid_lock);

The above implies that ioasid_find/free must be paired within the
pasid_lock. So if we don't hold pasid_lock in
vfio_mm_unlock_and_free(), ioasid_free_set() could happen between
find and free. I'm not sure whether this race would lead to a real
problem, but it doesn't look correct judging from this file alone.
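
To show the concern in miniature, a userspace model with a pthread
mutex in place of pasid_lock. Everything here (struct mini_mm and the
mini_* functions) is hypothetical and much simpler than the ioasid
core; the only point is that set destruction takes the same lock as
the find/free pair, so it cannot land between the two.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_IDS 8

/* Hypothetical miniature of vfio_mm: a lock plus a tiny ID set. */
struct mini_mm {
	pthread_mutex_t pasid_lock;
	bool set_alive;
	bool in_use[MAX_IDS];
};

void mini_mm_init(struct mini_mm *m)
{
	pthread_mutex_init(&m->pasid_lock, NULL);
	m->set_alive = true;
	for (int i = 0; i < MAX_IDS; i++)
		m->in_use[i] = false;
}

/* Models vfio_mm_pasid_free(): find and free under one lock hold. */
int mini_pasid_free(struct mini_mm *m, unsigned int id)
{
	int ret = 0;

	pthread_mutex_lock(&m->pasid_lock);
	if (!m->set_alive || id >= MAX_IDS || !m->in_use[id])
		ret = -ENOENT;		/* "find" failed */
	else
		m->in_use[id] = false;	/* "free" */
	pthread_mutex_unlock(&m->pasid_lock);
	return ret;
}

/* Models set destruction done under the same lock as find/free. */
void mini_set_destroy(struct mini_mm *m)
{
	pthread_mutex_lock(&m->pasid_lock);
	m->set_alive = false;
	for (int i = 0; i < MAX_IDS; i++)
		m->in_use[i] = false;
	pthread_mutex_unlock(&m->pasid_lock);
}
```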

> 
> > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > +	kfree(vmm);
> > > +}
> > > +
> > > +/* called with vfio.vfio_mm_lock held */
> > > +static void vfio_mm_release(struct kref *kref)
> > > +{
> > > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > +
> > > +	list_del(&vmm->vfio_next);
> > > +	vfio_mm_unlock_and_free(vmm);
> > > +}
> > > +
> > > +void vfio_mm_put(struct vfio_mm *vmm)
> > > +{
> > > +	kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > &vfio.vfio_mm_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > +
> > > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > > +static void vfio_mm_get(struct vfio_mm *vmm)
> > > +{
> > > +	kref_get(&vmm->kref);
> > > +}
> > > +
> > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > > +{
> > > +	struct mm_struct *mm = get_task_mm(task);
> > > +	struct vfio_mm *vmm;
> > > +	unsigned long long val = (unsigned long long) mm;
> > > +
> > > +	mutex_lock(&vfio.vfio_mm_lock);
> > > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > +		if (vmm->token.val == val) {
> > > +			vfio_mm_get(vmm);
> > > +			goto out;
> > > +		}
> > > +	}
> > > +
> > > +	vmm = vfio_create_mm(mm);
> > > +	if (IS_ERR(vmm))
> > > +		vmm = NULL;
> > > +out:
> > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > +	mmput(mm);
> >
> > I assume this has been discussed before, but from readability p.o.v
> > it might be good to add a comment for this function to explain
> > how the recording of mm in vfio_mm can be correctly removed
> > when the mm is being destroyed, since we don't hold a reference
> > of mm here.
> 
> yeah, I'll add it.
> 
> > > +	return vmm;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > +
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +{
> > > +	ioasid_t pasid;
> > > +	int ret = -ENOSPC;
> > > +
> > > +	mutex_lock(&vmm->pasid_lock);
> > > +
> > > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > +	if (pasid == INVALID_IOASID) {
> > > +		ret = -ENOSPC;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = pasid;
> > > +out_unlock:
> > > +	mutex_unlock(&vmm->pasid_lock);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > +
> > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > > +{
> > > +	void *pdata;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&vmm->pasid_lock);
> > > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > +	if (IS_ERR(pdata)) {
> > > +		ret = PTR_ERR(pdata);
> > > +		goto out_unlock;
> > > +	}
> > > +	ioasid_free(pasid);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&vmm->pasid_lock);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > +
> > > +/**
> > >   * Module/class support
> > >   */
> > >  static char *vfio_devnode(struct device *dev, umode_t *mode)
> > > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > >  	idr_init(&vfio.group_idr);
> > >  	mutex_init(&vfio.group_lock);
> > >  	mutex_init(&vfio.iommu_drivers_lock);
> > > +	mutex_init(&vfio.vfio_mm_lock);
> > >  	INIT_LIST_HEAD(&vfio.group_list);
> > >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > >  	init_waitqueue_head(&vfio.release_q);
> > >
> > >  	ret = misc_register(&vfio_dev);
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index a177bf2..331ceee 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > >  	unsigned int		dma_avail;
> > >  	bool			v2;
> > >  	bool			nesting;
> > > +	struct vfio_mm		*vmm;
> > >  };
> > >
> > >  struct vfio_domain {
> > > @@ -2018,6 +2019,7 @@ static void
> vfio_iommu_type1_detach_group(void
> > > *iommu_data,
> > >  static void *vfio_iommu_type1_open(unsigned long arg)
> > >  {
> > >  	struct vfio_iommu *iommu;
> > > +	struct vfio_mm *vmm = NULL;
> > >
> > >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > >  	if (!iommu)
> > > @@ -2043,6 +2045,10 @@ static void
> *vfio_iommu_type1_open(unsigned
> > > long arg)
> > >  	iommu->dma_avail = dma_entry_limit;
> > >  	mutex_init(&iommu->lock);
> > >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > +	vmm = vfio_mm_get_from_task(current);
> > > +	if (!vmm)
> > > +		pr_err("Failed to get vfio_mm track\n");
> >
> > I assume error should be returned when pr_err is used...
> 
> got it. I didn't do it as I don’t think vfio_mm is necessary for
> every iommu open. It is necessary for the nesting type iommu. I'll
> make it fetch vmm when it is opening nesting type and return error
> if failed.

sounds good.

> 
> > > +	iommu->vmm = vmm;
> > >
> > >  	return iommu;
> > >  }
> > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > *iommu_data)
> > >  	}
> > >
> > >  	vfio_iommu_iova_free(&iommu->iova_list);
> > > +	if (iommu->vmm)
> > > +		vfio_mm_put(iommu->vmm);
> > >
> > >  	kfree(iommu);
> > >  }
> > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > > vfio_iommu *iommu,
> > >  	return ret;
> > >  }
> > >
> > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> >
> > I don't think you need prefix "vfio_iommu_type1" for every new
> > function here, especially for leaf internal function as this one.
> 
> got it. thanks.
> 
> > > +{
> > > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > +		  flags & VFIO_IOMMU_PASID_FREE));
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > +					 int min,
> > > +					 int max)
> > > +{
> > > +	struct vfio_mm *vmm = iommu->vmm;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EFAULT;
> >
> > why -EFAULT?
> 
> well, it's from a prior comment as below:
>   vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
>   differentiate the errors. We could use EFAULT for the no IOMMU case
>   and EINVAL here?
> http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
> 
> >
> > > +		goto out_unlock;
> > > +	}
> > > +	if (vmm)
> > > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > +	else
> > > +		ret = -EINVAL;
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > +				       unsigned int pasid)
> > > +{
> > > +	struct vfio_mm *vmm = iommu->vmm;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EFAULT;
> >
> > ditto
> >
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	if (vmm)
> > > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > > +	else
> > > +		ret = -EINVAL;
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > >
> > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > >  			-EFAULT : 0;
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > +		struct vfio_iommu_type1_pasid_request req;
> > > +		unsigned long offset;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > +				    flags);
> > > +
> > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (req.argsz < minsz ||
> > > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > +			return -EINVAL;
> > > +
> > > +		if (copy_from_user((void *)&req + minsz,
> > > +				   (void __user *)arg + minsz,
> > > +				   sizeof(req) - minsz))
> > > +			return -EFAULT;
> >
> > why copying in two steps instead of copying them together?
> 
> just want to do sanity check before copying all the data. I
> can move it as one copy if it's better. :-)

it's probably fine. I just saw you did the same thing for other uapis.

> 
> > > +
> > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > +		{
> > > +			int ret = 0, result;
> > > +
> > > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > > +							req.alloc_pasid.min,
> > > +							req.alloc_pasid.max);
> > > +			if (result > 0) {
> > > +				offset = offsetof(
> > > +					struct
> > > vfio_iommu_type1_pasid_request,
> > > +					alloc_pasid.result);
> > > +				ret = copy_to_user(
> > > +					      (void __user *) (arg + offset),
> > > +					      &result, sizeof(result));
> > > +			} else {
> > > +				pr_debug("%s: PASID alloc failed\n",
> > > __func__);
> > > +				ret = -EFAULT;
> >
> > no, this branch is not for copy_to_user error. it is about pasid alloc
> > failure. you should handle both.
> 
> Emmm, I just want to fail the IOCTL in such case, so the @result field
> is meaningless in the user side. How about using another return value
> (e.g. ENOSPC) to indicate the IOCTL failure?

If pasid_alloc fails, return its error code to userspace;
if copy_to_user fails, return -EFAULT.

However, the code above returns -EFAULT for a pasid_alloc failure, and
the number of not-copied bytes for a copy_to_user failure.
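
The distinction above can be sketched in plain userspace C. This is only a minimal model under stated assumptions: model_copy_to_user(), model_pasid_alloc() and model_handle_alloc() are hypothetical stand-ins, not kernel APIs, and the allocator simply hands out @min. It shows an allocator failure passing its own errno through, and a nonzero "bytes not copied" count being turned into -EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): returns the number of bytes
 * NOT copied (0 on success). A NULL destination models a bad user pointer.
 */
static unsigned long model_copy_to_user(void *dst, const void *src, size_t n)
{
	if (!dst)
		return n;
	memcpy(dst, src, n);
	return 0;
}

/* Hypothetical allocator: hands out @min, or a negative errno on failure. */
static int model_pasid_alloc(int min, int max, int quota_left)
{
	if (min > max)
		return -EINVAL;
	if (quota_left <= 0)
		return -ENOSPC;
	return min;
}

/* Returns 0 on success, otherwise a negative errno; never a byte count. */
static int model_handle_alloc(int min, int max, int quota_left, int *uresult)
{
	int pasid = model_pasid_alloc(min, max, quota_left);

	if (pasid < 0)
		return pasid;		/* allocator failure: pass its errno through */

	if (model_copy_to_user(uresult, &pasid, sizeof(pasid)))
		return -EFAULT;		/* copy failure: fixed errno, not "bytes left" */

	return 0;
}

/* Usage: the three outcomes stay distinguishable for the caller. */
static void model_alloc_selfcheck(void)
{
	int out = -1;

	assert(model_handle_alloc(1, 10, 5, &out) == 0 && out == 1);
	assert(model_handle_alloc(1, 10, 0, &out) == -ENOSPC);
	assert(model_handle_alloc(1, 10, 5, NULL) == -EFAULT);
}
```

A real handler would presumably also release the just-allocated PASID on the -EFAULT path; that is left out of this model.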

> 
> > > +			}
> > > +			return ret;
> > > +		}
> > > +		case VFIO_IOMMU_PASID_FREE:
> > > +			return vfio_iommu_type1_pasid_free(iommu,
> > > +							   req.free_pasid);
> > > +		default:
> > > +			return -EINVAL;
> > > +		}
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index e42a711..75f9f7f1 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > > vfio_iommu_driver_ops *ops);
> > >  extern void vfio_unregister_iommu_driver(
> > >  				const struct vfio_iommu_driver_ops *ops);
> > >
> > > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > > +struct vfio_mm_token {
> > > +	unsigned long long val;
> > > +};
> > > +
> > > +struct vfio_mm {
> > > +	struct kref			kref;
> > > +	struct vfio_mm_token		token;
> > > +	int				ioasid_sid;
> > > +	/* protect @pasid_quota field and pasid allocation/free */
> > > +	struct mutex			pasid_lock;
> > > +	int				pasid_quota;
> > > +	struct list_head		vfio_next;
> > > +};
> > > +
> > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> *task);
> > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > +
> > >  /*
> > >   * External user API
> > >   */
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 9e843a1..298ac80 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > >
> > > +/*
> > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > + * has been extended to support DMA isolation in fine-grain.
> > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > + * and free need to be system wide. This structure defines
> > > + * the info for pasid alloc/free between user space and kernel
> > > + * space.
> > > + *
> > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > + */
> > > +struct vfio_iommu_type1_pasid_request {
> > > +	__u32	argsz;
> > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > +	__u32	flags;
> > > +	union {
> > > +		struct {
> > > +			__u32 min;
> > > +			__u32 max;
> > > +			__u32 result;
> >
> > result->pasid?
> 
> yes, the pasid allocated.
> 
> >
> > > +		} alloc_pasid;
> > > +		__u32 free_pasid;
> >
> > what about putting a common pasid field after flags?
> 
> looks good to me. But it would make the union part only meaningful
> to alloc pasid. if so, maybe make the union part as a data field and
> only alloc pasid will have it. For free pasid, it is not necessary
> to read it from userspace. does it look good?

maybe keeping the union is also OK, just with {min, max} for alloc.
who knows whether more pasid ops will be added in the future
which may require its specific union structure. 😊 putting pasid
as a common field is reasonable because the whole cmd is for
pasid.
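
One possible shape of that layout, as a rough userspace sketch only. The struct name, the SKETCH_* flag names, and the rule that exactly one operation bit must be set are assumptions made for illustration; the actual uapi would be whatever the next revision of the patch defines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PASID_ALLOC	(1u << 0)
#define SKETCH_PASID_FREE	(1u << 1)

/* Hypothetical layout: a common pasid field, then per-operation data. */
struct pasid_request_sketch {
	uint32_t argsz;
	uint32_t flags;
	uint32_t pasid;		/* common: output for ALLOC, input for FREE */
	union {			/* per-operation data; room for future ops */
		struct {
			uint32_t min;
			uint32_t max;
		} alloc;
	};
};

/* Assumed rule: exactly one operation bit set, and no unknown bits. */
static int sketch_flags_valid(uint32_t flags)
{
	uint32_t op = flags & (SKETCH_PASID_ALLOC | SKETCH_PASID_FREE);

	if (flags & ~(SKETCH_PASID_ALLOC | SKETCH_PASID_FREE))
		return 0;
	return op == SKETCH_PASID_ALLOC || op == SKETCH_PASID_FREE;
}

/* Usage: the common field sits right after flags, the union follows it. */
static void sketch_layout_selfcheck(void)
{
	assert(offsetof(struct pasid_request_sketch, pasid) == 8);
	assert(offsetof(struct pasid_request_sketch, alloc) == 12);
	assert(sizeof(struct pasid_request_sketch) == 20);
	assert(sketch_flags_valid(SKETCH_PASID_FREE));
	assert(!sketch_flags_valid(SKETCH_PASID_ALLOC | SKETCH_PASID_FREE));
}
```

With this shape a FREE request only needs the header plus pasid, while ALLOC additionally reads {min, max} from the union.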

> 
> Regards,
> Yi Liu
> 
> > > +	};
> > > +};
> > > +
> > > +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC
> | \
> > > +					 VFIO_IOMMU_PASID_FREE)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > > + *				struct vfio_iommu_type1_pasid_request)
> > > + *
> > > + * Availability of this feature depends on PASID support in the device,
> > > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > > + * is available after VFIO_SET_IOMMU.
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> > > 22)
> > > +
> > >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU --------
> */
> > >
> > >  /*
> > > --
> > > 2.7.4


* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
  2020-03-22 16:21   ` kbuild test robot
  2020-03-30  8:32   ` Tian, Kevin
@ 2020-03-31  7:53   ` Christoph Hellwig
  2020-03-31  8:17     ` Liu, Yi L
  2020-03-31  8:32     ` Liu, Yi L
  2020-04-02 13:52   ` Jean-Philippe Brucker
  2020-04-02 17:50   ` Alex Williamson
  4 siblings, 2 replies; 110+ messages in thread
From: Christoph Hellwig @ 2020-03-31  7:53 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, eric.auger, jean-philippe, kevin.tian,
	ashok.raj, kvm, jun.j.tian, iommu, linux-kernel, yi.y.sun,
	hao.wu

Who is going to use these exports?  Please submit them together with
a driver actually using them.

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-03-22 12:32 ` [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE Liu, Yi L
  2020-03-30 12:58   ` Tian, Kevin
@ 2020-03-31  7:56   ` Christoph Hellwig
  2020-03-31 10:48     ` Liu, Yi L
  2020-04-02 20:24   ` Alex Williamson
  2 siblings, 1 reply; 110+ messages in thread
From: Christoph Hellwig @ 2020-03-31  7:56 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, eric.auger, jean-philippe, kevin.tian,
	ashok.raj, kvm, jun.j.tian, iommu, linux-kernel, yi.y.sun,
	hao.wu

> @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		}
>  		kfree(gbind_data);
>  		return ret;
> +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {

Please refactor the spaghetti in this ioctl handler to use a switch
statement and a helper function per command before growing it even more.
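
The requested structure could look roughly like the userspace sketch below. All names here (the sketch_* helpers, the command numbers, struct sketch_iommu) are hypothetical and only illustrate the shape: one small helper per command, and a dispatcher that does nothing but select the helper.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical command numbers and state, for illustration only. */
enum sketch_cmd {
	SKETCH_MAP_DMA = 1,
	SKETCH_UNMAP_DMA,
	SKETCH_CACHE_INVALIDATE,
};

struct sketch_iommu {
	int mapped;
	int invalidations;
};

static long sketch_map_dma(struct sketch_iommu *iommu, unsigned long arg)
{
	if (!arg)
		return -EINVAL;
	iommu->mapped++;
	return 0;
}

static long sketch_unmap_dma(struct sketch_iommu *iommu, unsigned long arg)
{
	(void)arg;
	if (!iommu->mapped)
		return -ENOENT;
	iommu->mapped--;
	return 0;
}

static long sketch_cache_invalidate(struct sketch_iommu *iommu,
				    unsigned long arg)
{
	(void)arg;
	iommu->invalidations++;
	return 0;
}

/* The handler only dispatches; each command's logic lives in its helper. */
static long sketch_ioctl(struct sketch_iommu *iommu, unsigned int cmd,
			 unsigned long arg)
{
	switch (cmd) {
	case SKETCH_MAP_DMA:
		return sketch_map_dma(iommu, arg);
	case SKETCH_UNMAP_DMA:
		return sketch_unmap_dma(iommu, arg);
	case SKETCH_CACHE_INVALIDATE:
		return sketch_cache_invalidate(iommu, arg);
	default:
		return -ENOTTY;
	}
}

/* Usage: known commands reach their helper, unknown ones get -ENOTTY. */
static void sketch_ioctl_selfcheck(void)
{
	struct sketch_iommu s = { 0, 0 };

	assert(sketch_ioctl(&s, SKETCH_MAP_DMA, 1) == 0 && s.mapped == 1);
	assert(sketch_ioctl(&s, SKETCH_CACHE_INVALIDATE, 0) == 0);
	assert(sketch_ioctl(&s, 99, 0) == -ENOTTY);
}
```

Adding a command then means adding one helper and one case label, without touching the bodies of the existing commands.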

> +		/* Get the version of struct iommu_cache_invalidate_info */
> +		if (copy_from_user(&version,
> +			(void __user *) (arg + minsz), sizeof(version)))
> +			return -EFAULT;
> +
> +		info_size = iommu_uapi_get_data_size(
> +					IOMMU_UAPI_CACHE_INVAL, version);
> +
> +		cache_info = kzalloc(info_size, GFP_KERNEL);
> +		if (!cache_info)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(cache_info,
> +			(void __user *) (arg + minsz), info_size)) {

The user might have changed the version while you were allocating and
freeing the memory, introducing potentially exploitable race
conditions.
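
The double-fetch hazard can be modelled in userspace as below. This is a sketch under assumptions, not the patch's code: struct sketch_inv_info, the size table and the hook that rewrites "user" memory between the two reads are all hypothetical. It shows one common mitigation, which is to pin the version in the kernel copy to the value that was validated and used for sizing; rejecting the request on a mismatch would be another option.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout; NOT the real struct iommu_cache_invalidate_info. */
struct sketch_inv_info {
	uint32_t version;
	uint32_t payload[4];
};

/* Hypothetical size table: v1 carries two payload words, v2 four. */
static size_t sketch_info_size(uint32_t version)
{
	switch (version) {
	case 1:
		return offsetof(struct sketch_inv_info, payload) +
		       2 * sizeof(uint32_t);
	case 2:
		return sizeof(struct sketch_inv_info);
	default:
		return 0;
	}
}

/* Lets a caller modify "user" memory between the two fetches. */
typedef void (*sketch_race_hook)(void *umem);

static int sketch_fetch_info(void *umem, sketch_race_hook hook,
			     struct sketch_inv_info *out)
{
	uint32_t version;
	size_t size;

	memcpy(&version, umem, sizeof(version));	/* fetch 1: version only */
	size = sketch_info_size(version);
	if (!size)
		return -EINVAL;

	if (hook)
		hook(umem);				/* concurrent user write */

	memset(out, 0, sizeof(*out));
	memcpy(out, umem, size);			/* fetch 2: whole record */

	/*
	 * Fetch 2 re-read the version word. Pin it to the value that was
	 * validated and used for sizing, so later parsing cannot follow a
	 * version that was never checked.
	 */
	out->version = version;
	return 0;
}

/* Models the racing writer: bumps the version after it was validated. */
static void sketch_flip_version(void *umem)
{
	uint32_t v2 = 2;

	memcpy(umem, &v2, sizeof(v2));
}

static void sketch_fetch_selfcheck(void)
{
	struct sketch_inv_info user = { 1, { 10, 11, 12, 13 } };
	struct sketch_inv_info kern;

	assert(sketch_fetch_info(&user, sketch_flip_version, &kern) == 0);
	assert(kern.version == 1);	/* validated value survives the race */
	assert(kern.payload[1] == 11);
	assert(kern.payload[2] == 0);	/* beyond the v1 size: never copied */
}
```

Without the final assignment, the kernel copy would carry version 2 while only a version 1 sized record had been copied.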

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-31  7:53   ` Christoph Hellwig
@ 2020-03-31  8:17     ` Liu, Yi L
  2020-03-31  8:32     ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-31  8:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: alex.williamson, eric.auger, jean-philippe, Tian, Kevin, Raj,
	Ashok, kvm, Tian, Jun J, iommu, linux-kernel, Sun, Yi Y, Wu, Hao

> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, March 31, 2020 3:54 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> Who is going to use thse exports?  Please submit them together with
> a driver actually using them.

Hi Hellwig,

These are exposed for SVA (Shared Virtual Addressing) usage in VMs. The
driver that actually uses them is the iommu driver running in the
guest. The flow is: the guest iommu driver programs the virtual command
interface, which traps to the host. The virtual IOMMU device model in QEMU
then uses the exported ioctl to get PASIDs.
Here is the iommu kernel driver patch that uses the virtual command interface
to request pasid alloc/free:
https://lkml.org/lkml/2020/3/20/1176
And the patch below is one that uses the ioctl exported in this patch:
https://patchwork.kernel.org/patch/11464601/

Regards,
Yi Liu

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-31  7:53   ` Christoph Hellwig
  2020-03-31  8:17     ` Liu, Yi L
@ 2020-03-31  8:32     ` Liu, Yi L
  2020-03-31  8:36       ` Liu, Yi L
  1 sibling, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-31  8:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: alex.williamson, eric.auger, jean-philippe, Tian, Kevin, Raj,
	Ashok, kvm, Tian, Jun J, iommu, linux-kernel, Sun, Yi Y, Wu, Hao

> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, March 31, 2020 3:54 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> Who is going to use thse exports?  Please submit them together with
> a driver actually using them.

Hi Hellwig,

Sorry, maybe I misunderstood your point. Do you mean the exported symbols
below? They are used by the vfio_iommu_type1 driver, which is a separate
driver from vfio.ko.

+EXPORT_SYMBOL_GPL(vfio_mm_put);
+EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
+EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);

Regards,
Yi Liu

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-31  8:32     ` Liu, Yi L
@ 2020-03-31  8:36       ` Liu, Yi L
  2020-03-31  9:15         ` Christoph Hellwig
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-31  8:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: alex.williamson, eric.auger, jean-philippe, Tian, Kevin, Raj,
	Ashok, kvm, Tian, Jun J, iommu, linux-kernel, Sun, Yi Y, Wu, Hao

> From: Liu, Yi L
> Sent: Tuesday, March 31, 2020 4:33 PM
> To: 'Christoph Hellwig' <hch@infradead.org>
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Christoph Hellwig <hch@infradead.org>
> > Sent: Tuesday, March 31, 2020 3:54 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > Who is going to use thse exports?  Please submit them together with a
> > driver actually using them.
The users of the symbols are already in this patch. Sorry for the split answer.

Regards,
Yi Liu

* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-31  8:36       ` Liu, Yi L
@ 2020-03-31  9:15         ` Christoph Hellwig
  0 siblings, 0 replies; 110+ messages in thread
From: Christoph Hellwig @ 2020-03-31  9:15 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Christoph Hellwig, alex.williamson, eric.auger, jean-philippe,
	Tian, Kevin, Raj, Ashok, kvm, Tian, Jun J, iommu, linux-kernel,
	Sun, Yi Y, Wu, Hao

On Tue, Mar 31, 2020 at 08:36:32AM +0000, Liu, Yi L wrote:
> > From: Liu, Yi L
> > Sent: Tuesday, March 31, 2020 4:33 PM
> > To: 'Christoph Hellwig' <hch@infradead.org>
> > Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > 
> > > From: Christoph Hellwig <hch@infradead.org>
> > > Sent: Tuesday, March 31, 2020 3:54 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 1/8] vfio: Add
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > Who is going to use thse exports?  Please submit them together with a
> > > driver actually using them.
> the user of the symbols are already in this patch. sorry for the split answer..

Thanks, sorry for the noise!

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-03-31  7:56   ` Christoph Hellwig
@ 2020-03-31 10:48     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-03-31 10:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: alex.williamson, eric.auger, jean-philippe, Tian, Kevin, Raj,
	Ashok, kvm, Tian, Jun J, iommu, linux-kernel, Sun, Yi Y, Wu, Hao

Hi Hellwig,

Thanks for your review. :-) Replies inline.

> From: Christoph Hellwig <hch@infradead.org>
> Sent: Tuesday, March 31, 2020 3:56 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
> 
> > @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void
> *iommu_data,
> >  		}
> >  		kfree(gbind_data);
> >  		return ret;
> > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> 
> Please refactor the spaghetti in this ioctl handler to use a switch statement and a
> helper function per command before growing it even more.

Got it. I may do a separate refactoring patch before adding my changes.

> 
> > +		/* Get the version of struct iommu_cache_invalidate_info */
> > +		if (copy_from_user(&version,
> > +			(void __user *) (arg + minsz), sizeof(version)))
> > +			return -EFAULT;
> > +
> > +		info_size = iommu_uapi_get_data_size(
> > +					IOMMU_UAPI_CACHE_INVAL, version);
> > +
> > +		cache_info = kzalloc(info_size, GFP_KERNEL);
> > +		if (!cache_info)
> > +			return -ENOMEM;
> > +
> > +		if (copy_from_user(cache_info,
> > +			(void __user *) (arg + minsz), info_size)) {
> 
> The user might have changed the version while you were allocating and
> freeing the
> memory, introducing potentially exploitable racing conditions.

Yeah, I know @version is not welcome in the thread Jacob is driving.
I'll adjust the code here once the open issue in that thread has been resolved.

But regardless of the version, I'm not sure I fully got your point.
Could you elaborate a bit? BTW, this code was modeled on the code
below. The basic flow is to copy part of the data from arg and then copy
the rest after figuring out how much is left. The only difference between
the code below and mine is the way the remaining data size is
determined. If the race is real in such a flow, I guess quite a few
places need to be fixed.

vfio_pci_ioctl()
{
...
        } else if (cmd == VFIO_DEVICE_SET_IRQS) {
                struct vfio_irq_set hdr;
                u8 *data = NULL;
                int max, ret = 0;
                size_t data_size = 0;

                minsz = offsetofend(struct vfio_irq_set, count);

                if (copy_from_user(&hdr, (void __user *)arg, minsz))
                        return -EFAULT;

                max = vfio_pci_get_irq_count(vdev, hdr.index);

                ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
                                                 VFIO_PCI_NUM_IRQS, &data_size);
                if (ret)
                        return ret;

                if (data_size) {
                        data = memdup_user((void __user *)(arg + minsz),
                                            data_size);
                        if (IS_ERR(data))
                                return PTR_ERR(data);
                }

                mutex_lock(&vdev->igate);

                ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
                                              hdr.start, hdr.count, data);

                mutex_unlock(&vdev->igate);
                kfree(data);

                return ret;

        } else if (cmd == VFIO_DEVICE_RESET) {
...
}

Regards,
Yi Liu

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-31  5:40       ` Tian, Kevin
@ 2020-03-31 13:22         ` Liu, Yi L
  2020-04-01  5:43           ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-03-31 13:22 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Tuesday, March 31, 2020 1:41 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> eric.auger@redhat.com
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, March 30, 2020 10:37 PM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Monday, March 30, 2020 4:32 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > For a long time, devices have only one DMA address space from
> > > > platform IOMMU's point of view. This is true for both bare metal
> > > > and directed- access in virtualization environment. Reason is the
> > > > source ID of DMA in PCIe are BDF (bus/dev/fnc ID), which results
> > > > in only device granularity
> > >
> > > are->is
> >
> > thanks.
> >
> > > > DMA isolation. However, this is changing with the latest
> > > > advancement in I/O technology area. More and more platform vendors
> > > > are utilizing the
> > PCIe
> > > > PASID TLP prefix in DMA requests, thus to give devices with
> > > > multiple DMA address spaces as identified by their individual
> > > > PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a Shared
> > > > Virtual Memory) is able to let device access multiple process
> > > > virtual address space by binding the
> > >
> > > "address space" -> "address spaces"
> > >
> > > "binding the" -> "binding each"
> >
> > will correct both.
> >
> > > > virtual address space with a PASID. Wherein the PASID is allocated
> > > > in software and programmed to device per device specific manner.
> > > > Devices which support PASID capability are called PASID-capable
> > > > devices. If such devices are passed through to VMs, guest software
> > > > are also able to bind guest process virtual address space on such
> > > > devices. Therefore, the guest software could reuse the bare metal
> > > > software programming model,
> > which
> > > > means guest software will also allocate PASID and program it to
> > > > device directly. This is a dangerous situation since it has
> > > > potential PASID conflicts and unauthorized address space access.
> > > > It would be safer to let host intercept in the guest software's
> > > > PASID allocation. Thus PASID are managed system-wide.
> > > >
> > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > > passdown PASID allocation/free request from the virtual IOMMU.
> > > > Additionally, such
> > >
> > > "Additionally, because such"
> > >
> > > > requests are intended to be invoked by QEMU or other applications
> > which
> > >
> > > simplify to "intended to be invoked from userspace"
> >
> > got it.
> >
> > > > are running in userspace, it is necessary to have a mechanism to
> > > > prevent single application from abusing available PASIDs in
> > > > system. With such consideration, this patch tracks the VFIO PASID
> > > > allocation per-VM. There was a discussion to make quota to be per
> > > > assigned devices. e.g. if a VM has many assigned devices, then it
> > > > should have more quota. However, it is not sure how many PASIDs an
> > > > assigned devices will use. e.g. it is
> > >
> > > devices -> device
> >
> > got it.
> >
> > > > possible that a VM with multiples assigned devices but requests
> > > > less PASIDs. Therefore per-VM quota would be better.
> > > >
> > > > This patch uses struct mm pointer as a per-VM token. We also
> > > > considered using task structure pointer and vfio_iommu structure
> > > > pointer. However, task structure is per-thread, which means it
> > > > cannot achieve per-VM PASID alloc tracking purpose. While for
> > > > vfio_iommu structure, it is visible only within vfio. Therefore,
> > > > structure mm pointer is selected. This patch adds a structure
> > > > vfio_mm. A vfio_mm is created when the first vfio container is
> > > > opened by a VM. On the reverse order, vfio_mm is free when the
> > > > last vfio container is released. Each VM is assigned with a PASID
> > > > quota, so that it is not able to request PASID beyond its quota.
> > > > This patch adds a default quota of 1000. This quota could be tuned
> > > > by administrator. Making PASID quota tunable will be added in
> > > > another
> > patch
> > > > in this series.
> > > >
> > > > Previous discussions:
> > > > https://patchwork.kernel.org/patch/11209429/
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > ---
> > > >  drivers/vfio/vfio.c             | 130
> > > > ++++++++++++++++++++++++++++++++++++++++
> > > >  drivers/vfio/vfio_iommu_type1.c | 104
> > > > ++++++++++++++++++++++++++++++++
> > > >  include/linux/vfio.h            |  20 +++++++
> > > >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> > > >  4 files changed, 295 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > c848262..d13b483 100644
> > > > --- a/drivers/vfio/vfio.c
> > > > +++ b/drivers/vfio/vfio.c
> > > > @@ -32,6 +32,7 @@
> > > >  #include <linux/vfio.h>
> > > >  #include <linux/wait.h>
> > > >  #include <linux/sched/signal.h>
> > > > +#include <linux/sched/mm.h>
> > > >
> > > >  #define DRIVER_VERSION	"0.3"
> > > >  #define DRIVER_AUTHOR	"Alex Williamson
> > > > <alex.williamson@redhat.com>"
> > > > @@ -46,6 +47,8 @@ static struct vfio {
> > > >  	struct mutex			group_lock;
> > > >  	struct cdev			group_cdev;
> > > >  	dev_t				group_devt;
> > > > +	struct list_head		vfio_mm_list;
> > > > +	struct mutex			vfio_mm_lock;
> > > >  	wait_queue_head_t		release_q;
> > > >  } vfio;
> > > >
> > > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> > *dev,
> > > > enum vfio_notify_type type,
> > > >  EXPORT_SYMBOL(vfio_unregister_notifier);
> > > >
> > > >  /**
> > > > + * VFIO_MM objects - create, release, get, put, search
> > >
> > > why capitalizing vfio_mm?
> >
> > oops, it's not intended, will fix it.
> >
> > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > + */
> > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm) {
> > > > +	struct vfio_mm *vmm;
> > > > +	struct vfio_mm_token *token;
> > > > +	int ret = 0;
> > > > +
> > > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > +	if (!vmm)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	/* Per mm IOASID set used for quota control and group operations
> > > > */
> > > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > > >ioasid_sid);
> > > > +	if (ret) {
> > > > +		kfree(vmm);
> > > > +		return ERR_PTR(ret);
> > > > +	}
> > > > +
> > > > +	kref_init(&vmm->kref);
> > > > +	token = &vmm->token;
> > > > +	token->val = mm;
> > > > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > > +	mutex_init(&vmm->pasid_lock);
> > > > +
> > > > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > > +
> > > > +	return vmm;
> > > > +}
> > > > +
> > > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm) {
> > > > +	/* destroy the ioasid set */
> > > > +	ioasid_free_set(vmm->ioasid_sid, true);
> > >
> > > do we need hold pasid lock here, since it attempts to destroy a set
> > > which might be referenced by vfio_mm_pasid_free? or is there
> > > guarantee that such race won't happen?
> >
> > Emmm, if considering the race between ioasid_free_set and
> > vfio_mm_pasid_free, I guess ioasid core should sequence the two
> > operations with its internal lock. right?
> 
> I looked at below code in free path:
> 
> +	mutex_lock(&vmm->pasid_lock);
> +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> +	if (IS_ERR(pdata)) {
> +		ret = PTR_ERR(pdata);
> +		goto out_unlock;
> +	}
> +	ioasid_free(pasid);
> +
> +out_unlock:
> +	mutex_unlock(&vmm->pasid_lock);
> 
> above implies that ioasid_find/free must be paired within the pasid_lock.
> Then if we don't hold pasid_lock above, ioasid_free_set could happen between
> find/free. I'm not sure whether this race would lead to real problem, but it doesn't
> look correct simply by looking at this file.

Well, Jacob told me in another email to remove the ioasid_find, as he
believes the ioasid core should take care of it, and the check also needs
to be protected by the ioasid lock. If so, does it look good? :-)

 " [Jacob Pan] this might be better to put under ioasid code such that it 
  is under the ioasid lock. no one else can free the ioasid between find() and free().
  e.g. ioasid_free(sid, pasid)
  if sid == INVALID_IOASID_SET, then no set ownership checking.
  thoughts?"

> >
> > > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > > +	kfree(vmm);
> > > > +}
> > > > +
> > > > +/* called with vfio.vfio_mm_lock held */ static void
> > > > +vfio_mm_release(struct kref *kref) {
> > > > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > > +
> > > > +	list_del(&vmm->vfio_next);
> > > > +	vfio_mm_unlock_and_free(vmm);
> > > > +}
> > > > +
> > > > +void vfio_mm_put(struct vfio_mm *vmm) {
> > > > +	kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > > &vfio.vfio_mm_lock);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > > +
> > > > +/* Assume vfio_mm_lock or vfio_mm reference is held */ static
> > > > +void vfio_mm_get(struct vfio_mm *vmm) {
> > > > +	kref_get(&vmm->kref);
> > > > +}
> > > > +
> > > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) {
> > > > +	struct mm_struct *mm = get_task_mm(task);
> > > > +	struct vfio_mm *vmm;
> > > > +	unsigned long long val = (unsigned long long) mm;
> > > > +
> > > > +	mutex_lock(&vfio.vfio_mm_lock);
> > > > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > > +		if (vmm->token.val == val) {
> > > > +			vfio_mm_get(vmm);
> > > > +			goto out;
> > > > +		}
> > > > +	}
> > > > +
> > > > +	vmm = vfio_create_mm(mm);
> > > > +	if (IS_ERR(vmm))
> > > > +		vmm = NULL;
> > > > +out:
> > > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > > +	mmput(mm);
> > >
> > > I assume this has been discussed before, but from readability p.o.v
> > > it might be good to add a comment for this function to explain how
> > > the recording of mm in vfio_mm can be correctly removed when the mm
> > > is being destroyed, since we don't hold a reference of mm here.
> >
> > yeah, I'll add it.
> >
> > > > +	return vmm;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > +
> > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max) {
> > > > +	ioasid_t pasid;
> > > > +	int ret = -ENOSPC;
> > > > +
> > > > +	mutex_lock(&vmm->pasid_lock);
> > > > +
> > > > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > > +	if (pasid == INVALID_IOASID) {
> > > > +		ret = -ENOSPC;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = pasid;
> > > > +out_unlock:
> > > > +	mutex_unlock(&vmm->pasid_lock);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > > +
> > > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid) {
> > > > +	void *pdata;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&vmm->pasid_lock);
> > > > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > > +	if (IS_ERR(pdata)) {
> > > > +		ret = PTR_ERR(pdata);
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	ioasid_free(pasid);
> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&vmm->pasid_lock);
> > > > +	return ret;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > > +
> > > > +/**
> > > >   * Module/class support
> > > >   */
> > > >  static char *vfio_devnode(struct device *dev, umode_t *mode) @@
> > > > -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > >  	idr_init(&vfio.group_idr);
> > > >  	mutex_init(&vfio.group_lock);
> > > >  	mutex_init(&vfio.iommu_drivers_lock);
> > > > +	mutex_init(&vfio.vfio_mm_lock);
> > > >  	INIT_LIST_HEAD(&vfio.group_list);
> > > >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > >  	init_waitqueue_head(&vfio.release_q);
> > > >
> > > >  	ret = misc_register(&vfio_dev);
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index a177bf2..331ceee 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > >  	unsigned int		dma_avail;
> > > >  	bool			v2;
> > > >  	bool			nesting;
> > > > +	struct vfio_mm		*vmm;
> > > >  };
> > > >
> > > >  struct vfio_domain {
> > > > @@ -2018,6 +2019,7 @@ static void
> > vfio_iommu_type1_detach_group(void
> > > > *iommu_data,
> > > >  static void *vfio_iommu_type1_open(unsigned long arg)  {
> > > >  	struct vfio_iommu *iommu;
> > > > +	struct vfio_mm *vmm = NULL;
> > > >
> > > >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > >  	if (!iommu)
> > > > @@ -2043,6 +2045,10 @@ static void
> > *vfio_iommu_type1_open(unsigned
> > > > long arg)
> > > >  	iommu->dma_avail = dma_entry_limit;
> > > >  	mutex_init(&iommu->lock);
> > > >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > > +	vmm = vfio_mm_get_from_task(current);
> > > > +	if (!vmm)
> > > > +		pr_err("Failed to get vfio_mm track\n");
> > >
> > > I assume error should be returned when pr_err is used...
> >
> > got it. I didn't do it as I don’t think vfio_mm is necessary for
> > every iommu open. It is necessary for the nesting type iommu. I'll
> > make it fetch vmm when it is opening nesting type and return error
> > if failed.
> 
> sounds good.
> 
> >
> > > > +	iommu->vmm = vmm;
> > > >
> > > >  	return iommu;
> > > >  }
> > > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > > *iommu_data)
> > > >  	}
> > > >
> > > >  	vfio_iommu_iova_free(&iommu->iova_list);
> > > > +	if (iommu->vmm)
> > > > +		vfio_mm_put(iommu->vmm);
> > > >
> > > >  	kfree(iommu);
> > > >  }
> > > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct
> > > > vfio_iommu *iommu,
> > > >  	return ret;
> > > >  }
> > > >
> > > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > >
> > > I don't think you need prefix "vfio_iommu_type1" for every new
> > > function here, especially for leaf internal function as this one.
> >
> > got it. thanks.
> >
> > > > +{
> > > > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > > +		  flags & VFIO_IOMMU_PASID_FREE));
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > +					 int min,
> > > > +					 int max)
> > > > +{
> > > > +	struct vfio_mm *vmm = iommu->vmm;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EFAULT;
> > >
> > > why -EFAULT?
> >
> > well, it's from a prior comment as below:
> >   vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
> >   differentiate the errors. We could use EFAULT for the no IOMMU case
> >   and EINVAL here?
> > http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
> >
> > >
> > > > +		goto out_unlock;
> > > > +	}
> > > > +	if (vmm)
> > > > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > > +	else
> > > > +		ret = -EINVAL;
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > +				       unsigned int pasid)
> > > > +{
> > > > +	struct vfio_mm *vmm = iommu->vmm;
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EFAULT;
> > >
> > > ditto
> > >
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	if (vmm)
> > > > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > > > +	else
> > > > +		ret = -EINVAL;
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > > *iommu_data,
> > > >
> > > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > >  			-EFAULT : 0;
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > +		struct vfio_iommu_type1_pasid_request req;
> > > > +		unsigned long offset;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > > +				    flags);
> > > > +
> > > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (req.argsz < minsz ||
> > > > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > +			return -EINVAL;
> > > > +
> > > > +		if (copy_from_user((void *)&req + minsz,
> > > > +				   (void __user *)arg + minsz,
> > > > +				   sizeof(req) - minsz))
> > > > +			return -EFAULT;
> > >
> > > why copying in two steps instead of copying them together?
> >
> > just want to do sanity check before copying all the data. I
> > can move it as one copy if it's better. :-)
> 
> it's possibly fine. I just saw you did the same thing for other uapis.
> 
> >
> > > > +
> > > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > > +		{
> > > > +			int ret = 0, result;
> > > > +
> > > > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > > > +							req.alloc_pasid.min,
> > > > +							req.alloc_pasid.max);
> > > > +			if (result > 0) {
> > > > +				offset = offsetof(
> > > > +					struct
> > > > vfio_iommu_type1_pasid_request,
> > > > +					alloc_pasid.result);
> > > > +				ret = copy_to_user(
> > > > +					      (void __user *) (arg + offset),
> > > > +					      &result, sizeof(result));
> > > > +			} else {
> > > > +				pr_debug("%s: PASID alloc failed\n",
> > > > __func__);
> > > > +				ret = -EFAULT;
> > >
> > > no, this branch is not for copy_to_user error. it is about pasid alloc
> > > failure. you should handle both.
> >
> > Emmm, I just want to fail the IOCTL in such case, so the @result field
> > is meaningless in the user side. How about using another return value
> > (e.g. ENOSPC) to indicate the IOCTL failure?
> 
> If pasid_alloc fails, you return its result to userspace
> if copy_to_user fails, then return -EFAULT.
> 
> however, above you return -EFAULT for pasid_alloc failure, and
> then the number of not-copied bytes for copy_to_user.

I don't quite get it. Let me re-paste the code. :-)

+		case VFIO_IOMMU_PASID_ALLOC:
+		{
+			int ret = 0, result;
+
+			result = vfio_iommu_type1_pasid_alloc(iommu,
+							req.alloc_pasid.min,
+							req.alloc_pasid.max);
+			if (result > 0) {
+				offset = offsetof(
+					struct vfio_iommu_type1_pasid_request,
+					alloc_pasid.result);
+				ret = copy_to_user(
+					      (void __user *) (arg + offset),
+					      &result, sizeof(result));
If copy_to_user() fails, ret is the number of uncopied bytes, and that
value is returned to userspace to indicate failure. Userspace will not
use the data in the result field. Perhaps I should check ret here and
return -EFAULT or another errno instead of returning the number of
uncopied bytes.
+			} else {
+				pr_debug("%s: PASID alloc failed\n", __func__);
+				ret = -EFAULT;
If vfio_iommu_type1_pasid_alloc() fails, no doubt -EFAULT is returned
to userspace to indicate failure.
+			}
+			return ret;
+		}

Is there still a problem here?
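
For reference, the handling Kevin describes (pass the allocator's own
errno back to userspace, and turn any short copy into -EFAULT rather than
a byte count) can be sketched in plain userspace C as below. This is only
a minimal sketch under stated assumptions, not the patch itself:
fake_copy_to_user() and pasid_alloc_reply() are hypothetical names, the
fail flag exists purely to simulate a faulting user pointer, and mapping
a zero allocator result to -ENOSPC is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): like the kernel helper it
 * returns the number of bytes that could NOT be copied (0 on success).
 * The 'fail' flag simulates a bad user pointer.
 */
static unsigned long fake_copy_to_user(void *dst, const void *src,
				       unsigned long n, int fail)
{
	if (fail)
		return n;	/* nothing was copied */
	memcpy(dst, src, n);
	return 0;
}

/*
 * alloc_result is what the allocator returned: a PASID (> 0) or a
 * negative errno. Returns 0 on success or a negative errno.
 */
static long pasid_alloc_reply(int alloc_result, unsigned int *user_slot,
			      int fail_copy)
{
	unsigned int pasid;

	/* Allocation failed: propagate the allocator's errno unchanged. */
	if (alloc_result < 0)
		return alloc_result;
	/* Assumption: 0 is treated as "nothing allocated", as in the patch. */
	if (alloc_result == 0)
		return -ENOSPC;

	/* Allocation worked: a short copy becomes -EFAULT, not a count. */
	pasid = (unsigned int)alloc_result;
	if (fake_copy_to_user(user_slot, &pasid, sizeof(pasid), fail_copy))
		return -EFAULT;

	return 0;
}
```

With this shape the ioctl never returns a positive value, so userspace
can tell "no PASID available" from "result buffer not writable" by the
errno alone.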
> >
> > > > +			}
> > > > +			return ret;
> > > > +		}
> > > > +		case VFIO_IOMMU_PASID_FREE:
> > > > +			return vfio_iommu_type1_pasid_free(iommu,
> > > > +							   req.free_pasid);
> > > > +		default:
> > > > +			return -EINVAL;
> > > > +		}
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > index e42a711..75f9f7f1 100644
> > > > --- a/include/linux/vfio.h
> > > > +++ b/include/linux/vfio.h
> > > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct
> > > > vfio_iommu_driver_ops *ops);
> > > >  extern void vfio_unregister_iommu_driver(
> > > >  				const struct vfio_iommu_driver_ops *ops);
> > > >
> > > > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > > > +struct vfio_mm_token {
> > > > +	unsigned long long val;
> > > > +};
> > > > +
> > > > +struct vfio_mm {
> > > > +	struct kref			kref;
> > > > +	struct vfio_mm_token		token;
> > > > +	int				ioasid_sid;
> > > > +	/* protect @pasid_quota field and pasid allocation/free */
> > > > +	struct mutex			pasid_lock;
> > > > +	int				pasid_quota;
> > > > +	struct list_head		vfio_next;
> > > > +};
> > > > +
> > > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct
> > *task);
> > > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > > +
> > > >  /*
> > > >   * External user API
> > > >   */
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support DMA isolation in fine-grain.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > +	__u32	argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > > +	__u32	flags;
> > > > +	union {
> > > > +		struct {
> > > > +			__u32 min;
> > > > +			__u32 max;
> > > > +			__u32 result;
> > >
> > > result->pasid?
> >
> > yes, the pasid allocated.
> >
> > >
> > > > +		} alloc_pasid;
> > > > +		__u32 free_pasid;
> > >
> > > what about putting a common pasid field after flags?
> >
> > looks good to me. But it would make the union part only meaningful
> > to alloc pasid. if so, maybe make the union part as a data field and
> > only alloc pasid will have it. For free pasid, it is not necessary
> > to read it from userspace. does it look good?
> 
> maybe keeping the union is also OK, just with {min, max} for alloc.
> who knows whether more pasid ops will be added in the future
> which may require its specific union structure. ?? putting pasid
> as a common field is reasonable because the whole cmd is for
> pasid.

got it.
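
To make sure we mean the same layout, here is a rough sketch of the
request with pasid as a common field and the union kept for the
alloc-only {min, max} range. It is only an illustration of the
suggestion, not the next version of the uapi: all sketch_* names are
hypothetical, the mask value is an assumption (the mask definition is not
shown in the quoted hunk), and unlike the v1 helper this check also
rejects flags == 0 up front.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PASID_ALLOC		(1u << 0)
#define SKETCH_PASID_FREE		(1u << 1)
/* Assumption: the request mask covers exactly the two operations. */
#define SKETCH_PASID_REQUEST_MASK	(SKETCH_PASID_ALLOC | SKETCH_PASID_FREE)

struct sketch_pasid_request {
	uint32_t argsz;
	uint32_t flags;
	/* Common field: allocated PASID (out) or PASID to free (in). */
	uint32_t pasid;
	union {
		struct {
			uint32_t min;
			uint32_t max;
		} alloc;	/* only meaningful for SKETCH_PASID_ALLOC */
	} u;
};

/* A request is valid when exactly one known operation bit is set. */
static bool sketch_pasid_req_valid(uint32_t flags)
{
	uint32_t op = flags & SKETCH_PASID_REQUEST_MASK;

	if (flags & ~SKETCH_PASID_REQUEST_MASK)
		return false;
	return op == SKETCH_PASID_ALLOC || op == SKETCH_PASID_FREE;
}
```

Future pasid ops could add their own member to the union without moving
the common pasid field.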

Thanks,
Yi Liu


[-- Attachment #2: Type: message/rfc822, Size: 2197 bytes --]

From: Jacob Pan <jacob.jun.pan@linux.intel.com>
To: "Liu, Yi L" <yi.l.liu@intel.com>
Cc: "jacob.jun.pan@linux.intel.com" <jacob.jun.pan@linux.intel.com>
Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
Date: Fri, 27 Mar 2020 00:02:52 +0000
Message-ID: <20200326170252.719ff28d@jacob-builder>

off the list

On Sun, 22 Mar 2020 05:31:58 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> +     mutex_lock(&vmm->pasid_lock);
> +     pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> +     if (IS_ERR(pdata)) {
> +             ret = PTR_ERR(pdata);
> +             goto out_unlock;
> +     }
> +     ioasid_free(pasid);
> +
[Jacob Pan] this might be better to put under ioasid code such that it
is under the ioasid lock. no one else can free the ioasid between find()
and free().
e.g. ioasid_free(sid, pasid)
if sid == INVALID_IOASID_SET, then no set ownership checking.

thoughts?
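
A rough userspace sketch of the idea, so the lookup, the ownership check
and the free all happen under one lock. This is not the ioasid code: the
sketch_* names, the fixed-size table, the pthread mutex standing in for
the ioasid lock, and the -ENOENT/-EACCES return values are all
assumptions made for illustration.

```c
#include <errno.h>
#include <pthread.h>

#define SKETCH_MAX_IOASID	16
#define SKETCH_INVALID_SET	(-1)	/* plays the role of INVALID_IOASID_SET */

struct sketch_ioasid {
	int in_use;
	int sid;	/* set that owns this ID */
};

static struct sketch_ioasid sketch_table[SKETCH_MAX_IOASID];
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate the lowest free ID for set 'sid'; returns the ID or -ENOSPC. */
static int sketch_ioasid_alloc(int sid)
{
	int id, ret = -ENOSPC;

	pthread_mutex_lock(&sketch_lock);
	for (id = 0; id < SKETCH_MAX_IOASID; id++) {
		if (!sketch_table[id].in_use) {
			sketch_table[id].in_use = 1;
			sketch_table[id].sid = sid;
			ret = id;
			break;
		}
	}
	pthread_mutex_unlock(&sketch_lock);
	return ret;
}

/*
 * Find, check ownership and free in one critical section, so nobody
 * can free the ID between the lookup and the release.
 */
static int sketch_ioasid_free(int sid, unsigned int id)
{
	int ret = 0;

	pthread_mutex_lock(&sketch_lock);
	if (id >= SKETCH_MAX_IOASID || !sketch_table[id].in_use)
		ret = -ENOENT;		/* no such ID */
	else if (sid != SKETCH_INVALID_SET && sketch_table[id].sid != sid)
		ret = -EACCES;		/* ID belongs to another set */
	else
		sketch_table[id].in_use = 0;
	pthread_mutex_unlock(&sketch_lock);
	return ret;
}
```

The caller (vfio_mm_pasid_free in this series) would then need only the
single free call, with no separate find step of its own.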


* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-31 13:22         ` Liu, Yi L
@ 2020-04-01  5:43           ` Tian, Kevin
  2020-04-01  5:48             ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-01  5:43 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, March 31, 2020 9:22 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Tuesday, March 31, 2020 1:41 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > eric.auger@redhat.com
> > Subject: RE: [PATCH v1 1/8] vfio: Add
> VFIO_IOMMU_PASID_REQUEST(alloc/free)
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, March 30, 2020 10:37 PM
> > >
> > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Monday, March 30, 2020 4:32 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > >
> > > > > For a long time, devices have only one DMA address space from
> > > > > platform IOMMU's point of view. This is true for both bare metal
> > > > > and directed- access in virtualization environment. Reason is the
> > > > > source ID of DMA in PCIe are BDF (bus/dev/fnc ID), which results
> > > > > in only device granularity
> > > >
> > > > are->is
> > >
> > > thanks.
> > >
> > > > > DMA isolation. However, this is changing with the latest
> > > > > advancement in I/O technology area. More and more platform
> vendors
> > > > > are utilizing the
> > > PCIe
> > > > > PASID TLP prefix in DMA requests, thus to give devices with
> > > > > multiple DMA address spaces as identified by their individual
> > > > > PASIDs. For example, Shared Virtual Addressing (SVA, a.k.a Shared
> > > > > Virtual Memory) is able to let device access multiple process
> > > > > virtual address space by binding the
> > > >
> > > > "address space" -> "address spaces"
> > > >
> > > > "binding the" -> "binding each"
> > >
> > > will correct both.
> > >
> > > > > virtual address space with a PASID. Wherein the PASID is allocated
> > > > > in software and programmed to device per device specific manner.
> > > > > Devices which support PASID capability are called PASID-capable
> > > > > devices. If such devices are passed through to VMs, guest software
> > > > > are also able to bind guest process virtual address space on such
> > > > > devices. Therefore, the guest software could reuse the bare metal
> > > > > software programming model,
> > > which
> > > > > means guest software will also allocate PASID and program it to
> > > > > device directly. This is a dangerous situation since it has
> > > > > potential PASID conflicts and unauthorized address space access.
> > > > > It would be safer to let host intercept in the guest software's
> > > > > PASID allocation. Thus PASID are managed system-wide.
> > > > >
> > > > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to
> > > > > passdown PASID allocation/free request from the virtual IOMMU.
> > > > > Additionally, such
> > > >
> > > > "Additionally, because such"
> > > >
> > > > > requests are intended to be invoked by QEMU or other applications
> > > which
> > > >
> > > > simplify to "intended to be invoked from userspace"
> > >
> > > got it.
> > >
> > > > > are running in userspace, it is necessary to have a mechanism to
> > > > > prevent single application from abusing available PASIDs in
> > > > > system. With such consideration, this patch tracks the VFIO PASID
> > > > > allocation per-VM. There was a discussion to make quota to be per
> > > > > assigned devices. e.g. if a VM has many assigned devices, then it
> > > > > should have more quota. However, it is not sure how many PASIDs an
> > > > > assigned devices will use. e.g. it is
> > > >
> > > > devices -> device
> > >
> > > got it.
> > >
> > > > > possible that a VM with multiples assigned devices but requests
> > > > > less PASIDs. Therefore per-VM quota would be better.
> > > > >
> > > > > This patch uses struct mm pointer as a per-VM token. We also
> > > > > considered using task structure pointer and vfio_iommu structure
> > > > > pointer. However, task structure is per-thread, which means it
> > > > > cannot achieve per-VM PASID alloc tracking purpose. While for
> > > > > vfio_iommu structure, it is visible only within vfio. Therefore,
> > > > > structure mm pointer is selected. This patch adds a structure
> > > > > vfio_mm. A vfio_mm is created when the first vfio container is
> > > > > opened by a VM. On the reverse order, vfio_mm is free when the
> > > > > last vfio container is released. Each VM is assigned with a PASID
> > > > > quota, so that it is not able to request PASID beyond its quota.
> > > > > This patch adds a default quota of 1000. This quota could be tuned
> > > > > by administrator. Making PASID quota tunable will be added in
> > > > > another
> > > patch
> > > > > in this series.
> > > > >
> > > > > Previous discussions:
> > > > > https://patchwork.kernel.org/patch/11209429/
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > ---
> > > > >  drivers/vfio/vfio.c             | 130
> > > > > ++++++++++++++++++++++++++++++++++++++++
> > > > >  drivers/vfio/vfio_iommu_type1.c | 104
> > > > > ++++++++++++++++++++++++++++++++
> > > > >  include/linux/vfio.h            |  20 +++++++
> > > > >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> > > > >  4 files changed, 295 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > > c848262..d13b483 100644
> > > > > --- a/drivers/vfio/vfio.c
> > > > > +++ b/drivers/vfio/vfio.c
> > > > > @@ -32,6 +32,7 @@
> > > > >  #include <linux/vfio.h>
> > > > >  #include <linux/wait.h>
> > > > >  #include <linux/sched/signal.h>
> > > > > +#include <linux/sched/mm.h>
> > > > >
> > > > >  #define DRIVER_VERSION	"0.3"
> > > > >  #define DRIVER_AUTHOR	"Alex Williamson
> > > > > <alex.williamson@redhat.com>"
> > > > > @@ -46,6 +47,8 @@ static struct vfio {
> > > > >  	struct mutex			group_lock;
> > > > >  	struct cdev			group_cdev;
> > > > >  	dev_t				group_devt;
> > > > > +	struct list_head		vfio_mm_list;
> > > > > +	struct mutex			vfio_mm_lock;
> > > > >  	wait_queue_head_t		release_q;
> > > > >  } vfio;
> > > > >
> > > > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device
> > > *dev,
> > > > > enum vfio_notify_type type,
> > > > >  EXPORT_SYMBOL(vfio_unregister_notifier);
> > > > >
> > > > >  /**
> > > > > + * VFIO_MM objects - create, release, get, put, search
> > > >
> > > > why capitalizing vfio_mm?
> > >
> > > oops, it's not intended, will fix it.
> > >
> > > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > > + */
> > > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm) {
> > > > > +	struct vfio_mm *vmm;
> > > > > +	struct vfio_mm_token *token;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > > +	if (!vmm)
> > > > > +		return ERR_PTR(-ENOMEM);
> > > > > +
> > > > > +	/* Per mm IOASID set used for quota control and group
> operations
> > > > > */
> > > > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > > > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm-
> > > > > >ioasid_sid);
> > > > > +	if (ret) {
> > > > > +		kfree(vmm);
> > > > > +		return ERR_PTR(ret);
> > > > > +	}
> > > > > +
> > > > > +	kref_init(&vmm->kref);
> > > > > +	token = &vmm->token;
> > > > > +	token->val = mm;
> > > > > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > > > +	mutex_init(&vmm->pasid_lock);
> > > > > +
> > > > > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > > > +
> > > > > +	return vmm;
> > > > > +}
> > > > > +
> > > > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm) {
> > > > > +	/* destroy the ioasid set */
> > > > > +	ioasid_free_set(vmm->ioasid_sid, true);
> > > >
> > > > do we need hold pasid lock here, since it attempts to destroy a set
> > > > which might be referenced by vfio_mm_pasid_free? or is there
> > > > guarantee that such race won't happen?
> > >
> > > Emmm, if considering the race between ioasid_free_set and
> > > vfio_mm_pasid_free, I guess ioasid core should sequence the two
> > > operations with its internal lock. right?
> >
> > I looked at below code in free path:
> >
> > +	mutex_lock(&vmm->pasid_lock);
> > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > +	if (IS_ERR(pdata)) {
> > +		ret = PTR_ERR(pdata);
> > +		goto out_unlock;
> > +	}
> > +	ioasid_free(pasid);
> > +
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> >
> > above implies that ioasid_find/free must be paired within the pasid_lock.
> > Then if we don't hold pasid_lock above, ioasid_free_set could happen
> between
> > find/free. I'm not sure whether this race would lead to real problem, but it
> doesn't
> > look correct simply by looking at this file.
> 
> Well, Jacob told me to remove the ioasid_find in another email as he
> believes ioasid core should be able to take care of it. and also need to
> be protected by lock. If so, does it look good? :-)
> 
>  " [Jacob Pan] this might be better to put under ioasid code such that it
>   is under the ioasid lock. no one else can free the ioasid between find() and
> free().
>   e.g. ioasid_free(sid, pasid)
>   if sid == INVALID_IOASID_SET, then no set ownership checking.
>   thoughts?"

yes, that way looks better.

> 
> > >
> > > > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > > > +	kfree(vmm);
> > > > > +}
> > > > > +
> > > > > +/* called with vfio.vfio_mm_lock held */ static void
> > > > > +vfio_mm_release(struct kref *kref) {
> > > > > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm,
> kref);
> > > > > +
> > > > > +	list_del(&vmm->vfio_next);
> > > > > +	vfio_mm_unlock_and_free(vmm);
> > > > > +}
> > > > > +
> > > > > +void vfio_mm_put(struct vfio_mm *vmm) {
> > > > > +	kref_put_mutex(&vmm->kref, vfio_mm_release,
> > > > > &vfio.vfio_mm_lock);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > > > +
> > > > > +/* Assume vfio_mm_lock or vfio_mm reference is held */ static
> > > > > +void vfio_mm_get(struct vfio_mm *vmm) {
> > > > > +	kref_get(&vmm->kref);
> > > > > +}
> > > > > +
> > > > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task) {
> > > > > +	struct mm_struct *mm = get_task_mm(task);
> > > > > +	struct vfio_mm *vmm;
> > > > > +	unsigned long long val = (unsigned long long) mm;
> > > > > +
> > > > > +	mutex_lock(&vfio.vfio_mm_lock);
> > > > > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > > > +		if (vmm->token.val == val) {
> > > > > +			vfio_mm_get(vmm);
> > > > > +			goto out;
> > > > > +		}
> > > > > +	}
> > > > > +
> > > > > +	vmm = vfio_create_mm(mm);
> > > > > +	if (IS_ERR(vmm))
> > > > > +		vmm = NULL;
> > > > > +out:
> > > > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > > > +	mmput(mm);
> > > >
> > > > I assume this has been discussed before, but from readability p.o.v
> > > > it might be good to add a comment for this function to explain how
> > > > the recording of mm in vfio_mm can be correctly removed when the
> mm
> > > > is being destroyed, since we don't hold a reference of mm here.
> > >
> > > yeah, I'll add it.
> > >
> > > > > +	return vmm;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > > +
> > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max) {
> > > > > +	ioasid_t pasid;
> > > > > +	int ret = -ENOSPC;
> > > > > +
> > > > > +	mutex_lock(&vmm->pasid_lock);
> > > > > +
> > > > > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > > > +	if (pasid == INVALID_IOASID) {
> > > > > +		ret = -ENOSPC;
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +
> > > > > +	ret = pasid;
> > > > > +out_unlock:
> > > > > +	mutex_unlock(&vmm->pasid_lock);
> > > > > +	return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > > > +
> > > > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid) {
> > > > > +	void *pdata;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	mutex_lock(&vmm->pasid_lock);
> > > > > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > > > +	if (IS_ERR(pdata)) {
> > > > > +		ret = PTR_ERR(pdata);
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +	ioasid_free(pasid);
> > > > > +
> > > > > +out_unlock:
> > > > > +	mutex_unlock(&vmm->pasid_lock);
> > > > > +	return ret;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > > > +
> > > > > +/**
> > > > >   * Module/class support
> > > > >   */
> > > > >  static char *vfio_devnode(struct device *dev, umode_t *mode) @@
> > > > > -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > > > >  	idr_init(&vfio.group_idr);
> > > > >  	mutex_init(&vfio.group_lock);
> > > > >  	mutex_init(&vfio.iommu_drivers_lock);
> > > > > +	mutex_init(&vfio.vfio_mm_lock);
> > > > >  	INIT_LIST_HEAD(&vfio.group_list);
> > > > >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > > > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > > > >  	init_waitqueue_head(&vfio.release_q);
> > > > >
> > > > >  	ret = misc_register(&vfio_dev);
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > > b/drivers/vfio/vfio_iommu_type1.c index a177bf2..331ceee 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > > > >  	unsigned int		dma_avail;
> > > > >  	bool			v2;
> > > > >  	bool			nesting;
> > > > > +	struct vfio_mm		*vmm;
> > > > >  };
> > > > >
> > > > >  struct vfio_domain {
> > > > > @@ -2018,6 +2019,7 @@ static void
> > > vfio_iommu_type1_detach_group(void
> > > > > *iommu_data,
> > > > >  static void *vfio_iommu_type1_open(unsigned long arg)  {
> > > > >  	struct vfio_iommu *iommu;
> > > > > +	struct vfio_mm *vmm = NULL;
> > > > >
> > > > >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > > > >  	if (!iommu)
> > > > > @@ -2043,6 +2045,10 @@ static void
> > > *vfio_iommu_type1_open(unsigned
> > > > > long arg)
> > > > >  	iommu->dma_avail = dma_entry_limit;
> > > > >  	mutex_init(&iommu->lock);
> > > > >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > > > +	vmm = vfio_mm_get_from_task(current);
> > > > > +	if (!vmm)
> > > > > +		pr_err("Failed to get vfio_mm track\n");
> > > >
> > > > I assume error should be returned when pr_err is used...
> > >
> > > got it. I didn't do it as I don’t think vfio_mm is necessary for
> > > every iommu open. It is necessary for the nesting type iommu. I'll
> > > make it fetch vmm when it is opening nesting type and return error
> > > if failed.
> >
> > sounds good.
> >
> > >
> > > > > +	iommu->vmm = vmm;
> > > > >
> > > > >  	return iommu;
> > > > >  }
> > > > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void
> > > > > *iommu_data)
> > > > >  	}
> > > > >
> > > > >  	vfio_iommu_iova_free(&iommu->iova_list);
> > > > > +	if (iommu->vmm)
> > > > > +		vfio_mm_put(iommu->vmm);
> > > > >
> > > > >  	kfree(iommu);
> > > > >  }
> > > > > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > > >
> > > > I don't think you need prefix "vfio_iommu_type1" for every new
> > > > function here, especially for leaf internal function as this one.
> > >
> > > got it. thanks.
> > >
> > > > > +{
> > > > > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > > > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > > > +		  flags & VFIO_IOMMU_PASID_FREE));
> > > > > +}
> > > > > +
> > > > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > > > +					 int min,
> > > > > +					 int max)
> > > > > +{
> > > > > +	struct vfio_mm *vmm = iommu->vmm;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	mutex_lock(&iommu->lock);
> > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > +		ret = -EFAULT;
> > > >
> > > > why -EFAULT?
> > >
> > > well, it's from a prior comment as below:
> > >   vfio_mm_pasid_alloc() can return -ENOSPC though, so it'd be nice to
> > >   differentiate the errors. We could use EFAULT for the no IOMMU case
> > >   and EINVAL here?
> > > http://lkml.iu.edu/hypermail/linux/kernel/2001.3/05964.html
> > >
> > > >
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +	if (vmm)
> > > > > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > > > +	else
> > > > > +		ret = -EINVAL;
> > > > > +out_unlock:
> > > > > +	mutex_unlock(&iommu->lock);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > > > +				       unsigned int pasid)
> > > > > +{
> > > > > +	struct vfio_mm *vmm = iommu->vmm;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	mutex_lock(&iommu->lock);
> > > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > > +		ret = -EFAULT;
> > > >
> > > > ditto
> > > >
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +
> > > > > +	if (vmm)
> > > > > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > > > > +	else
> > > > > +		ret = -EINVAL;
> > > > > +out_unlock:
> > > > > +	mutex_unlock(&iommu->lock);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > >  				   unsigned int cmd, unsigned long arg)
> > > > >  {
> > > > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > >
> > > > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > > >  			-EFAULT : 0;
> > > > > +
> > > > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > > +		struct vfio_iommu_type1_pasid_request req;
> > > > > +		unsigned long offset;
> > > > > +
> > > > > +		minsz = offsetofend(struct
> vfio_iommu_type1_pasid_request,
> > > > > +				    flags);
> > > > > +
> > > > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > > +			return -EFAULT;
> > > > > +
> > > > > +		if (req.argsz < minsz ||
> > > > > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > > +			return -EINVAL;
> > > > > +
> > > > > +		if (copy_from_user((void *)&req + minsz,
> > > > > +				   (void __user *)arg + minsz,
> > > > > +				   sizeof(req) - minsz))
> > > > > +			return -EFAULT;
> > > >
> > > > why copying in two steps instead of copying them together?
> > >
> > > just want to do sanity check before copying all the data. I
> > > can move it as one copy if it's better. :-)
> >
> > it's possibly fine. I just saw you did the same thing for other uapis.
> >
> > >
> > > > > +
> > > > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > > > +		{
> > > > > +			int ret = 0, result;
> > > > > +
> > > > > +			result =
> vfio_iommu_type1_pasid_alloc(iommu,
> > > > > +
> 	req.alloc_pasid.min,
> > > > > +
> 	req.alloc_pasid.max);
> > > > > +			if (result > 0) {
> > > > > +				offset = offsetof(
> > > > > +					struct
> > > > > vfio_iommu_type1_pasid_request,
> > > > > +					alloc_pasid.result);
> > > > > +				ret = copy_to_user(
> > > > > +					      (void __user *) (arg +
> offset),
> > > > > +					      &result, sizeof(result));
> > > > > +			} else {
> > > > > +				pr_debug("%s: PASID alloc failed\n",
> > > > > __func__);
> > > > > +				ret = -EFAULT;
> > > >
> > > > no, this branch is not for copy_to_user error. it is about pasid alloc
> > > > failure. you should handle both.
> > >
> > > Emmm, I just want to fail the IOCTL in such case, so the @result field
> > > is meaningless in the user side. How about using another return value
> > > (e.g. ENOSPC) to indicate the IOCTL failure?
> >
> > If pasid_alloc fails, you return its result to userspace
> > if copy_to_user fails, then return -EFAULT.
> >
> > however, above you return -EFAULT for pasid_alloc failure, and
> > then the number of not-copied bytes for copy_to_user.
> 
> not quite get. Let me re-paste the code. :-)
> 
> +		case VFIO_IOMMU_PASID_ALLOC:
> +		{
> +			int ret = 0, result;
> +
> +			result = vfio_iommu_type1_pasid_alloc(iommu,
> +							req.alloc_pasid.min,
> +							req.alloc_pasid.max);
> +			if (result > 0) {
> +				offset = offsetof(
> +					struct vfio_iommu_type1_pasid_request,
> +					alloc_pasid.result);
> +				ret = copy_to_user(
> +					      (void __user *) (arg + offset),
> +					      &result, sizeof(result));
> if copy_to_user failed, ret is the number of uncopied bytes and
> will be returned to userspace to indicate failure. userspace will
> not use the data in result field. perhaps, I should check the ret
> here and return -EFAULT or another errno, instead of return the
> number of un-copied bytes.

here should return -EFAULT.

> +			} else {
> +				pr_debug("%s: PASID alloc failed\n", __func__);
> +				ret = -EFAULT;
> if vfio_iommu_type1_pasid_alloc() failed, no doubt, return -EFAULT
> to userspace to indicate failure.

pasid_alloc has its own error types returned. why blindly replace it
with -EFAULT?

> +			}
> +			return ret;
> +		}
> 
> is there still a problem here?
> > >
> > > > > +			}
> > > > > +			return ret;
> > > > > +		}
> > > > > +		case VFIO_IOMMU_PASID_FREE:
> > > > > +			return vfio_iommu_type1_pasid_free(iommu,
> > > > > +
> req.free_pasid);
> > > > > +		default:
> > > > > +			return -EINVAL;
> > > > > +		}
> > > > >  	}
> > > > >
> > > > >  	return -ENOTTY;
> > > > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > > > index e42a711..75f9f7f1 100644
> > > > > --- a/include/linux/vfio.h
> > > > > +++ b/include/linux/vfio.h
> > > > > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> > > > >  extern void vfio_unregister_iommu_driver(
> > > > >  				const struct vfio_iommu_driver_ops *ops);
> > > > >
> > > > > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > > > > +struct vfio_mm_token {
> > > > > +	unsigned long long val;
> > > > > +};
> > > > > +
> > > > > +struct vfio_mm {
> > > > > +	struct kref			kref;
> > > > > +	struct vfio_mm_token		token;
> > > > > +	int				ioasid_sid;
> > > > > +	/* protect @pasid_quota field and pasid allocation/free */
> > > > > +	struct mutex			pasid_lock;
> > > > > +	int				pasid_quota;
> > > > > +	struct list_head		vfio_next;
> > > > > +};
> > > > > +
> > > > > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > > > > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > > > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > > > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > > > +
> > > > >  /*
> > > > >   * External user API
> > > > >   */
> > > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > > index 9e843a1..298ac80 100644
> > > > > --- a/include/uapi/linux/vfio.h
> > > > > +++ b/include/uapi/linux/vfio.h
> > > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > > > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > > > >
> > > > > +/*
> > > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > > + * has been extended to support DMA isolation in fine-grain.
> > > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > > + * and free need to be system wide. This structure defines
> > > > > + * the info for pasid alloc/free between user space and kernel
> > > > > + * space.
> > > > > + *
> > > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > > + */
> > > > > +struct vfio_iommu_type1_pasid_request {
> > > > > +	__u32	argsz;
> > > > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > > > +	__u32	flags;
> > > > > +	union {
> > > > > +		struct {
> > > > > +			__u32 min;
> > > > > +			__u32 max;
> > > > > +			__u32 result;
> > > >
> > > > result->pasid?
> > >
> > > yes, the pasid allocated.
> > >
> > > >
> > > > > +		} alloc_pasid;
> > > > > +		__u32 free_pasid;
> > > >
> > > > what about putting a common pasid field after flags?
> > >
> > > looks good to me. But it would make the union part only meaningful
> > > to alloc pasid. if so, maybe make the union part as a data field and
> > > only alloc pasid will have it. For free pasid, it is not necessary
> > > to read it from userspace. does it look good?
> >
> > maybe keeping the union is also OK, just with {min, max} for alloc.
> > who knows whether more pasid ops will be added in the future
> > which may require its specific union structure. ?? putting pasid
> > as a common field is reasonable because the whole cmd is for
> > pasid.
> 
> got it.
> 
> Thanks,
> Yi Liu


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-01  5:43           ` Tian, Kevin
@ 2020-04-01  5:48             ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  5:48 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, April 1, 2020 1:43 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Tuesday, March 31, 2020 9:22 PM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Tuesday, March 31, 2020 1:41 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > eric.auger@redhat.com
> > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Monday, March 30, 2020 10:37 PM
> > > >
> > > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > > Sent: Monday, March 30, 2020 4:32 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > > Subject: RE: [PATCH v1 1/8] vfio: Add
> > > > VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > > > >
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > > >
> > > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > > >
> > > > > > For a long time, devices have only one DMA address space from
> > > > > > platform IOMMU's point of view. This is true for both bare
> > > > > > metal and directed- access in virtualization environment.
> > > > > > Reason is the source ID of DMA in PCIe are BDF (bus/dev/fnc
> > > > > > ID), which results in only device granularity
[...]
> > >
> > > >
> > > > > > +
> > > > > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > > > > +		{
> > > > > > +			int ret = 0, result;
> > > > > > +
> > > > > > +			result =
> > vfio_iommu_type1_pasid_alloc(iommu,
> > > > > > +
> > 	req.alloc_pasid.min,
> > > > > > +
> > 	req.alloc_pasid.max);
> > > > > > +			if (result > 0) {
> > > > > > +				offset = offsetof(
> > > > > > +					struct
> > > > > > vfio_iommu_type1_pasid_request,
> > > > > > +					alloc_pasid.result);
> > > > > > +				ret = copy_to_user(
> > > > > > +					      (void __user *) (arg +
> > offset),
> > > > > > +					      &result, sizeof(result));
> > > > > > +			} else {
> > > > > > +				pr_debug("%s: PASID alloc failed\n",
> > > > > > __func__);
> > > > > > +				ret = -EFAULT;
> > > > >
> > > > > no, this branch is not for copy_to_user error. it is about pasid
> > > > > alloc failure. you should handle both.
> > > >
> > > > Emmm, I just want to fail the IOCTL in such case, so the @result
> > > > field is meaningless in the user side. How about using another
> > > > return value (e.g. ENOSPC) to indicate the IOCTL failure?
> > >
> > > If pasid_alloc fails, you return its result to userspace if
> > > copy_to_user fails, then return -EFAULT.
> > >
> > > however, above you return -EFAULT for pasid_alloc failure, and then
> > > the number of not-copied bytes for copy_to_user.
> >
> > not quite get. Let me re-paste the code. :-)
> >
> > +		case VFIO_IOMMU_PASID_ALLOC:
> > +		{
> > +			int ret = 0, result;
> > +
> > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > +							req.alloc_pasid.min,
> > +							req.alloc_pasid.max);
> > +			if (result > 0) {
> > +				offset = offsetof(
> > +					struct vfio_iommu_type1_pasid_request,
> > +					alloc_pasid.result);
> > +				ret = copy_to_user(
> > +					      (void __user *) (arg + offset),
> > +					      &result, sizeof(result));
> > if copy_to_user failed, ret is the number of uncopied bytes and will
> > be returned to userspace to indicate failure. userspace will not use
> > the data in result field. perhaps, I should check the ret here and
> > return -EFAULT or another errno, instead of return the number of
> > un-copied bytes.
> 
> here should return -EFAULT.

got it. so on any failure, the ioctl returns a negative errno value.

> 
> > +			} else {
> > +				pr_debug("%s: PASID alloc failed\n", __func__);
> > +				ret = -EFAULT;
> > if vfio_iommu_type1_pasid_alloc() failed, no doubt, return -EFAULT to
> > userspace to indicate failure.
> 
> pasid_alloc has its own error types returned. why blindly replace it with -EFAULT?

right, should use its own error types.
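
To make the agreed behaviour concrete, here is a minimal user-space sketch of the return-value handling discussed above. It is not the code that will be posted in the next version: fake_copy_to_user() and fake_pasid_alloc() are hypothetical stand-ins for copy_to_user() and vfio_iommu_type1_pasid_alloc(), and the quota counter only exists so that the allocator has a way to fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for copy_to_user(): like the real helper, it
 * returns the number of bytes that could NOT be copied (0 on success).
 */
static unsigned long fake_copy_to_user(void *dst, const void *src,
				       unsigned long n)
{
	if (!dst)
		return n;	/* simulate a faulting user pointer */
	memcpy(dst, src, n);
	return 0;
}

/*
 * Hypothetical stand-in for the allocator: returns a positive PASID
 * on success or a negative errno on failure.
 */
static int fake_pasid_alloc(int *quota, int min, int max)
{
	assert(quota != NULL);
	if (min <= 0 || min > max)
		return -EINVAL;
	if (*quota <= 0)
		return -ENOSPC;
	(*quota)--;
	return min;
}

/* The ALLOC branch of the ioctl, reduced to its error handling. */
static int pasid_alloc_ioctl(int *quota, int min, int max, int *user_result)
{
	int result = fake_pasid_alloc(quota, min, max);

	/* Allocation failed: pass the allocator's own errno through. */
	if (result < 0)
		return result;

	/* Never hand the "bytes not copied" count back to the caller. */
	if (fake_copy_to_user(user_result, &result, sizeof(result)))
		return -EFAULT;

	return 0;
}
```

With this shape userspace sees 0 on success, the allocator's errno (e.g. -ENOSPC) when allocation fails, and -EFAULT only when the copy back fails. A real implementation would probably also want to free the PASID when the copy fails; that part is left out of the sketch.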

Thanks,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-03-30 11:48   ` Tian, Kevin
@ 2020-04-01  7:38     ` Liu, Yi L
  2020-04-01  7:56       ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  7:38 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 7:49 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> > capability to userspace. Thus applications like QEMU could support
> > vIOMMU with hardware's nesting translation capability for pass-through
> > devices. Before setting up nesting translation for pass-through
> > devices, QEMU and other applications need to learn the supported
> > 1st-lvl/stage-1 translation structure format like page table format.
> >
> > Take vSVA (virtual Shared Virtual Addressing) as an example: to
> > support vSVA for pass-through devices, QEMU sets up nesting translation
> > for pass-through devices. The guest page tables are configured on the
> > host as 1st-lvl/stage-1 page tables. Therefore, the guest format should
> > be compatible with the host side.
> >
> > This patch reports the supported 1st-lvl/stage-1 page table format on
> > the current platform to userspace. QEMU and other alike applications
> > should use this format info when trying to setup IOMMU nesting
> > translation on host IOMMU.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  1 +
> >  2 files changed, 57 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 9aa2a67..82a9e0b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > +					 u32 *stage1_format)
> > +{
> > +	struct vfio_domain *domain;
> > +	u32 format = 0, tmp_format = 0;
> > +	int ret;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (list_empty(&iommu->domain_list)) {
> > +		mutex_unlock(&iommu->lock);
> > +		return -EINVAL;
> > +	}
> > +
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		if (iommu_domain_get_attr(domain->domain,
> > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > +			ret = -EINVAL;
> > +			format = 0;
> > +			goto out_unlock;
> > +		}
> > +		/*
> > +		 * format is always non-zero (the first format is
> > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> > +		 * the reason of potential different backed IOMMU
> > +		 * formats, here we expect to have identical formats
> > +		 * in the domain list, no mixed formats support.
> > +		 * return -EINVAL to fail the attempt of setup
> > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> > +		 * are detected.
> > +		 */
> > +		if (tmp_format && tmp_format != format) {
> > +			ret = -EINVAL;
> > +			format = 0;
> > +			goto out_unlock;
> > +		}
> > +
> > +		tmp_format = format;
> > +	}
> 
> this path is invoked only in VFIO_IOMMU_GET_INFO path. If we don't want to
> assume the status quo that one container holds only one device w/ vIOMMU
> (the prerequisite for vSVA), looks we also need check the format
> compatibility when attaching a new group to this container?

right. when attaching to a nesting type container (the vfio_iommu.nesting
bit indicates it), it should check that the new domain's format is
compatible with the prior domains in the domain list. If it is the first
one attached to this container, no check is needed. does that sound good?
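
As a rough illustration of the attach-time check described above, here is a user-space sketch. struct fake_container, FAKE_MAX_DOMAINS and fake_attach_domain() are hypothetical and only model the idea; the real check would sit in the type1 attach path and query the format from the IOMMU domain.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_MAX_DOMAINS 8

/* Hypothetical, simplified stand-in for a container's domain list. */
struct fake_container {
	int nesting;				/* mirrors vfio_iommu.nesting */
	uint32_t formats[FAKE_MAX_DOMAINS];	/* stage-1 format per domain */
	size_t nr_domains;
};

/*
 * Attach a domain with the given stage-1 format. For a nesting
 * container every domain must report the same non-zero format; the
 * first domain attached sets the format for the container.
 */
static int fake_attach_domain(struct fake_container *c, uint32_t format)
{
	assert(c != NULL);

	if (c->nr_domains == FAKE_MAX_DOMAINS)
		return -ENOSPC;

	if (c->nesting) {
		if (!format)
			return -EINVAL;
		/* All earlier domains already match, so check the first. */
		if (c->nr_domains && c->formats[0] != format)
			return -EINVAL;
	}

	c->formats[c->nr_domains++] = format;
	return 0;
}
```

Because a mismatch is rejected at attach time, the GET_INFO path could then report the format of any domain in the list without walking all of them.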

> > +	ret = 0;
> > +
> > +out_unlock:
> > +	if (format)
> > +		*stage1_format = format;
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  					 struct vfio_info_cap *caps)
> >  {
> >  	struct vfio_info_cap_header *header;
> >  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > +	u32 formats = 0;
> > +	int ret;
> > +
> > +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> > +	if (ret) {
> > +		pr_warn("Failed to get stage-1 format\n");
> > +		return ret;
> > +	}
> >
> >  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> >  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  		/* nesting iommu type supports PASID requests (alloc/free) */
> >  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> >  	}
> > +	nesting_cap->stage1_formats = formats;
> >
> >  	return 0;
> >  }
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ed9881d..ebeaf3e 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
> >  	struct	vfio_info_cap_header header;
> >  #define VFIO_IOMMU_PASID_REQS	(1 << 0)
> >  	__u32	nesting_capabilities;
> > +	__u32	stage1_formats;
> 
> do you plan to support multiple formats? If not, use singular name.

I do have such a plan. e.g. it may be helpful when one day a platform can
support multiple formats.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-03-30  9:43   ` Tian, Kevin
@ 2020-04-01  7:46     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  7:46 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 5:44 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to
> userspace
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch reports PASID alloc/free availability to userspace (e.g.
> > QEMU) thus userspace could do a pre-check before utilizing this feature.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  8 ++++++++
> >  2 files changed, 36 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index e40afc0..ddd1ffe 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > +					 struct vfio_info_cap *caps)
> > +{
> > +	struct vfio_info_cap_header *header;
> > +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > +
> > +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > +	if (IS_ERR(header))
> > +		return PTR_ERR(header);
> > +
> > +	nesting_cap = container_of(header,
> > +				struct vfio_iommu_type1_info_cap_nesting,
> > +				header);
> > +
> > +	nesting_cap->nesting_capabilities = 0;
> > +	if (iommu->nesting) {
> 
> Is it good to report a nesting cap when iommu->nesting is disabled? I suppose the
> check should move before vfio_info_cap_add...

oops, yes it is.
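
A small user-space sketch of the reordering, under the assumption that a non-nesting container should simply report no nesting capability at all. struct fake_caps, FAKE_CAP_PASID_REQ and fake_add_nesting_cap() are hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_CAP_PASID_REQ	(1u << 0)	/* hypothetical capability bit */

/* Hypothetical stand-in for the capability chain being built. */
struct fake_caps {
	int nr_caps;
	uint32_t nesting_capabilities;
};

/*
 * Check the nesting state first, so that nothing is added to the
 * chain for a non-nesting container.
 */
static int fake_add_nesting_cap(int nesting, struct fake_caps *caps)
{
	assert(caps != NULL);

	if (!nesting)
		return 0;	/* nothing to report */

	caps->nr_caps++;
	caps->nesting_capabilities = FAKE_CAP_PASID_REQ;
	return 0;
}
```

Userspace can then treat the mere presence of the capability as "nesting is enabled", instead of having to inspect an empty capability word.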

> 
> > +		/* nesting iommu type supports PASID requests (alloc/free) */
> > +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> 
> VFIO_IOMMU_CAP_PASID_REQ? to avoid confusion with ioctl cmd
> VFIO_IOMMU_PASID_REQUEST...

got it.

Thanks,
Yi Liu


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-03-30 12:58   ` Tian, Kevin
@ 2020-04-01  7:49     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  7:49 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 8:58 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@linux.intel.com>
> >
> > For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest "owns"
> > the first-level/stage-1 translation structures, the host IOMMU driver
> > has no knowledge of first-level/stage-1 structure cache updates unless
> > the guest invalidation requests are trapped and propagated to the host.
> >
> > This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to propagate
> > guest first-level/stage-1 IOMMU cache invalidations to host to ensure
> > IOMMU cache correctness.
> >
> > With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> > safely as the host IOMMU iotlb correctness are ensured.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
> >  2 files changed, 71 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index a877747..937ec3f 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2423,6 +2423,15 @@ static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_cache_inv_fn(struct device *dev, void *data)
> 
> vfio_iommu_cache_inv_fn

got it.

> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_cache_invalidate_info *cache_inv_info =
> > +		(struct iommu_cache_invalidate_info *) dc->data;
> > +
> > +	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		}
> >  		kfree(gbind_data);
> >  		return ret;
> > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > +		struct vfio_iommu_type1_cache_invalidate cache_inv;
> > +		u32 version;
> > +		int info_size;
> > +		void *cache_info;
> > +		int ret;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
> > +				    flags);
> > +
> > +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (cache_inv.argsz < minsz || cache_inv.flags)
> > +			return -EINVAL;
> > +
> > +		/* Get the version of struct iommu_cache_invalidate_info */
> > +		if (copy_from_user(&version,
> > +			(void __user *) (arg + minsz), sizeof(version)))
> > +			return -EFAULT;
> > +
> > +		info_size = iommu_uapi_get_data_size(
> > +					IOMMU_UAPI_CACHE_INVAL, version);
> > +
> > +		cache_info = kzalloc(info_size, GFP_KERNEL);
> > +		if (!cache_info)
> > +			return -ENOMEM;
> > +
> > +		if (copy_from_user(cache_info,
> > +			(void __user *) (arg + minsz), info_size)) {
> > +			kfree(cache_info);
> > +			return -EFAULT;
> > +		}
> > +
> > +		mutex_lock(&iommu->lock);
> > +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> > +					    cache_info);
> > +		mutex_unlock(&iommu->lock);
> > +		kfree(cache_info);
> > +		return ret;
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 2235bc6..62ca791 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
> >   */
> >  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> >
> > +/**
> > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> > + *			struct vfio_iommu_type1_cache_invalidate)
> > + *
> > + * Propagate guest IOMMU cache invalidation to the host. The cache
> > + * invalidation information is conveyed by @cache_info, the content
> > + * format would be structures defined in uapi/linux/iommu.h. User
> > + * should be aware of that the struct  iommu_cache_invalidate_info
> > + * has a @version field, vfio needs to parse this field before
> > + * getting data from userspace.
> > + *
> > + * Availability of this IOCTL is after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +struct vfio_iommu_type1_cache_invalidate {
> > +	__u32   argsz;
> > +	__u32   flags;
> > +	struct	iommu_cache_invalidate_info cache_info;
> > +};
> > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*
> > --
> > 2.7.4
> 
> This patch looks good to me in general. But since there is still a major open about
> version compatibility, I'll hold my r-b until that open is closed. 😊
> 
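
For reference while the version-compatibility question is open, here is a user-space sketch of the "read the version, then size the copy from it" pattern the patch uses, with the extra sanity checks one might want on the size. Everything prefixed fake_/FAKE_ is hypothetical: fake_get_data_size() stands in for iommu_uapi_get_data_size(), and the 16-byte version-1 payload size is made up for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Mirrors the argsz + flags header of the VFIO request. */
struct fake_inv_hdr {
	uint32_t argsz;
	uint32_t flags;
};

#define FAKE_INV_V1_SIZE 16	/* hypothetical payload size for version 1 */

/* Hypothetical stand-in for iommu_uapi_get_data_size(). */
static int fake_get_data_size(uint32_t version)
{
	return version == 1 ? FAKE_INV_V1_SIZE : -EINVAL;
}

/*
 * Parse a request laid out as a header followed by a payload whose
 * first 32 bits are its version. On success *payload points to a
 * heap copy that the caller must free(). Returns 0 or a negative errno.
 */
static int fake_parse_cache_inv(const unsigned char *arg, void **payload)
{
	struct fake_inv_hdr hdr;
	uint32_t version;
	int size;
	void *buf;

	assert(arg != NULL && payload != NULL);

	memcpy(&hdr, arg, sizeof(hdr));
	if (hdr.argsz < sizeof(hdr) + sizeof(version) || hdr.flags)
		return -EINVAL;

	/* Peek at the version to learn how large the payload is. */
	memcpy(&version, arg + sizeof(hdr), sizeof(version));
	size = fake_get_data_size(version);
	if (size < 0)
		return size;	/* unknown version */

	/* The caller must have supplied at least that many bytes. */
	if (hdr.argsz < sizeof(hdr) + (size_t)size)
		return -EINVAL;

	buf = calloc(1, (size_t)size);
	if (!buf)
		return -ENOMEM;
	memcpy(buf, arg + sizeof(hdr), (size_t)size);
	*payload = buf;
	return 0;
}
```

The point of the sketch is only that an unknown version and a too-small argsz are both rejected before any allocation happens; how versions should actually be negotiated is the open item mentioned above.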

thanks,

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs
  2020-03-30 13:18   ` Tian, Kevin
@ 2020-04-01  7:51     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  7:51 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 9:19 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed
> mdevs
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> > are used to achieve flexible device sharing across domains (e.g. VMs).
> 
> are->is

got it.

> > Also there are hardware assisted mediated pass-through solutions from
> > platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> > Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> > backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> > In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-
> > domain concept, which means mdevs are protected by an iommu domain
> > which is aux-domain of its physical device. Details can be found in
> > the KVM
> 
> "by an iommu domain which is auxiliary to the domain that the kernel driver
> primarily uses for DMA API"

yep.

> > presentation from Kevin Tian. IOMMU-backed equals to IOMMU-capable.
> >
> > https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> > Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> >
> > This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> > out the physical device of an IOMMU-backed mdev and then invoking
> > IOMMU requests to IOMMU layer with the physical device and the mdev's
> > aux domain info.
> 
> "and then calling into the IOMMU layer to complete the vSVA operations on the aux
> domain associated with that mdev"

got it.
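
The device selection that this patch repeats in the bind, unbind and cache-invalidate callbacks can be pictured with the user-space sketch below. struct fake_device, struct fake_group and fake_iommu_target() are hypothetical; in the kernel the lookup for an mdev is done by vfio_mdev_get_iommu_device().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct device and struct vfio_group. */
struct fake_device {
	const char *name;
	struct fake_device *iommu_device;	/* backing device of an mdev */
};

struct fake_group {
	int mdev_group;		/* mirrors vfio_group.mdev_group */
};

/*
 * Pick the device that IOMMU requests should be issued against: the
 * backing physical device for an mdev group, the device itself
 * otherwise. May return NULL if an mdev has no backing device.
 */
static struct fake_device *fake_iommu_target(const struct fake_group *group,
					     struct fake_device *dev)
{
	assert(group != NULL && dev != NULL);
	return group->mdev_group ? dev->iommu_device : dev;
}
```

A helper of this shape would let each callback make a single call into the IOMMU layer instead of duplicating the if/else, and gives one place to handle the NULL case.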
> >
> > With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> > on IOMMU-backed mdevs.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > CC: Jun Tian <jun.j.tian@intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++++++---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 937ec3f..d473665 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -132,6 +132,7 @@ struct vfio_regions {
> >
> >  struct domain_capsule {
> >  	struct iommu_domain *domain;
> > +	struct vfio_group *group;
> >  	void *data;
> >  };
> >
> > @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> >  	list_for_each_entry(d, &iommu->domain_list, next) {
> >  		dc.domain = d->domain;
> >  		list_for_each_entry(g, &d->group_list, next) {
> > +			dc.group = g;
> >  			ret = iommu_group_for_each_dev(g->iommu_group,
> >  						       &dc, fn);
> >  			if (ret)
> > @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> >  	struct iommu_gpasid_bind_data *gbind_data =
> >  		(struct iommu_gpasid_bind_data *) dc->data;
> >
> > -	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +	if (dc->group->mdev_group)
> > +		return iommu_sva_bind_gpasid(dc->domain,
> > +			vfio_mdev_get_iommu_device(dev), gbind_data);
> > +	else
> > +		return iommu_sva_bind_gpasid(dc->domain,
> > +						dev, gbind_data);
> >  }
> >
> >  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> >  	struct iommu_gpasid_bind_data *gbind_data =
> >  		(struct iommu_gpasid_bind_data *) dc->data;
> >
> > -	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +	if (dc->group->mdev_group)
> > +		return iommu_sva_unbind_gpasid(dc->domain,
> > +					vfio_mdev_get_iommu_device(dev),
> >  					gbind_data->hpasid);
> > +	else
> > +		return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +						gbind_data->hpasid);
> >  }
> >
> >  /**
> > @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> >  	struct iommu_cache_invalidate_info *cache_inv_info =
> >  		(struct iommu_cache_invalidate_info *) dc->data;
> >
> > -	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > +	if (dc->group->mdev_group)
> > +		return iommu_cache_invalidate(dc->domain,
> > +			vfio_mdev_get_iommu_device(dev), cache_inv_info);
> > +	else
> > +		return iommu_cache_invalidate(dc->domain,
> > +						dev, cache_inv_info);
> >  }
> 
> possibly above could be simplified, e.g.
> 
> static struct device *vfio_get_iommu_device(struct vfio_group *group,
> 	struct device *dev)
> {
> 	if  (group->mdev_group)
> 		return vfio_mdev_get_iommu_device(dev);
> 	else
> 		return dev;
> }
> 
> Then use it to replace plain 'dev' in all three places.

Yes, that reads better. Thanks.
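
For reference, a minimal user-space sketch of what the callbacks could
look like with such a helper. This is not the kernel code: the struct
definitions, bind_gpasid_stub() and the body of
vfio_mdev_get_iommu_device() are stand-ins made up for illustration.
Only the bind callback is shown; unbind and cache invalidate would
follow the same pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-ins for the kernel types; only the fields used here exist. */
struct device {
	struct device *iommu_device;	/* hypothetical: physical device behind an mdev */
};

struct vfio_group {
	bool mdev_group;
};

struct domain_capsule {
	struct vfio_group *group;
	void *data;
};

/* Stub carrying the kernel helper's name; assumed to return the backing device. */
static struct device *vfio_mdev_get_iommu_device(struct device *dev)
{
	return dev->iommu_device;
}

/* The helper proposed in the review. */
static struct device *vfio_get_iommu_device(struct vfio_group *group,
					    struct device *dev)
{
	if (group->mdev_group)
		return vfio_mdev_get_iommu_device(dev);
	return dev;
}

/* Hypothetical stub for iommu_sva_bind_gpasid(): records its device argument. */
static struct device *last_bound;

static int bind_gpasid_stub(struct device *dev, void *gbind_data)
{
	(void)gbind_data;
	last_bound = dev;
	return 0;
}

/* Callback shape after the cleanup: no if/else on mdev_group. */
static int vfio_bind_gpasid_fn(struct device *dev, void *data)
{
	struct domain_capsule *dc = data;

	return bind_gpasid_stub(vfio_get_iommu_device(dc->group, dev),
				dc->data);
}

/* Usage: an mdev resolves to its physical device, a plain device to itself. */
void usage_example(void)
{
	struct device phys = { .iommu_device = NULL };
	struct device mdev = { .iommu_device = &phys };
	struct device pci = { .iommu_device = NULL };
	struct vfio_group mdev_grp = { .mdev_group = true };
	struct vfio_group pci_grp = { .mdev_group = false };
	struct domain_capsule dc = { .group = &mdev_grp, .data = NULL };

	assert(vfio_bind_gpasid_fn(&mdev, &dc) == 0);
	assert(last_bound == &phys);

	dc.group = &pci_grp;
	assert(vfio_bind_gpasid_fn(&pci, &dc) == 0);
	assert(last_bound == &pci);
}
```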

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01  7:38     ` Liu, Yi L
@ 2020-04-01  7:56       ` Tian, Kevin
  2020-04-01  8:06         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-01  7:56 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 1, 2020 3:38 PM
> 
>  > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Monday, March 30, 2020 7:49 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> > userspace
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> > > capability to userspace. Thus applications like QEMU could support
> > > vIOMMU with hardware's nesting translation capability for pass-through
> > > devices. Before setting up nesting translation for pass-through
> > > devices, QEMU and other applications need to learn the supported
> > > 1st-lvl/stage-1 translation structure format like page table format.
> > >
> > > > Take vSVA (virtual Shared Virtual Addressing) as an example: to
> > > > support vSVA for pass-through devices, QEMU sets up nesting translation
> > > > for them. The guest page table is configured on the host as the
> > > > 1st-lvl/stage-1 page table. Therefore, the guest format should be
> > > > compatible with the host side.
> > >
> > > This patch reports the supported 1st-lvl/stage-1 page table format on
> > > the current platform to userspace. QEMU and other alike applications
> > > should use this format info when trying to setup IOMMU nesting
> > > translation on host IOMMU.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       |  1 +
> > >  2 files changed, 57 insertions(+)
> > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index 9aa2a67..82a9e0b 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2234,11 +2234,66 @@ static int
> vfio_iommu_type1_pasid_free(struct
> > > vfio_iommu *iommu,
> > >  	return ret;
> > >  }
> > >
> > > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > > +					 u32 *stage1_format)
> > > +{
> > > +	struct vfio_domain *domain;
> > > +	u32 format = 0, tmp_format = 0;
> > > +	int ret;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (list_empty(&iommu->domain_list)) {
> > > +		mutex_unlock(&iommu->lock);
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > +		if (iommu_domain_get_attr(domain->domain,
> > > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > > +			ret = -EINVAL;
> > > +			format = 0;
> > > +			goto out_unlock;
> > > +		}
> > > +		/*
> > > +		 * format is always non-zero (the first format is
> > > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> > > +		 * the reason of potential different backed IOMMU
> > > +		 * formats, here we expect to have identical formats
> > > +		 * in the domain list, no mixed formats support.
> > > +		 * return -EINVAL to fail the attempt of setup
> > > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> > > +		 * are detected.
> > > +		 */
> > > +		if (tmp_format && tmp_format != format) {
> > > +			ret = -EINVAL;
> > > +			format = 0;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		tmp_format = format;
> > > +	}
> >
> > this path is invoked only in VFIO_IOMMU_GET_INFO path. If we don't want to
> > assume the status quo that one container holds only one device w/ vIOMMU
> > (the prerequisite for vSVA), looks we also need check the format
> > compatibility when attaching a new group to this container?
> 
> right. If attaching to a nesting type container (the vfio_iommu.nesting bit
> indicates it), it should check whether it is compatible with the prior
> domains in the domain list. But if it is the first one attached to this
> container, it's fine. Does that sound good?

yes, but my point is whether we should check the format compatibility
in the attach path...

> 
> > > +	ret = 0;
> > > +
> > > +out_unlock:
> > > +	if (format)
> > > +		*stage1_format = format;
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > >  					 struct vfio_info_cap *caps)
> > >  {
> > >  	struct vfio_info_cap_header *header;
> > >  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > > +	u32 formats = 0;
> > > +	int ret;
> > > +
> > > +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> > > +	if (ret) {
> > > +		pr_warn("Failed to get stage-1 format\n");
> > > +		return ret;
> > > +	}
> > >
> > >  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > >  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING,
> > > 1);
> > > @@ -2254,6 +2309,7 @@ static int
> > > vfio_iommu_info_add_nesting_cap(struct
> > > vfio_iommu *iommu,
> > >  		/* nesting iommu type supports PASID requests (alloc/free)
> */
> > >  		nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;
> > >  	}
> > > +	nesting_cap->stage1_formats = formats;
> > >
> > >  	return 0;
> > >  }
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ed9881d..ebeaf3e 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
> > >  	struct	vfio_info_cap_header header;
> > >  #define VFIO_IOMMU_PASID_REQS	(1 << 0)
> > >  	__u32	nesting_capabilities;
> > > +	__u32	stage1_formats;
> >
> > do you plan to support multiple formats? If not, use singular name.
> 
> I do have such plan. e.g. it may be helpful when one day a platform can
> support multiple formats.
> 
> Regards,
> Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01  7:56       ` Tian, Kevin
@ 2020-04-01  8:06         ` Liu, Yi L
  2020-04-01  8:08           ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  8:06 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, April 1, 2020 3:56 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 1, 2020 3:38 PM
> >
> >  > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Monday, March 30, 2020 7:49 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1
> > > format to userspace
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > VFIO exposes IOMMU nesting translation (a.k.a dual stage
> > > > translation) capability to userspace. Thus applications like QEMU
> > > > could support vIOMMU with hardware's nesting translation
> > > > capability for pass-through devices. Before setting up nesting
> > > > translation for pass-through devices, QEMU and other applications
> > > > need to learn the supported
> > > > 1st-lvl/stage-1 translation structure format like page table format.
> > > >
> > > > > Take vSVA (virtual Shared Virtual Addressing) as an example: to
> > > > > support vSVA for pass-through devices, QEMU sets up nesting
> > > > > translation for them. The guest page table is configured on the
> > > > > host as the 1st-lvl/stage-1 page table. Therefore, the guest format
> > > > > should be compatible with the host side.
> > > >
> > > > This patch reports the supported 1st-lvl/stage-1 page table format
> > > > on the current platform to userspace. QEMU and other alike
> > > > applications should use this format info when trying to setup
> > > > IOMMU nesting translation on host IOMMU.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h       |  1 +
> > > >  2 files changed, 57 insertions(+)
> > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index 9aa2a67..82a9e0b 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -2234,11 +2234,66 @@ static int
> > vfio_iommu_type1_pasid_free(struct
> > > > vfio_iommu *iommu,
> > > >  	return ret;
> > > >  }
> > > >
> > > > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > > > +					 u32 *stage1_format)
> > > > +{
> > > > +	struct vfio_domain *domain;
> > > > +	u32 format = 0, tmp_format = 0;
> > > > +	int ret;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (list_empty(&iommu->domain_list)) {
> > > > +		mutex_unlock(&iommu->lock);
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > +		if (iommu_domain_get_attr(domain->domain,
> > > > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > > > +			ret = -EINVAL;
> > > > +			format = 0;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +		/*
> > > > +		 * format is always non-zero (the first format is
> > > > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> > > > +		 * the reason of potential different backed IOMMU
> > > > +		 * formats, here we expect to have identical formats
> > > > +		 * in the domain list, no mixed formats support.
> > > > +		 * return -EINVAL to fail the attempt of setup
> > > > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> > > > +		 * are detected.
> > > > +		 */
> > > > +		if (tmp_format && tmp_format != format) {
> > > > +			ret = -EINVAL;
> > > > +			format = 0;
> > > > +			goto out_unlock;
> > > > +		}
> > > > +
> > > > +		tmp_format = format;
> > > > +	}
> > >
> > > > this path is invoked only in VFIO_IOMMU_GET_INFO path. If we don't
> > > > want to assume the status quo that one container holds only one
> > > > device w/ vIOMMU (the prerequisite for vSVA), looks we also need
> > > > check the format compatibility when attaching a new group to this
> > > > container?
> >
> > right. If attaching to a nesting type container (the vfio_iommu.nesting
> > bit indicates it), it should check whether it is compatible with the
> > prior domains in the domain list. But if it is the first one attached
> > to this container, it's fine. Does that sound good?
> 
> yes, but my point is whether we should check the format compatibility
> in the attach path...

I guess so. Assume a device has been attached to a container and
userspace has fetched the nesting cap info; e.g. QEMU keeps a
per-container structure to store the nesting info. If another device
from a separate group is then attached and its backend IOMMU supports
a different format, there is a problem. If userspace reads the
nesting cap info again, it will get a different value, which may
affect the previously attached device. If userspace doesn't re-fetch
the nesting info, the newly added device may use a format which its
backend IOMMU doesn't support.

Granted, the vendor-specific IOMMU driver should ensure all devices
are backed by IOMMU units with the same capability (e.g. format). But
it would be better to have a check on the VFIO side all the same.
What's your opinion? :-)
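
To illustrate the kind of attach-time check being discussed, here is
a rough user-space sketch, not kernel code. container_model,
container_attach() and MODEL_MAX_DOMAINS are hypothetical names; the
sketch assumes, as the patch comment does, that 0 is never a valid
format.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_DOMAINS 8	/* arbitrary bound for this model */

/* Hypothetical model of a nesting container: one stage-1 format per domain. */
struct container_model {
	uint32_t formats[MODEL_MAX_DOMAINS];
	size_t nr_domains;
};

/*
 * Attach a domain whose backend IOMMU reports new_format.
 * The first domain is always accepted; every later one must match it,
 * so comparing against formats[0] is sufficient.
 */
static int container_attach(struct container_model *c, uint32_t new_format)
{
	if (new_format == 0)
		return -EINVAL;		/* 0 means "no format" in this model */
	if (c->nr_domains == MODEL_MAX_DOMAINS)
		return -ENOSPC;
	if (c->nr_domains > 0 && c->formats[0] != new_format)
		return -EINVAL;		/* mixed formats are rejected */

	c->formats[c->nr_domains++] = new_format;
	return 0;
}

/* Usage: two matching domains attach, a third with another format does not. */
void usage_example(void)
{
	struct container_model c = { .nr_domains = 0 };

	assert(container_attach(&c, 1) == 0);
	assert(container_attach(&c, 1) == 0);
	assert(container_attach(&c, 2) == -EINVAL);
	assert(c.nr_domains == 2);
}
```

With the check done at attach time, the value userspace cached from
the first read of the nesting cap stays valid for the lifetime of the
container.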

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01  8:06         ` Liu, Yi L
@ 2020-04-01  8:08           ` Tian, Kevin
  2020-04-01  8:09             ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-01  8:08 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 1, 2020 4:07 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Wednesday, April 1, 2020 3:56 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> > userspace
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Wednesday, April 1, 2020 3:38 PM
> > >
> > >  > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Monday, March 30, 2020 7:49 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1
> > > > format to userspace
> > > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > >
> > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > >
> > > > > VFIO exposes IOMMU nesting translation (a.k.a dual stage
> > > > > translation) capability to userspace. Thus applications like QEMU
> > > > > could support vIOMMU with hardware's nesting translation
> > > > > capability for pass-through devices. Before setting up nesting
> > > > > translation for pass-through devices, QEMU and other applications
> > > > > need to learn the supported
> > > > > 1st-lvl/stage-1 translation structure format like page table format.
> > > > >
> > > > > > Take vSVA (virtual Shared Virtual Addressing) as an example: to
> > > > > > support vSVA for pass-through devices, QEMU sets up nesting
> > > > > > translation for them. The guest page table is configured on the
> > > > > > host as the 1st-lvl/stage-1 page table. Therefore, the guest
> > > > > > format should be compatible with the host side.
> > > > >
> > > > > This patch reports the supported 1st-lvl/stage-1 page table format
> > > > > on the current platform to userspace. QEMU and other alike
> > > > > applications should use this format info when trying to setup
> > > > > IOMMU nesting translation on host IOMMU.
> > > > >
> > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > ---
> > > > > >  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
> > > > >  include/uapi/linux/vfio.h       |  1 +
> > > > >  2 files changed, 57 insertions(+)
> > > > >
> > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > index 9aa2a67..82a9e0b 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -2234,11 +2234,66 @@ static int
> > > vfio_iommu_type1_pasid_free(struct
> > > > > vfio_iommu *iommu,
> > > > >  	return ret;
> > > > >  }
> > > > >
> > > > > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > > > > +					 u32 *stage1_format)
> > > > > +{
> > > > > +	struct vfio_domain *domain;
> > > > > +	u32 format = 0, tmp_format = 0;
> > > > > +	int ret;
> > > > > +
> > > > > +	mutex_lock(&iommu->lock);
> > > > > +	if (list_empty(&iommu->domain_list)) {
> > > > > +		mutex_unlock(&iommu->lock);
> > > > > +		return -EINVAL;
> > > > > +	}
> > > > > +
> > > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > > +		if (iommu_domain_get_attr(domain->domain,
> > > > > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > > > > +			ret = -EINVAL;
> > > > > +			format = 0;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +		/*
> > > > > +		 * format is always non-zero (the first format is
> > > > > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1).
> For
> > > > > +		 * the reason of potential different backed IOMMU
> > > > > +		 * formats, here we expect to have identical formats
> > > > > +		 * in the domain list, no mixed formats support.
> > > > > +		 * return -EINVAL to fail the attempt of setup
> > > > > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical
> formats
> > > > > +		 * are detected.
> > > > > +		 */
> > > > > +		if (tmp_format && tmp_format != format) {
> > > > > +			ret = -EINVAL;
> > > > > +			format = 0;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		tmp_format = format;
> > > > > +	}
> > > >
> > > > > this path is invoked only in VFIO_IOMMU_GET_INFO path. If we don't
> > > > > want to assume the status quo that one container holds only one
> > > > > device w/ vIOMMU (the prerequisite for vSVA), looks we also need
> > > > > check the format compatibility when attaching a new group to this
> > > > > container?
> > >
> > > right. If attaching to a nesting type container (the vfio_iommu.nesting
> > > bit indicates it), it should check whether it is compatible with the
> > > prior domains in the domain list. But if it is the first one attached
> > > to this container, it's fine. Does that sound good?
> >
> > yes, but my point is whether we should check the format compatibility
> > in the attach path...
> 
> I guess so. Assume a device has been attached to a container and
> userspace has fetched the nesting cap info; e.g. QEMU keeps a
> per-container structure to store the nesting info. If another device
> from a separate group is then attached and its backend IOMMU supports
> a different format, there is a problem. If userspace reads the
> nesting cap info again, it will get a different value, which may
> affect the previously attached device. If userspace doesn't re-fetch
> the nesting info, the newly added device may use a format which its
> backend IOMMU doesn't support.
> 
> Granted, the vendor-specific IOMMU driver should ensure all devices
> are backed by IOMMU units with the same capability (e.g. format). But
> it would be better to have a check on the VFIO side all the same.
> What's your opinion? :-)
> 

I think so. 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01  8:08           ` Tian, Kevin
@ 2020-04-01  8:09             ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  8:09 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Wednesday, April 1, 2020 4:09 PM
> Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 1, 2020 4:07 PM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Wednesday, April 1, 2020 3:56 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1
> > > format to userspace
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Wednesday, April 1, 2020 3:38 PM
> > > >
> > > >  > From: Tian, Kevin <kevin.tian@intel.com>
> > > > > Sent: Monday, March 30, 2020 7:49 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > > Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1
> > > > > format to userspace
> > > > >
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > > >
> > > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > > >
> > > > > > VFIO exposes IOMMU nesting translation (a.k.a dual stage
> > > > > > translation) capability to userspace. Thus applications like
> > > > > > QEMU could support vIOMMU with hardware's nesting translation
> > > > > > capability for pass-through devices. Before setting up nesting
> > > > > > translation for pass-through devices, QEMU and other
> > > > > > applications need to learn the supported
> > > > > > 1st-lvl/stage-1 translation structure format like page table format.
> > > > > >
> > > > > > > Take vSVA (virtual Shared Virtual Addressing) as an example:
> > > > > > > to support vSVA for pass-through devices, QEMU sets up nesting
> > > > > > > translation for them. The guest page table is configured on
> > > > > > > the host as the 1st-lvl/stage-1 page table. Therefore, the
> > > > > > > guest format should be compatible with the host side.
> > > > > >
> > > > > > This patch reports the supported 1st-lvl/stage-1 page table
> > > > > > format on the current platform to userspace. QEMU and other
> > > > > > alike applications should use this format info when trying to
> > > > > > setup IOMMU nesting translation on host IOMMU.
> > > > > >
> > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > ---
> > > > > > >  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
> > > > > >  include/uapi/linux/vfio.h       |  1 +
> > > > > >  2 files changed, 57 insertions(+)
> > > > > >
> > > > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > > > index 9aa2a67..82a9e0b 100644
> > > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > > @@ -2234,11 +2234,66 @@ static int
> > > > vfio_iommu_type1_pasid_free(struct
> > > > > > vfio_iommu *iommu,
> > > > > >  	return ret;
> > > > > >  }
> > > > > >
> > > > > > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > > > > > +					 u32 *stage1_format)
> > > > > > +{
> > > > > > +	struct vfio_domain *domain;
> > > > > > +	u32 format = 0, tmp_format = 0;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	mutex_lock(&iommu->lock);
> > > > > > +	if (list_empty(&iommu->domain_list)) {
> > > > > > +		mutex_unlock(&iommu->lock);
> > > > > > +		return -EINVAL;
> > > > > > +	}
> > > > > > +
> > > > > > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > > > > > +		if (iommu_domain_get_attr(domain->domain,
> > > > > > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > > > > > +			ret = -EINVAL;
> > > > > > +			format = 0;
> > > > > > +			goto out_unlock;
> > > > > > +		}
> > > > > > +		/*
> > > > > > +		 * format is always non-zero (the first format is
> > > > > > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1).
> > For
> > > > > > +		 * the reason of potential different backed IOMMU
> > > > > > +		 * formats, here we expect to have identical formats
> > > > > > +		 * in the domain list, no mixed formats support.
> > > > > > +		 * return -EINVAL to fail the attempt of setup
> > > > > > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical
> > formats
> > > > > > +		 * are detected.
> > > > > > +		 */
> > > > > > +		if (tmp_format && tmp_format != format) {
> > > > > > +			ret = -EINVAL;
> > > > > > +			format = 0;
> > > > > > +			goto out_unlock;
> > > > > > +		}
> > > > > > +
> > > > > > +		tmp_format = format;
> > > > > > +	}
> > > > >
> > > > > > this path is invoked only in VFIO_IOMMU_GET_INFO path. If we
> > > > > > don't want to assume the status quo that one container holds only
> > > > > > one device w/ vIOMMU (the prerequisite for vSVA), looks we also
> > > > > > need check the format compatibility when attaching a new group to
> > > > > > this container?
> > > >
> > > > right. If attaching to a nesting type container (the
> > > > vfio_iommu.nesting bit indicates it), it should check whether it is
> > > > compatible with the prior domains in the domain list. But if it is
> > > > the first one attached to this container, it's fine. Does that sound
> > > > good?
> > >
> > > yes, but my point is whether we should check the format
> > > compatibility in the attach path...
> >
> > I guess so. Assume a device has been attached to a container and
> > userspace has fetched the nesting cap info; e.g. QEMU keeps a
> > per-container structure to store the nesting info. If another device
> > from a separate group is then attached and its backend IOMMU supports
> > a different format, there is a problem. If userspace reads the
> > nesting cap info again, it will get a different value, which may
> > affect the previously attached device. If userspace doesn't re-fetch
> > the nesting info, the newly added device may use a format which its
> > backend IOMMU doesn't support.
> >
> > Granted, the vendor-specific IOMMU driver should ensure all devices
> > are backed by IOMMU units with the same capability (e.g. format). But
> > it would be better to have a check on the VFIO side all the same.
> > What's your opinion? :-)
> >
> 
> I think so.

Thanks, :-)

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
  2020-03-22 16:44   ` kbuild test robot
  2020-03-30 11:48   ` Tian, Kevin
@ 2020-04-01  8:51   ` Auger Eric
  2020-04-01 12:51     ` Liu, Yi L
  2020-04-02 19:20   ` Alex Williamson
  3 siblings, 1 reply; 110+ messages in thread
From: Auger Eric @ 2020-04-01  8:51 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

Hi Yi,
On 3/22/20 1:32 PM, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> capability to userspace. Thus applications like QEMU could support
> vIOMMU with hardware's nesting translation capability for pass-through
> devices. Before setting up nesting translation for pass-through devices,
> QEMU and other applications need to learn the supported 1st-lvl/stage-1
> translation structure format like page table format.
> 
> Take vSVA (virtual Shared Virtual Addressing) as an example: to support
> vSVA for pass-through devices, QEMU sets up nesting translation for them.
> The guest page table is configured on the host as the 1st-lvl/stage-1
> page table. Therefore, the guest format should be compatible with the
> host side.
> 
> This patch reports the supported 1st-lvl/stage-1 page table format on the
> current platform to userspace. QEMU and other alike applications should
> use this format info when trying to setup IOMMU nesting translation on
> host IOMMU.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  1 +
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 9aa2a67..82a9e0b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> +					 u32 *stage1_format)
Suggest naming this vfio_pasid_format(), for consistency with
vfio_pgsize_bitmap(), which does the same kind of enumeration over the
vfio_iommu domains.
> +{
> +	struct vfio_domain *domain;
> +	u32 format = 0, tmp_format = 0;
> +	int ret;
ret = -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +	if (list_empty(&iommu->domain_list)) {
goto out_unlock;
> +		mutex_unlock(&iommu->lock);
> +		return -EINVAL;
> +	}
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (iommu_domain_get_attr(domain->domain,
> +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
I can find DOMAIN_ATTR_PASID_FORMAT in Jacob's v9 but not in v10
> +			ret = -EINVAL;
could be removed
> +			format = 0;
> +			goto out_unlock;
> +		}
> +		/*
> +		 * format is always non-zero (the first format is
> +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> +		 * the reason of potential different backed IOMMU
> +		 * formats, here we expect to have identical formats
> +		 * in the domain list, no mixed formats support.
> +		 * return -EINVAL to fail the attempt of setup
> +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> +		 * are detected.
> +		 */
> +		if (tmp_format && tmp_format != format) {
> +			ret = -EINVAL;
could be removed
> +			format = 0;
> +			goto out_unlock;
> +		}
> +
> +		tmp_format = format;
> +	}
> +	ret = 0;
> +
> +out_unlock:
> +	if (format)
if (!ret) ? then you can remove the format = 0 in case of error.
> +		*stage1_format = format;
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  					 struct vfio_info_cap *caps)
>  {
>  	struct vfio_info_cap_header *header;
>  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +	u32 formats = 0;
> +	int ret;
> +
> +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> +	if (ret) {
> +		pr_warn("Failed to get stage-1 format\n");
trace triggered by userspace to be removed?
> +		return ret;
> +	}
>  
>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  		/* nesting iommu type supports PASID requests (alloc/free) */
>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
What is the meaning for ARM?
>  	}
> +	nesting_cap->stage1_formats = formats;
as spotted by Kevin, since a single format is supported, rename
>  
>  	return 0;
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ed9881d..ebeaf3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
>  	struct	vfio_info_cap_header header;
>  #define VFIO_IOMMU_PASID_REQS	(1 << 0)
>  	__u32	nesting_capabilities;
> +	__u32	stage1_formats;
>  };
>  
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> 
Thanks

Eric
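
For illustration only, here is how the review comments above could be folded into the helper: ret starts at -EINVAL, every error path jumps to the common label, and the output is written only when ret is 0. This is a standalone sketch with a mock domain type, not the kernel code. struct mock_domain and get_stage1_format() are hypothetical names, a format value of 0 stands in for a failed attribute query, and the iommu->lock handling is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one vfio_domain; format 0 models a failed query. */
struct mock_domain {
	uint32_t pasid_format;
};

/*
 * Returns 0 and stores the common format in *stage1_format when every
 * domain reports the same non-zero format. Returns -EINVAL otherwise and
 * leaves *stage1_format untouched.
 */
static int get_stage1_format(const struct mock_domain *domains, size_t n,
			     uint32_t *stage1_format)
{
	uint32_t format = 0, tmp_format = 0;
	int ret = -EINVAL;
	size_t i;

	/* Locking omitted: the real helper would hold iommu->lock here. */
	if (n == 0)
		goto out;

	for (i = 0; i < n; i++) {
		format = domains[i].pasid_format;
		if (!format)		/* attribute query failed */
			goto out;
		if (tmp_format && tmp_format != format)
			goto out;	/* mixed formats are rejected */
		tmp_format = format;
	}
	ret = 0;

out:
	if (!ret)
		*stage1_format = format;
	return ret;
}
```

With ret preset to the error value, the "format = 0" resets and the per-branch "ret = -EINVAL" assignments are no longer needed.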


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-03-30 12:46   ` Tian, Kevin
@ 2020-04-01  9:13     ` Liu, Yi L
  2020-04-02  2:12       ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01  9:13 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, March 30, 2020 8:46 PM
> Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Sunday, March 22, 2020 8:32 PM
> >
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by hardware
> > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > translation). For such hardware IOMMUs, there are two stages/levels of
> > address translation, and software may let userspace/VM to own the
> > first-level/stage-1 translation structures. Example of such usage is vSVA (
> > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > translation structures and bind the structures to host, then hardware
> > IOMMU would utilize nesting translation when doing DMA translation for
> > the devices behind such hardware IOMMU.
> >
> > This patch adds vfio support for binding guest translation (a.k.a
> > stage 1) structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU,
> > not only bind guest page table is needed, it also requires to expose
> > interface to guest for iommu cache invalidation when guest modified
> > the first-level/stage-1 translation structures since hardware needs to
> > be notified to flush stale iotlbs. This would be introduced in next
> > patch.
> >
> > In this patch, guest page table bind and unbind are done by using
> > flags VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL
> > under IOCTL VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to
> > host, VM should have got a PASID allocated by host via
> > VFIO_IOMMU_PASID_REQUEST.
> >
> > Bind guest translation structures (here is guest page table) to host
> 
> Bind -> Binding
got it.
> > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> 
> are -> is. and you already explained vSVA earlier.
oh yes, it is.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> >  2 files changed, 204 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 82a9e0b..a877747 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -130,6 +130,33 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> >  					(!list_empty(&iommu->domain_list))
> >
> > +struct domain_capsule {
> > +	struct iommu_domain *domain;
> > +	void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > +		      int (*fn)(struct device *dev, void *data),
> > +		      void *data)
> > +{
> > +	struct domain_capsule dc = {.data = data};
> > +	struct vfio_domain *d;
> > +	struct vfio_group *g;
> > +	int ret = 0;
> > +
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		dc.domain = d->domain;
> > +		list_for_each_entry(g, &d->group_list, next) {
> > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > +						       &dc, fn);
> > +			if (ret)
> > +				break;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> >
> >  /*
> > @@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  	return 0;
> >  }
> >
> > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *gbind_data =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> 
> In Jacob's vSVA iommu series, [PATCH 06/11]:
> 
> +		/* REVISIT: upper layer/VFIO can track host process that bind the
> PASID.
> +		 * ioasid_set = mm might be sufficient for vfio to check pasid VMM
> +		 * ownership.
> +		 */
> 
> I asked him who exactly should be responsible for tracking the pasid ownership.
> Although no response yet, I expect vfio/iommu can have a clear policy and also
> documented here to provide consistent message.

yep.

> > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +}
> > +
> > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *gbind_data =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +					gbind_data->hpasid);
> 
> curious why we have to share the same bind_data structure between bind and
> unbind, especially when unbind requires only one field? I didn't see a clear reason,
> and just similar to earlier ALLOC/FREE which don't share structure either.
> Current way simply wastes space for unbind operation...

No special reason today. But gIOVA support over nested translation is
planned, and it may require a flag, since the guest iommu driver may use
a single PASID value (RID2PASID) for all devices on the guest side,
especially if the RID2PASID value used for IOVA is the same as on the
host side. So adding a flag to indicate that the binding is for IOVA is
helpful. For PF/VF, the iommu driver just binds with the host side's
RID2PASID. For ADI (Assignable Device Interface), the vfio layer needs to
figure out the default PASID stored in the aux-domain, and then the iommu
driver binds the gIOVA table to that default PASID. The potential flag is
required in both the bind and unbind paths. As such, it would be better to
share the structure.

> > +}
> > +
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	return vfio_iommu_for_each_dev(iommu,
> > +				vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_for_each_dev(iommu,
> > +			vfio_bind_gpasid_fn, gbind_data);
> > +	/*
> > +	 * If bind failed, it may not be a total failure. Some devices
> > +	 * within the iommu group may have bind successfully. Although
> > +	 * we don't enable pasid capability for non-singletion iommu
> > +	 * groups, a unbind operation would be helpful to ensure no
> > +	 * partial binding for an iommu group.
> > +	 */
> > +	if (ret)
> > +		/*
> > +		 * Undo all binds that already succeeded, no need to
> 
> binds -> bindings
got it.
> 
> > +		 * check the return value here since some device within
> > +		 * the group has no successful bind when coming to this
> > +		 * place switch.
> > +		 */
> 
> remove 'switch'
oh, yes.

> 
> > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		default:
> >  			return -EINVAL;
> >  		}
> > +
> > +	} else if (cmd == VFIO_IOMMU_BIND) {
> 
> BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.

Emm, it's up to the flags to indicate what is bound. It was proposed to
cover the three cases below:
a) BIND/UNBIND_GPASID
b) BIND/UNBIND_GPASID_TABLE
c) BIND/UNBIND_PROCESS
<only a) is covered in this patch>
So it's called VFIO_IOMMU_BIND.
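
A minimal sketch of the flag-based dispatch described above, written as standalone C so it can be compiled outside the kernel. The two flag values and the mask are copied from the patch; enum bind_op and select_bind_op() are hypothetical names used only here. Cases b) and c) would presumably add further flag bits and switch cases, but that is an assumption and not something this patch defines.

```c
#include <errno.h>
#include <stdint.h>

/* Flag values copied from the patch under review. */
#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
				 VFIO_IOMMU_UNBIND_GUEST_PGTBL)

/* Hypothetical operation codes, used only by this sketch. */
enum bind_op { OP_BIND_GPASID = 1, OP_UNBIND_GPASID = 2 };

/*
 * Returns the selected operation, or -EINVAL when the masked flags name
 * no operation or more than one.
 */
static int select_bind_op(uint32_t flags)
{
	switch (flags & VFIO_IOMMU_BIND_MASK) {
	case VFIO_IOMMU_BIND_GUEST_PGTBL:
		return OP_BIND_GPASID;
	case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
		return OP_UNBIND_GPASID;
	default:
		return -EINVAL;
	}
}
```

Setting both flags falls through to the default case and fails. Bits outside the mask are ignored by this switch, so a caller that wants unknown bits rejected would need a separate check.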

> 
> > +		struct vfio_iommu_type1_bind bind;
> > +		u32 version;
> > +		int data_size;
> > +		void *gbind_data;
> > +		int ret;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > +
> > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (bind.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		/* Get the version of struct iommu_gpasid_bind_data */
> > +		if (copy_from_user(&version,
> > +			(void __user *) (arg + minsz),
> > +					sizeof(version)))
> > +			return -EFAULT;
> > +
> > +		data_size = iommu_uapi_get_data_size(
> > +				IOMMU_UAPI_BIND_GPASID, version);
> > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > +		if (!gbind_data)
> > +			return -ENOMEM;
> > +
> > +		if (copy_from_user(gbind_data,
> > +			 (void __user *) (arg + minsz), data_size)) {
> > +			kfree(gbind_data);
> > +			return -EFAULT;
> > +		}
> > +
> > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > +							   gbind_data);
> > +			break;
> > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > +							     gbind_data);
> > +			break;
> > +		default:
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +		kfree(gbind_data);
> > +		return ret;
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ebeaf3e..2235bc6 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -14,6 +14,7 @@
> >
> >  #include <linux/types.h>
> >  #include <linux/ioctl.h>
> > +#include <linux/iommu.h>
> >
> >  #define VFIO_API_VERSION	0
> >
> > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> >   */
> >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
> >
> > +/**
> > + * Supported flags:
> > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
> > + *			nesting type IOMMUs. In @data field It takes struct
> > + *			iommu_gpasid_bind_data.
> > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
> > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > + *
> > + */
> > +struct vfio_iommu_type1_bind {
> > +	__u32		argsz;
> > +	__u32		flags;
> > +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> > +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> > +	__u8		data[];
> > +};
> > +
> > +#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
> > +					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> > +
> > +/**
> > + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> > + *				struct vfio_iommu_type1_bind)
> > + *
> > + * Manage address spaces of devices in this container. Initially a TYPE1
> > + * container can only have one address space, managed with
> > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> 
> the last sentence seems irrelevant and more suitable in commit msg.

oh, I could remove it.

> > + *
> > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
> > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
> > + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> 
> Are "other types" the counterpart to VFIO_TYPE1_NESTING_IOMMU?
> What are those types? I thought only NESTING_IOMMU allows two stage
> translation...

It's a mistake; please ignore that sentence. I'll correct it in the next version.

> 
> > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> 
> The first sentence said the same thing. Then what is the exact difference?

This sentence was added by mistake. Will correct it.

> 
> > MAP/UNMAP controls
> > + * the traffics only require single stage translation while BIND controls the
> > + * traffics require nesting translation. But this depends on the underlying
> > + * IOMMU architecture and isn't guaranteed. Example of this is the guest SVA
> > + * traffics, such traffics need nesting translation to gain gVA->gPA and then
> > + * gPA->hPA translation.
> 
> I'm a bit confused about the content since "other types of". Are they trying to state
> some exceptions/corner cases that this API cannot resolve or explain the desired
> behavior of the API? Especially the last example, which is worded as if the example
> for "isn't guaranteed"
> but isn't guest SVA the main purpose of this API?
> 
I think the description in the original patch is bad, especially the "other
types" phrase. How about the description below?

/**
 * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
 *				struct vfio_iommu_type1_bind)
 *
 * Manage address spaces of devices in this container when it's an IOMMU
 * of type VFIO_TYPE1_NESTING_IOMMU. This type of IOMMU allows MAP/UNMAP and
 * BIND to coexist, where MAP/UNMAP controls traffic that requires only
 * single-stage translation, while BIND controls traffic that requires
 * nesting translation.
 *
 * Availability of this feature depends on the device, its bus, the underlying
 * IOMMU and the CPU architecture.
 *
 * returns: 0 on success, -errno on failure.
 */
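
As a rough illustration of how a caller might lay out the argument for this ioctl: the quoted kernel code reads the fixed header up to the end of flags and then copies the payload from the bytes that follow, which is where data[] starts. The sketch below mirrors the proposed struct locally, since it is not a merged uAPI. struct bind_hdr and build_bind_arg() are hypothetical names, and the payload is treated as opaque bytes instead of a real struct iommu_gpasid_bind_data.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the struct proposed in the patch (not a merged uAPI). */
struct bind_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint8_t  data[];
};

/*
 * Allocate a header followed by a copy of the payload; argsz covers both.
 * Returns NULL on bad input or allocation failure. The caller frees the
 * returned buffer.
 */
static struct bind_hdr *build_bind_arg(uint32_t flags, const void *payload,
				       size_t payload_len)
{
	size_t total = sizeof(struct bind_hdr) + payload_len;
	struct bind_hdr *hdr;

	if (total < payload_len || total > UINT32_MAX)
		return NULL;	/* size would not fit in argsz */
	if (!payload && payload_len)
		return NULL;

	hdr = calloc(1, total);
	if (!hdr)
		return NULL;

	hdr->argsz = (uint32_t)total;
	hdr->flags = flags;
	if (payload_len)
		memcpy(hdr->data, payload, payload_len);
	return hdr;
}
```

The resulting buffer would be what gets passed as the ioctl argument, assuming the interface lands in the form proposed here.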

Regards,
Yi Liu
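
For readers following the bind path quoted earlier in this message, the "bind everything, and undo everything if any bind fails" behaviour can be sketched in isolation as below. This is a simplified model and not the kernel implementation: struct mock_dev, mock_bind(), mock_unbind() and bind_all_or_nothing() are hypothetical, the devices sit in a flat array instead of nested domain and group lists, and unbinding a device that was never bound is assumed to be harmless.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-device state standing in for a real device. */
struct mock_dev {
	int fail_bind;	/* set to make the bind fail for this device */
	int bound;	/* 1 once the bind has succeeded */
};

static int mock_bind(struct mock_dev *dev)
{
	if (dev->fail_bind)
		return -EIO;
	dev->bound = 1;
	return 0;
}

/* Unbinding a device that was never bound is treated as a no-op. */
static void mock_unbind(struct mock_dev *dev)
{
	dev->bound = 0;
}

/* Bind every device; on the first failure undo all bindings and report it. */
static int bind_all_or_nothing(struct mock_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = mock_bind(&devs[i]);
		if (ret)
			break;
	}

	if (ret) {
		/* Unbind results are ignored, as in the quoted comment. */
		for (i = 0; i < n; i++)
			mock_unbind(&devs[i]);
	}
	return ret;
}
```

The sketch uses a single loop, so the break stops the iteration at the first failure. With nested loops, a break in the inner loop alone does not leave the outer one, so the outer loop would also need to test ret.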


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-03-22 12:32 ` [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace Liu, Yi L
  2020-03-30  9:43   ` Tian, Kevin
@ 2020-04-01  9:41   ` Auger Eric
  2020-04-01 13:13     ` Liu, Yi L
  2020-04-02 18:01   ` Alex Williamson
  2 siblings, 1 reply; 110+ messages in thread
From: Auger Eric @ 2020-04-01  9:41 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson
  Cc: kevin.tian, jacob.jun.pan, joro, ashok.raj, jun.j.tian, yi.y.sun,
	jean-philippe, peterx, iommu, kvm, linux-kernel, hao.wu

Yi,
On 3/22/20 1:32 PM, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
> thus userspace could do a pre-check before utilizing this feature.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  8 ++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e40afc0..ddd1ffe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> +					 struct vfio_info_cap *caps)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +
> +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	nesting_cap = container_of(header,
> +				struct vfio_iommu_type1_info_cap_nesting,
> +				header);
> +
> +	nesting_cap->nesting_capabilities = 0;
> +	if (iommu->nesting) {
> +		/* nesting iommu type supports PASID requests (alloc/free) */
> +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
Supporting nesting does not necessarily mean supporting PASID.
> +	}
> +
> +	return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		if (ret)
>  			return ret;
>  
> +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> +		if (ret)
> +			return ret;
> +
>  		if (caps.size) {
>  			info.flags |= VFIO_IOMMU_INFO_CAPS;
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 298ac80..8837219 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
>  	struct	vfio_iova_range iova_ranges[];
>  };
>  
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> +	struct	vfio_info_cap_header header;
> +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
PASID_REQS sounds a bit far from the claimed host managed alloc/free
capability.
VFIO_IOMMU_SYSTEM_WIDE_PASID?
> +	__u32	nesting_capabilities;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>  
>  /**
> 
Thanks

Eric
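
To make the userspace pre-check concrete, here is a sketch of how a consumer might walk the capability chain and test the PASID bit instead of inferring it from nesting support alone. It assumes the usual vfio_info_cap_header layout (id, version, next, with next being an offset from the start of the info buffer and 0 ending the chain) and mirrors it locally so that it compiles without kernel headers. The capability ID 2 and the bit value come from this patch. find_cap() and has_pasid_reqs() are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header from <linux/vfio.h>. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next capability from buffer start, 0 ends */
};

#define CAP_NESTING_ID	2		/* VFIO_IOMMU_TYPE1_INFO_CAP_NESTING in this patch */
#define PASID_REQS	(1u << 0)	/* VFIO_IOMMU_PASID_REQS in this patch */

/*
 * Walk the chain starting at first_off. Returns the offset of the
 * capability with the given id, or 0 if it is absent or the chain is
 * malformed.
 */
static uint32_t find_cap(const uint8_t *buf, size_t len, uint32_t first_off,
			 uint16_t id)
{
	uint32_t off = first_off;
	size_t hops = 0;

	while (off && hops++ < len) {	/* hop limit guards against loops */
		struct cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return 0;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}

/* Returns 1 if the nesting capability exists and advertises PASID requests. */
static int has_pasid_reqs(const uint8_t *buf, size_t len, uint32_t first_off)
{
	uint32_t off = find_cap(buf, len, first_off, CAP_NESTING_ID);
	uint32_t caps;

	if (!off || len - off < sizeof(struct cap_header) + sizeof(caps))
		return 0;
	memcpy(&caps, buf + off + sizeof(struct cap_header), sizeof(caps));
	return (caps & PASID_REQS) != 0;
}
```

In real use, buf would be the buffer filled by VFIO_IOMMU_GET_INFO and first_off the capability offset reported in it.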


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01  8:51   ` Auger Eric
@ 2020-04-01 12:51     ` Liu, Yi L
  2020-04-01 13:01       ` Auger Eric
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01 12:51 UTC (permalink / raw)
  To: Auger Eric, alex.williamson
  Cc: Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Wednesday, April 1, 2020 4:51 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com
> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> Hi Yi,
> On 3/22/20 1:32 PM, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> > capability to userspace. Thus applications like QEMU could support
> > vIOMMU with hardware's nesting translation capability for pass-through
> > devices. Before setting up nesting translation for pass-through devices,
> > QEMU and other applications need to learn the supported 1st-lvl/stage-1
> > translation structure format like page table format.
> >
> > Take vSVA (virtual Shared Virtual Addressing) as an example, to support
> > vSVA for pass-through devices, QEMU setup nesting translation for pass-
> > through devices. The guest page table are configured to host as 1st-lvl/
> > stage-1 page table. Therefore, guest format should be compatible with
> > host side.
> >
> > This patch reports the supported 1st-lvl/stage-1 page table format on the
> > current platform to userspace. QEMU and other alike applications should
> > use this format info when trying to setup IOMMU nesting translation on
> > host IOMMU.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 56
> +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  1 +
> >  2 files changed, 57 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 9aa2a67..82a9e0b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct
> vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > +					 u32 *stage1_format)
> vfio_pasid_format() to be homogeneous with vfio_pgsize_bitmap() which
> does the same kind of enumeration of the vfio_iommu domains

yes, similar.

> > +{
> > +	struct vfio_domain *domain;
> > +	u32 format = 0, tmp_format = 0;
> > +	int ret;
> ret = -EINVAL;

got it.

> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (list_empty(&iommu->domain_list)) {
> goto out_unlock;

right.
> > +		mutex_unlock(&iommu->lock);
> > +		return -EINVAL;
> > +	}
> > +
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		if (iommu_domain_get_attr(domain->domain,
> > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> I can find DOMAIN_ATTR_PASID_FORMAT in Jacob's v9 but not in v10

Oops, I guess he somehow missed it. You can find it at the link below.

https://github.com/luxis1999/linux-vsva/commit/bf14b11a12f74d58ad3ee626a5d891de395082eb

> > +			ret = -EINVAL;
> could be removed

sure.

> > +			format = 0;
> > +			goto out_unlock;
> > +		}
> > +		/*
> > +		 * format is always non-zero (the first format is
> > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> > +		 * the reason of potential different backed IOMMU
> > +		 * formats, here we expect to have identical formats
> > +		 * in the domain list, no mixed formats support.
> > +		 * return -EINVAL to fail the attempt of setup
> > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> > +		 * are detected.
> > +		 */
> > +		if (tmp_format && tmp_format != format) {
> > +			ret = -EINVAL;
> could be removed

got it.

> > +			format = 0;
> > +			goto out_unlock;
> > +		}
> > +
> > +		tmp_format = format;
> > +	}
> > +	ret = 0;
> > +
> > +out_unlock:
> > +	if (format)
> if (!ret) ? then you can remove the format = 0 in case of error.

oh, yes.

> > +		*stage1_format = format;
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  					 struct vfio_info_cap *caps)
> >  {
> >  	struct vfio_info_cap_header *header;
> >  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > +	u32 formats = 0;
> > +	int ret;
> > +
> > +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> > +	if (ret) {
> > +		pr_warn("Failed to get stage-1 format\n");
> trace triggered by userspace to be removed?

sure.

> > +		return ret;
> > +	}
> >
> >  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> >  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct
> vfio_iommu *iommu,
> >  		/* nesting iommu type supports PASID requests (alloc/free) */
> >  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> What is the meaning for ARM?

I think it's just a software capability exposed to userspace; on the
userspace side, there is a choice to use it or not. :-) The reason to
define it and report it in the nesting cap is that I'd like to make
PASID alloc/free available only for IOMMUs of type
VFIO_IOMMU_TYPE1_NESTING. Please feel free to tell me if it is not
good for ARM. We can find a proper way to report the availability.

> >  	}
> > +	nesting_cap->stage1_formats = formats;
> as spotted by Kevin, since a single format is supported, rename

OK, I believed multiple formats might be possible on ARM or so. :-) Will
rename it.

I'll refine the patch per your above comments.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01 12:51     ` Liu, Yi L
@ 2020-04-01 13:01       ` Auger Eric
  2020-04-03  8:23         ` Jean-Philippe Brucker
  0 siblings, 1 reply; 110+ messages in thread
From: Auger Eric @ 2020-04-01 13:01 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson
  Cc: Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Yi,

On 4/1/20 2:51 PM, Liu, Yi L wrote:
> Hi Eric,
> 
>> From: Auger Eric <eric.auger@redhat.com>
>> Sent: Wednesday, April 1, 2020 4:51 PM
>> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com
>> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
>> userspace
>>
>> Hi Yi,
>> On 3/22/20 1:32 PM, Liu, Yi L wrote:
>>> From: Liu Yi L <yi.l.liu@intel.com>
>>>
>>> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
>>> capability to userspace. Thus applications like QEMU could support
>>> vIOMMU with hardware's nesting translation capability for pass-through
>>> devices. Before setting up nesting translation for pass-through devices,
>>> QEMU and other applications need to learn the supported 1st-lvl/stage-1
>>> translation structure format like page table format.
>>>
>>> Take vSVA (virtual Shared Virtual Addressing) as an example, to support
>>> vSVA for pass-through devices, QEMU setup nesting translation for pass-
>>> through devices. The guest page table are configured to host as 1st-lvl/
>>> stage-1 page table. Therefore, guest format should be compatible with
>>> host side.
>>>
>>> This patch reports the supported 1st-lvl/stage-1 page table format on the
>>> current platform to userspace. QEMU and other alike applications should
>>> use this format info when trying to setup IOMMU nesting translation on
>>> host IOMMU.
>>>
>>> Cc: Kevin Tian <kevin.tian@intel.com>
>>> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
>>> Cc: Alex Williamson <alex.williamson@redhat.com>
>>> Cc: Eric Auger <eric.auger@redhat.com>
>>> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 56
>> +++++++++++++++++++++++++++++++++++++++++
>>>  include/uapi/linux/vfio.h       |  1 +
>>>  2 files changed, 57 insertions(+)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 9aa2a67..82a9e0b 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct
>> vfio_iommu *iommu,
>>>  	return ret;
>>>  }
>>>
>>> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
>>> +					 u32 *stage1_format)
>> vfio_pasid_format() to be homogeneous with vfio_pgsize_bitmap() which
>> does the same kind of enumeration of the vfio_iommu domains
> 
> yes, similar.
> 
>>> +{
>>> +	struct vfio_domain *domain;
>>> +	u32 format = 0, tmp_format = 0;
>>> +	int ret;
>> ret = -EINVAL;
> 
> got it.
> 
>>> +
>>> +	mutex_lock(&iommu->lock);
>>> +	if (list_empty(&iommu->domain_list)) {
>> goto out_unlock;
> 
> right.
>>> +		mutex_unlock(&iommu->lock);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	list_for_each_entry(domain, &iommu->domain_list, next) {
>>> +		if (iommu_domain_get_attr(domain->domain,
>>> +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
>> I can find DOMAIN_ATTR_PASID_FORMAT in Jacob's v9 but not in v10
> 
> Oops, I guess he somehow missed it. You can find it at the link below.
> 
> https://github.com/luxis1999/linux-vsva/commit/bf14b11a12f74d58ad3ee626a5d891de395082eb
> 
>>> +			ret = -EINVAL;
>> could be removed
> 
> sure.
> 
>>> +			format = 0;
>>> +			goto out_unlock;
>>> +		}
>>> +		/*
>>> +		 * format is always non-zero (the first format is
>>> +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
>>> +		 * the reason of potential different backed IOMMU
>>> +		 * formats, here we expect to have identical formats
>>> +		 * in the domain list, no mixed formats support.
>>> +		 * return -EINVAL to fail the attempt of setup
>>> +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
>>> +		 * are detected.
>>> +		 */
>>> +		if (tmp_format && tmp_format != format) {
>>> +			ret = -EINVAL;
>> could be removed
> 
> got it.
> 
>>> +			format = 0;
>>> +			goto out_unlock;
>>> +		}
>>> +
>>> +		tmp_format = format;
>>> +	}
>>> +	ret = 0;
>>> +
>>> +out_unlock:
>>> +	if (format)
>> if (!ret) ? then you can remove the format = 0 in case of error.
> 
> oh, yes.
> 
>>> +		*stage1_format = format;
>>> +	mutex_unlock(&iommu->lock);
>>> +	return ret;
>>> +}
>>> +
>>>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>>>  					 struct vfio_info_cap *caps)
>>>  {
>>>  	struct vfio_info_cap_header *header;
>>>  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
>>> +	u32 formats = 0;
>>> +	int ret;
>>> +
>>> +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
>>> +	if (ret) {
>>> +		pr_warn("Failed to get stage-1 format\n");
>> trace triggered by userspace to be removed?
> 
> sure.
> 
>>> +		return ret;
>>> +	}
>>>
>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>>>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
>>> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct
>> vfio_iommu *iommu,
>>>  		/* nesting iommu type supports PASID requests (alloc/free) */
>>>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
>> What is the meaning for ARM?
> 
> I think it's just a software capability exposed to userspace; on the
> userspace side, there is a choice to use it or not. :-) The reason to
> define it and report it in the nesting cap is that I'd like to make
> PASID alloc/free available only for IOMMUs of type
> VFIO_IOMMU_TYPE1_NESTING. Please feel free to tell me if it is not
> good for ARM. We can find a proper way to report the availability.

Well, it is more a question for Jean-Philippe. Do we have a system-wide
PASID allocation on ARM?

Thanks

Eric
> 
>>>  	}
>>> +	nesting_cap->stage1_formats = formats;
>> as spotted by Kevin, since a single format is supported, rename
> 
> ok, I was believing it may be possible on ARM or so. :-) will
> rename it.
> 
> I'll refine the patch per your above comments.
> 
> Regards,
> Yi Liu
> 
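
For reference, the control flow Eric suggests above (publish the format only
when ret is 0, so the error path no longer needs "format = 0") could look
roughly like the sketch below. It is a stand-alone illustration, not the
patch itself: the array argument is a hypothetical stand-in for walking
iommu->domain_list, and treating an empty list as -EINVAL is an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the domain list walk: each array entry plays
 * the role of the stage-1 format reported by one domain.
 */
static int get_stage1_format(const uint32_t *domain_formats, size_t n,
			     uint32_t *stage1_format)
{
	uint32_t format = 0;
	uint32_t tmp_format = 0;
	int ret = -EINVAL;	/* assumption: empty or mixed lists fail */
	size_t i;

	for (i = 0; i < n; i++) {
		format = domain_formats[i];
		if (tmp_format && tmp_format != format)
			goto out;	/* mixed formats are not supported */
		tmp_format = format;
	}
	if (n)
		ret = 0;
out:
	if (!ret)	/* publish only on success, no need to zero format */
		*stage1_format = format;
	return ret;
}
```

With this shape the caller's output variable is left untouched whenever the
function fails.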



* RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-04-01  9:41   ` Auger Eric
@ 2020-04-01 13:13     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-01 13:13 UTC (permalink / raw)
  To: Auger Eric, alex.williamson
  Cc: Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Wednesday, April 1, 2020 5:41 PM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com
> Subject: Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to
> userspace
> 
> Yi,
> On 3/22/20 1:32 PM, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch reports PASID alloc/free availability to userspace (e.g.
> > QEMU) thus userspace could do a pre-check before utilizing this feature.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  8 ++++++++
> >  2 files changed, 36 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index e40afc0..ddd1ffe 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > +					 struct vfio_info_cap *caps)
> > +{
> > +	struct vfio_info_cap_header *header;
> > +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > +
> > +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > +	if (IS_ERR(header))
> > +		return PTR_ERR(header);
> > +
> > +	nesting_cap = container_of(header,
> > +				struct vfio_iommu_type1_info_cap_nesting,
> > +				header);
> > +
> > +	nesting_cap->nesting_capabilities = 0;
> > +	if (iommu->nesting) {
> > +		/* nesting iommu type supports PASID requests (alloc/free) */
> > +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> Supporting nesting does not necessarily mean supporting PASID.

Here I think of PASIDs as kernel-managed IDs that can be used to tag the
various address spaces provided by guest software. I think this is why we
introduced ioasid. :-) The current implementation does PASID alloc/free
in vfio.

> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		if (ret)
> >  			return ret;
> >
> > +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> > +		if (ret)
> > +			return ret;
> > +
> >  		if (caps.size) {
> >  			info.flags |= VFIO_IOMMU_INFO_CAPS;
> >
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 298ac80..8837219 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
> >  	struct	vfio_iova_range iova_ranges[];
> >  };
> >
> > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> > +
> > +struct vfio_iommu_type1_info_cap_nesting {
> > +	struct	vfio_info_cap_header header;
> > +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
> PASID_REQS sounds a bit far from the claimed host managed alloc/free
> capability.
> VFIO_IOMMU_SYSTEM_WIDE_PASID?

Oh, yep. I can rename it.

Regards,
Yi Liu
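
As an illustration of the userspace pre-check the commit message mentions,
the sketch below walks a capability chain and tests the PASID bit. It is
only a sketch under assumptions: the structs are local mirrors so it builds
stand-alone, the width of nesting_capabilities is assumed (the field's
declaration is cut off in the quote above), the capability ID and flag value
are the ones proposed in this patch, and the flag may be renamed as discussed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the capability header layout (id, version, next). */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next cap from buffer start, 0 = end */
};

#define CAP_NESTING	2		/* VFIO_IOMMU_TYPE1_INFO_CAP_NESTING */
#define PASID_REQS	(1u << 0)	/* VFIO_IOMMU_PASID_REQS */

struct cap_nesting {
	struct cap_header header;
	uint32_t nesting_capabilities;	/* field width is an assumption */
};

/* Return 1 if the chain advertises PASID alloc/free, 0 otherwise. */
static int pasid_reqs_supported(const uint8_t *buf, size_t size,
				uint32_t first_cap_offset)
{
	uint32_t off = first_cap_offset;
	size_t hops = 0;

	while (off && hops++ < size) {	/* hop bound guards against loops */
		struct cap_header hdr;

		if (off > size || size - off < sizeof(hdr))
			return 0;	/* malformed chain */
		memcpy(&hdr, buf + off, sizeof(hdr));

		if (hdr.id == CAP_NESTING) {
			struct cap_nesting cap;

			if (size - off < sizeof(cap))
				return 0;
			memcpy(&cap, buf + off, sizeof(cap));
			return !!(cap.nesting_capabilities & PASID_REQS);
		}
		off = hdr.next;
	}
	return 0;
}
```

A caller would pass the buffer filled in by VFIO_IOMMU_GET_INFO together with
the offset of the first capability reported in that buffer.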


* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-01  9:13     ` Liu, Yi L
@ 2020-04-02  2:12       ` Tian, Kevin
  2020-04-02  8:05         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-02  2:12 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 1, 2020 5:13 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Monday, March 30, 2020 8:46 PM
> > Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Sunday, March 22, 2020 8:32 PM
> > >
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> hardware
> > > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > > translation). For such hardware IOMMUs, there are two stages/levels of
> > > address translation, and software may let userspace/VM to own the
> > > first-
> > > level/stage-1 translation structures. Example of such usage is vSVA (
> > > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > > translation structures and bind the structures to host, then hardware
> > > IOMMU would utilize nesting translation when doing DMA translation fo
> > > the devices behind such hardware IOMMU.
> > >
> > > This patch adds vfio support for binding guest translation (a.k.a
> > > stage 1) structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU,
> > > not only bind guest page table is needed, it also requires to expose
> > > interface to guest for iommu cache invalidation when guest modified
> > > the first-level/stage-1 translation structures since hardware needs to
> > > be notified to flush stale iotlbs. This would be introduced in next
> > > patch.
> > >
> > > In this patch, guest page table bind and unbind are done by using
> > > flags VFIO_IOMMU_BIND_GUEST_PGTBL and
> > VFIO_IOMMU_UNBIND_GUEST_PGTBL
> > > under IOCTL VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > struct iommu_gpasid_bind_data. Before binding guest page table to
> > > host, VM should have got a PASID allocated by host via
> > > VFIO_IOMMU_PASID_REQUEST.
> > >
> > > Bind guest translation structures (here is guest page table) to host
> >
> > Bind -> Binding
> got it.
> > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> >
> > are -> is. and you already explained vSVA earlier.
> oh yes, it is.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 158
> > > ++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> > >  2 files changed, 204 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c index 82a9e0b..a877747 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > >  					(!list_empty(&iommu->domain_list))
> > >
> > > +struct domain_capsule {
> > > +	struct iommu_domain *domain;
> > > +	void *data;
> > > +};
> > > +
> > > +/* iommu->lock must be held */
> > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > +		      int (*fn)(struct device *dev, void *data),
> > > +		      void *data)
> > > +{
> > > +	struct domain_capsule dc = {.data = data};
> > > +	struct vfio_domain *d;
> > > +	struct vfio_group *g;
> > > +	int ret = 0;
> > > +
> > > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > > +		dc.domain = d->domain;
> > > +		list_for_each_entry(g, &d->group_list, next) {
> > > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > > +						       &dc, fn);
> > > +			if (ret)
> > > +				break;
> > > +		}
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > >  static int put_pfn(unsigned long pfn, int prot);
> > >
> > >  /*
> > > @@ -2314,6 +2341,88 @@ static int
> > > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > >  	return 0;
> > >  }
> > >
> > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data) {
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> >
> > In Jacob's vSVA iommu series, [PATCH 06/11]:
> >
> > +		/* REVISIT: upper layer/VFIO can track host process that bind
> the
> > PASID.
> > +		 * ioasid_set = mm might be sufficient for vfio to check pasid
> VMM
> > +		 * ownership.
> > +		 */
> >
> > I asked him who exactly should be responsible for tracking the pasid
> > ownership. Although no response yet, I expect vfio/iommu can have a clear
> > policy and also documented here to provide consistent message.
> 
> yep.
> 
> > > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data); }
> > > +
> > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data) {
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > +					gbind_data->hpasid);
> >
> > curious why we have to share the same bind_data structure between bind
> > and unbind, especially when unbind requires only one field? I didn't see a
> > clear reason, and just similar to earlier ALLOC/FREE which don't share
> > structure either. Current way simply wastes space for unbind operation...
> 
> no special reason today. But the gIOVA support over nested translation
> is in plan, it may require a flag to indicate it as guest iommu driver
> may user a single PASID value(RID2PASID) for all devices in guest side.
> Especially if the RID2PASID value used for IOVA the the same with host
> side. So adding a flag to indicate the binding is for IOVA is helpful.
> For PF/VF, iommu driver just bind with the host side's RID2PASID. While
> for ADI (Assignable Device Interface),  vfio layer needs to figure out
> the default PASID stored in the aux-domain, and then iommu driver bind
> gIOVA table to the default PASID. The potential flag is required in both
> bind and unbind path. As such, it would be better to share the structure.

I'm fine with it if you are pretty sure that more extensions will be required
in the future, though I didn't fully understand the above explanation. 😊

> 
> > > +}
> > > +
> > > +/**
> > > + * Unbind specific gpasid, caller of this function requires hold
> > > + * vfio_iommu->lock
> > > + */
> > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > > *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> {
> > > +	return vfio_iommu_for_each_dev(iommu,
> > > +				vfio_unbind_gpasid_fn, gbind_data); }
> > > +
> > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> {
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_for_each_dev(iommu,
> > > +			vfio_bind_gpasid_fn, gbind_data);
> > > +	/*
> > > +	 * If bind failed, it may not be a total failure. Some devices
> > > +	 * within the iommu group may have bind successfully. Although
> > > +	 * we don't enable pasid capability for non-singletion iommu
> > > +	 * groups, a unbind operation would be helpful to ensure no
> > > +	 * partial binding for an iommu group.
> > > +	 */
> > > +	if (ret)
> > > +		/*
> > > +		 * Undo all binds that already succeeded, no need to
> >
> > binds -> bindings
> got it.
> >
> > > +		 * check the return value here since some device within
> > > +		 * the group has no successful bind when coming to this
> > > +		 * place switch.
> > > +		 */
> >
> > remove 'switch'
> oh, yes.
> 
> >
> > > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu
> *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> {
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  		default:
> > >  			return -EINVAL;
> > >  		}
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> >
> > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> 
> Emm, it's up to the flags to indicate bind what. It was proposed to
> cover the three cases below:
> a) BIND/UNBIND_GPASID
> b) BIND/UNBIND_GPASID_TABLE
> c) BIND/UNBIND_PROCESS
> <only a) is covered in this patch>
> So it's called VFIO_IOMMU_BIND.

but aren't they all about PASID related binding?

> 
> >
> > > +		struct vfio_iommu_type1_bind bind;
> > > +		u32 version;
> > > +		int data_size;
> > > +		void *gbind_data;
> > > +		int ret;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > > +
> > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (bind.argsz < minsz)
> > > +			return -EINVAL;
> > > +
> > > +		/* Get the version of struct iommu_gpasid_bind_data */
> > > +		if (copy_from_user(&version,
> > > +			(void __user *) (arg + minsz),
> > > +					sizeof(version)))
> > > +			return -EFAULT;
> > > +
> > > +		data_size = iommu_uapi_get_data_size(
> > > +				IOMMU_UAPI_BIND_GPASID, version);
> > > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > > +		if (!gbind_data)
> > > +			return -ENOMEM;
> > > +
> > > +		if (copy_from_user(gbind_data,
> > > +			 (void __user *) (arg + minsz), data_size)) {
> > > +			kfree(gbind_data);
> > > +			return -EFAULT;
> > > +		}
> > > +
> > > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > > +							   gbind_data);
> > > +			break;
> > > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > > +							     gbind_data);
> > > +			break;
> > > +		default:
> > > +			ret = -EINVAL;
> > > +			break;
> > > +		}
> > > +		kfree(gbind_data);
> > > +		return ret;
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ebeaf3e..2235bc6 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -14,6 +14,7 @@
> > >
> > >  #include <linux/types.h>
> > >  #include <linux/ioctl.h>
> > > +#include <linux/iommu.h>
> > >
> > >  #define VFIO_API_VERSION	0
> > >
> > > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> > >   */
> > >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE +
> > > 22)
> > >
> > > +/**
> > > + * Supported flags:
> > > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host
> > > for
> > > + *			nesting type IOMMUs. In @data field It takes struct
> > > + *			iommu_gpasid_bind_data.
> > > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page
> > > table operation
> > > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > > + *
> > > + */
> > > +struct vfio_iommu_type1_bind {
> > > +	__u32		argsz;
> > > +	__u32		flags;
> > > +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> > > +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> > > +	__u8		data[];
> > > +};
> > > +
> > > +#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
> > > +					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> > > + *				struct vfio_iommu_type1_bind)
> > > + *
> > > + * Manage address spaces of devices in this container. Initially a
> > > +TYPE1
> > > + * container can only have one address space, managed with
> > > + * VFIO_IOMMU_MAP/UNMAP_DMA.
> >
> > the last sentence seems irrelevant and more suitable in commit msg.
> 
> oh, I could remove it.
> 
> > > + *
> > > + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed
> by
> > > both MAP/UNMAP
> > > + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2
> > > + (host)
> > > page
> > > + * tables, and BIND manages the stage-1 (guest) page tables. Other
> > > + types of
> >
> > Are "other types" the counterpart to VFIO_TYPE1_NESTING_IOMMU?
> > What are those types? I thought only NESTING_IOMMU allows two stage
> > translation...
> 
> it's a mistake... please ignore this message. would correct it in next version.
> 
> >
> > > + * IOMMU may allow MAP/UNMAP and BIND to coexist, where
> >
> > The first sentence said the same thing. Then what is the exact difference?
> 
> this sentence were added by mistake. will correct it.
> 
> >
> > > MAP/UNMAP controls
> > > + * the traffics only require single stage translation while BIND
> > > + controls the
> > > + * traffics require nesting translation. But this depends on the
> > > + underlying
> > > + * IOMMU architecture and isn't guaranteed. Example of this is the
> > > + guest
> > > SVA
> > > + * traffics, such traffics need nesting translation to gain gVA->gPA
> > > + and then
> > > + * gPA->hPA translation.
> >
> > I'm a bit confused about the content since "other types of". Are they
> > trying to state some exceptions/corner cases that this API cannot resolve
> > or explain the desired behavior of the API? Especially the last example,
> > which is worded as if the example for "isn't guaranteed"
> > but isn't guest SVA the main purpose of this API?
> >
> I think the description in original patch is bad especially with the "other
> types"
> phrase. How about the below description?
> 
> /**
>  * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
>  *				struct vfio_iommu_type1_bind)
>  *
>  * Manage address spaces of devices in this container when it's an IOMMU
>  * of type VFIO_TYPE1_NESTING_IOMMU. This type of IOMMU allows MAP/UNMAP
>  * and BIND to coexist, where MAP/UNMAP controls traffic that requires only
>  * single-stage translation while BIND controls traffic that requires
>  * nesting translation.
>  *
>  * Availability of this feature depends on the device, its bus, the underlying
>  * IOMMU and the CPU architecture.
>  *
>  * returns: 0 on success, -errno on failure.
>  */
> 
> Regards,
> Yi Liu

yes, this looks better.

Thanks
Kevin
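
The bind path quoted above tries every device and, on any failure, unbinds
across the whole container while ignoring the unbind result. A minimal
stand-alone sketch of that all-or-nothing pattern follows. The fake_dev type
and its fail_bind knob are hypothetical and exist only for illustration; the
real code calls iommu_sva_bind_gpasid()/iommu_sva_unbind_gpasid() per device.

```c
#include <stddef.h>

/* Hypothetical device model used only for this illustration. */
struct fake_dev {
	int bound;	/* 1 once a guest page table is bound */
	int fail_bind;	/* knob: make bind fail on this device */
};

static int bind_one(struct fake_dev *dev)
{
	if (dev->fail_bind)
		return -1;
	dev->bound = 1;
	return 0;
}

static void unbind_one(struct fake_dev *dev)
{
	dev->bound = 0;	/* harmless on a device that never bound */
}

/* Bind every device, or leave none bound if any bind fails. */
static int bind_all(struct fake_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = bind_one(&devs[i]);
		if (ret)
			break;	/* stop at the first failure */
	}
	if (ret) {
		/* Undo everything; the unbind result is not checked. */
		for (i = 0; i < n; i++)
			unbind_one(&devs[i]);
	}
	return ret;
}
```

The sketch stops iterating at the first error so that the error code returned
to the caller is the one that triggered the rollback.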



* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-02  2:12       ` Tian, Kevin
@ 2020-04-02  8:05         ` Liu, Yi L
  2020-04-03  8:34           ` Jean-Philippe Brucker
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-02  8:05 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, eric.auger
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, April 2, 2020 10:12 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 1, 2020 5:13 PM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Monday, March 30, 2020 8:46 PM
> > > Subject: RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to
> > > host
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > hardware
> > > > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > > > translation). For such hardware IOMMUs, there are two
> > > > stages/levels of address translation, and software may let
> > > > userspace/VM to own the
> > > > first-
> > > > level/stage-1 translation structures. Example of such usage is
> > > > vSVA ( virtual Shared Virtual Addressing). VM owns the
> > > > first-level/stage-1 translation structures and bind the structures
> > > > to host, then hardware IOMMU would utilize nesting translation
> > > > when doing DMA translation fo the devices behind such hardware IOMMU.
> > > >
> > > > This patch adds vfio support for binding guest translation (a.k.a
> > > > stage 1) structure to host iommu. And for
> > > > VFIO_TYPE1_NESTING_IOMMU, not only bind guest page table is
> > > > needed, it also requires to expose interface to guest for iommu
> > > > cache invalidation when guest modified the first-level/stage-1
> > > > translation structures since hardware needs to be notified to
> > > > flush stale iotlbs. This would be introduced in next patch.
> > > >
> > > > In this patch, guest page table bind and unbind are done by using
> > > > flags VFIO_IOMMU_BIND_GUEST_PGTBL and
> > > VFIO_IOMMU_UNBIND_GUEST_PGTBL
> > > > under IOCTL VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > > struct iommu_gpasid_bind_data. Before binding guest page table to
> > > > host, VM should have got a PASID allocated by host via
> > > > VFIO_IOMMU_PASID_REQUEST.
> > > >
> > > > Bind guest translation structures (here is guest page table) to
> > > > host
> > >
> > > Bind -> Binding
> > got it.
> > > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > >
> > > are -> is. and you already explained vSVA earlier.
> > oh yes, it is.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 158
> > > > ++++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> > > >  2 files changed, 204 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > b/drivers/vfio/vfio_iommu_type1.c index 82a9e0b..a877747 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > > >  					(!list_empty(&iommu->domain_list))
> > > >
> > > > +struct domain_capsule {
> > > > +	struct iommu_domain *domain;
> > > > +	void *data;
> > > > +};
> > > > +
> > > > +/* iommu->lock must be held */
> > > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > > +		      int (*fn)(struct device *dev, void *data),
> > > > +		      void *data)
> > > > +{
> > > > +	struct domain_capsule dc = {.data = data};
> > > > +	struct vfio_domain *d;
> > > > +	struct vfio_group *g;
> > > > +	int ret = 0;
> > > > +
> > > > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > > > +		dc.domain = d->domain;
> > > > +		list_for_each_entry(g, &d->group_list, next) {
> > > > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > > > +						       &dc, fn);
> > > > +			if (ret)
> > > > +				break;
> > > > +		}
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static int put_pfn(unsigned long pfn, int prot);
> > > >
> > > >  /*
> > > > @@ -2314,6 +2341,88 @@ static int
> > > > vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > > >  	return 0;
> > > >  }
> > > >
> > > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data) {
> > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > > +
> > >
> > > In Jacob's vSVA iommu series, [PATCH 06/11]:
> > >
> > > +		/* REVISIT: upper layer/VFIO can track host process that bind
> > the
> > > PASID.
> > > +		 * ioasid_set = mm might be sufficient for vfio to check pasid
> > VMM
> > > +		 * ownership.
> > > +		 */
> > >
> > > I asked him who exactly should be responsible for tracking the pasid
> > > ownership. Although no response yet, I expect vfio/iommu can have a
> > > clear policy and also documented here to provide consistent message.
> >
> > yep.
> >
> > > > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data); }
> > > > +
> > > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data) {
> > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > > +
> > > > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > > +					gbind_data->hpasid);
> > >
> > > curious why we have to share the same bind_data structure between
> > > bind and unbind, especially when unbind requires only one field? I
> > > didn't see a clear reason, and just similar to earlier ALLOC/FREE
> > > which don't share structure either.
> > > Current way simply wastes space for unbind operation...
> >
> > no special reason today. But the gIOVA support over nested translation
> > is in plan, it may require a flag to indicate it as guest iommu driver
> > may user a single PASID value(RID2PASID) for all devices in guest side.
> > Especially if the RID2PASID value used for IOVA the the same with host
> > side. So adding a flag to indicate the binding is for IOVA is helpful.
> > For PF/VF, iommu driver just bind with the host side's RID2PASID.
> > While for ADI (Assignable Device Interface),  vfio layer needs to
> > figure out the default PASID stored in the aux-domain, and then iommu
> > driver bind gIOVA table to the default PASID. The potential flag is
> > required in both bind and unbind path. As such, it would be better to share the
> structure.
> 
> I'm fine with it if you are pretty sure that more extension will be required in the
> future, though I didn't fully understand above explanations. 😊
> 
> >
> > > > +}
> > > > +
> > > > +/**
> > > > + * Unbind specific gpasid, caller of this function requires hold
> > > > + * vfio_iommu->lock
> > > > + */
> > > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu
> > > > *iommu,
> > > > +				struct iommu_gpasid_bind_data *gbind_data)
> > {
> > > > +	return vfio_iommu_for_each_dev(iommu,
> > > > +				vfio_unbind_gpasid_fn, gbind_data); }
> > > > +
> > > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > > +				struct iommu_gpasid_bind_data *gbind_data)
> > {
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = vfio_iommu_for_each_dev(iommu,
> > > > +			vfio_bind_gpasid_fn, gbind_data);
> > > > +	/*
> > > > +	 * If bind failed, it may not be a total failure. Some devices
> > > > +	 * within the iommu group may have bind successfully. Although
> > > > +	 * we don't enable pasid capability for non-singletion iommu
> > > > +	 * groups, a unbind operation would be helpful to ensure no
> > > > +	 * partial binding for an iommu group.
> > > > +	 */
> > > > +	if (ret)
> > > > +		/*
> > > > +		 * Undo all binds that already succeeded, no need to
> > >
> > > binds -> bindings
> > got it.
> > >
> > > > +		 * check the return value here since some device within
> > > > +		 * the group has no successful bind when coming to this
> > > > +		 * place switch.
> > > > +		 */
> > >
> > > remove 'switch'
> > oh, yes.
> >
> > >
> > > > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu
> > *iommu,
> > > > +				struct iommu_gpasid_bind_data *gbind_data)
> > {
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  		default:
> > > >  			return -EINVAL;
> > > >  		}
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > >
> > > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> >
> > Emm, it's up to the flags to indicate bind what. It was proposed to
> > cover the three cases below:
> > a) BIND/UNBIND_GPASID
> > b) BIND/UNBIND_GPASID_TABLE
> > c) BIND/UNBIND_PROCESS
> > <only a) is covered in this patch>
> > So it's called VFIO_IOMMU_BIND.
> 
> but aren't they all about PASID related binding?

yeah, I can rename it. :-)

Regards,
Yi Liu
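
To make the flag handling discussed above concrete, here is a small
stand-alone sketch of how the argsz check and the flags switch from the
quoted ioctl hunk decide between bind and unbind. The macro values mirror
the ones proposed in this patch; the enum and the function name are
hypothetical helpers introduced only for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define BIND_GUEST_PGTBL	(1u << 0)	/* VFIO_IOMMU_BIND_GUEST_PGTBL */
#define UNBIND_GUEST_PGTBL	(1u << 1)	/* VFIO_IOMMU_UNBIND_GUEST_PGTBL */
#define BIND_MASK		(BIND_GUEST_PGTBL | UNBIND_GUEST_PGTBL)

enum bind_op { OP_BIND, OP_UNBIND };	/* hypothetical, for illustration */

/* Validate the header and pick the operation selected by the flags. */
static int decode_bind_flags(uint32_t argsz, uint32_t minsz, uint32_t flags,
			     enum bind_op *op)
{
	if (argsz < minsz)
		return -EINVAL;

	switch (flags & BIND_MASK) {
	case BIND_GUEST_PGTBL:
		*op = OP_BIND;
		return 0;
	case UNBIND_GUEST_PGTBL:
		*op = OP_UNBIND;
		return 0;
	default:	/* neither flag set, or both set at once */
		return -EINVAL;
	}
}
```

Because the switch compares the masked value against each flag exactly, a
request that sets both flags falls into the default case and is rejected.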



* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
                     ` (2 preceding siblings ...)
  2020-03-31  7:53   ` Christoph Hellwig
@ 2020-04-02 13:52   ` Jean-Philippe Brucker
  2020-04-03 11:56     ` Liu, Yi L
  2020-04-02 17:50   ` Alex Williamson
  4 siblings, 1 reply; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-02 13:52 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, eric.auger, kevin.tian, jacob.jun.pan, joro,
	ashok.raj, jun.j.tian, yi.y.sun, peterx, iommu, kvm,
	linux-kernel, hao.wu

Hi Yi,

On Sun, Mar 22, 2020 at 05:31:58AM -0700, Liu, Yi L wrote:
> From: Liu Yi L <yi.l.liu@intel.com>
> 
> For a long time, devices have had only one DMA address space from the
> platform IOMMU's point of view. This is true for both bare metal and
> directed access in virtualization environments. The reason is that the
> source ID of DMA in PCIe is the BDF (bus/dev/fnc ID), which results in
> only device-granularity DMA isolation. However, this is changing with the
> latest advancements in I/O technology. More and more platform vendors are
> utilizing the PCIe PASID TLP prefix in DMA requests, giving devices
> multiple DMA address spaces identified by their individual PASIDs. For
> example, Shared Virtual Addressing (SVA, a.k.a. Shared Virtual Memory)
> lets a device access multiple process virtual address spaces by binding
> each virtual address space to a PASID. The PASID is allocated in software
> and programmed into the device in a device-specific manner. Devices that
> support the PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software is also able to bind
> guest process virtual address spaces on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASIDs and program them to the
> device directly. This is a dangerous situation since it has potential
> PASID conflicts and unauthorized address space access.

It's worth noting that this applies to Intel VT-d with scalable mode, not
to IOMMUs that use one PASID space per VM.

> It would be safer to let the host intercept the guest software's PASID
> allocation, so that PASIDs are managed system-wide.
> 
> This patch adds the VFIO_IOMMU_PASID_REQUEST ioctl, which passes down
> PASID allocation/free requests from the virtual IOMMU. Because such
> requests are intended to be invoked by QEMU or other applications running
> in userspace, a mechanism is needed to prevent a single application from
> abusing the PASIDs available in the system. With that in mind, this patch
> tracks VFIO PASID allocation per VM. There was a discussion about making
> the quota per assigned device, e.g. if a VM has many assigned devices,
> then it should have more quota. However, it is not clear how many PASIDs
> an assigned device will use, e.g. a VM with multiple assigned devices may
> still request fewer PASIDs. Therefore a per-VM quota would be better.
> 
> This patch uses the struct mm pointer as a per-VM token. We also
> considered using the task structure pointer and the vfio_iommu structure
> pointer. However, the task structure is per-thread, which means it cannot
> serve per-VM PASID allocation tracking, while the vfio_iommu structure is
> visible only within vfio. Therefore, the mm structure pointer is selected.
> This patch adds a structure vfio_mm. A vfio_mm is created when the first
> vfio container is opened by a VM. Conversely, the vfio_mm is freed when
> the last vfio container is released. Each VM is assigned a PASID quota, so
> that it is not able to request PASIDs beyond its quota. This patch adds a
> default quota of 1000. This quota could be tuned by the administrator.
> Making the PASID quota tunable will be added in another patch in this
> series.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
>  include/linux/vfio.h            |  20 +++++++
>  include/uapi/linux/vfio.h       |  41 +++++++++++++
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/sched/mm.h>
>  
>  #define DRIVER_VERSION	"0.3"
>  #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> @@ -46,6 +47,8 @@ static struct vfio {
>  	struct mutex			group_lock;
>  	struct cdev			group_cdev;
>  	dev_t				group_devt;
> +	struct list_head		vfio_mm_list;
> +	struct mutex			vfio_mm_lock;
>  	wait_queue_head_t		release_q;
>  } vfio;
>  
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
>  
>  /**
> + * VFIO_MM objects - create, release, get, put, search
> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> +	struct vfio_mm *vmm;
> +	struct vfio_mm_token *token;
> +	int ret = 0;
> +
> +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> +	if (!vmm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Per mm IOASID set used for quota control and group operations */
> +	ret = ioasid_alloc_set((struct ioasid_set *) mm,

Hmm, either we need to change the token of ioasid_alloc_set() to "void *",
or pass an actual ioasid_set struct, but this cast doesn't look good :)

As I commented on the IOASID series, I think we could embed a struct
ioasid_set into vfio_mm, pass that struct to all other ioasid_* functions,
and get rid of ioasid_sid.

> +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> +	if (ret) {
> +		kfree(vmm);
> +		return ERR_PTR(ret);
> +	}
> +
> +	kref_init(&vmm->kref);
> +	token = &vmm->token;
> +	token->val = mm;

Why the intermediate token struct?  Could we just store the mm_struct
pointer within vfio_mm?

Thanks,
Jean
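
A minimal userspace sketch of the two suggestions above, under stated
assumptions: struct ioasid_set below is a placeholder rather than the real
definition from the IOASID series, and vfio_mm_sketch and its helpers are
hypothetical names. It only shows the shape of the idea: the set is embedded
in the per-mm object, the mm pointer is stored directly, and the owner is
recovered from the set without any cast of the mm pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types; hypothetical, for illustration only. */
struct mm_struct { int dummy; };
struct ioasid_set { int quota; };

/* Hypothetical layout: no token struct and no integer set ID. */
struct vfio_mm_sketch {
	struct mm_struct *mm;    /* replaces struct vfio_mm_token */
	struct ioasid_set set;   /* replaces ioasid_sid and the mm cast */
	int pasid_quota;
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void vfio_mm_sketch_init(struct vfio_mm_sketch *vmm,
				struct mm_struct *mm, int quota)
{
	assert(vmm && mm);
	vmm->mm = mm;
	vmm->set.quota = quota;
	vmm->pasid_quota = quota;
}

/* Lookup compares pointers directly, with no integer conversion. */
static int vfio_mm_sketch_matches(const struct vfio_mm_sketch *vmm,
				  const struct mm_struct *mm)
{
	return vmm->mm == mm;
}

/* Code that is handed the embedded set can recover its owner. */
static struct vfio_mm_sketch *vfio_mm_sketch_from_set(struct ioasid_set *set)
{
	return container_of_sketch(set, struct vfio_mm_sketch, set);
}
```

Whether the real ioasid_* functions would take such a pointer depends on how
the IOASID series evolves; the sketch does not assume any particular
signature.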

> +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +	mutex_init(&vmm->pasid_lock);
> +
> +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> +
> +	return vmm;
> +}


* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
                     ` (3 preceding siblings ...)
  2020-04-02 13:52   ` Jean-Philippe Brucker
@ 2020-04-02 17:50   ` Alex Williamson
  2020-04-03  5:58     ` Tian, Kevin
  2020-04-03 13:12     ` Liu, Yi L
  4 siblings, 2 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 17:50 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, iommu, kvm,
	linux-kernel, hao.wu

On Sun, 22 Mar 2020 05:31:58 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@intel.com>
> 
> For a long time, devices have had only one DMA address space from the
> platform IOMMU's point of view. This is true for both bare metal and
> directed access in virtualization environments. The reason is that the
> source ID of DMA in PCIe is the BDF (bus/dev/fnc ID), which results in
> only device-granularity DMA isolation. However, this is changing with the
> latest advancements in I/O technology. More and more platform vendors are
> utilizing the PCIe PASID TLP prefix in DMA requests, giving devices
> multiple DMA address spaces identified by their individual PASIDs. For
> example, Shared Virtual Addressing (SVA, a.k.a. Shared Virtual Memory)
> lets a device access multiple process virtual address spaces by binding
> each virtual address space to a PASID. The PASID is allocated in software
> and programmed into the device in a device-specific manner. Devices that
> support the PASID capability are called PASID-capable devices. If such
> devices are passed through to VMs, guest software is also able to bind
> guest process virtual address spaces on such devices. Therefore, the guest
> software could reuse the bare metal software programming model, which
> means guest software will also allocate PASIDs and program them to the
> device directly. This is a dangerous situation since it has potential
> PASID conflicts and unauthorized address space access. It would be safer
> to let the host intercept the guest software's PASID allocation, so that
> PASIDs are managed system-wide.

Providing an allocation interface only allows for collaborative usage
of PASIDs though.  Do we have any ability to enforce PASID usage or can
a user spoof other PASIDs on the same BDF?

> This patch adds the VFIO_IOMMU_PASID_REQUEST ioctl, which passes down
> PASID allocation/free requests from the virtual IOMMU. Because such
> requests are intended to be invoked by QEMU or other applications running
> in userspace, a mechanism is needed to prevent a single application from
> abusing the PASIDs available in the system. With that in mind, this patch
> tracks VFIO PASID allocation per VM. There was a discussion about making
> the quota per assigned device, e.g. if a VM has many assigned devices,
> then it should have more quota. However, it is not clear how many PASIDs
> an assigned device will use, e.g. a VM with multiple assigned devices may
> still request fewer PASIDs. Therefore a per-VM quota would be better.
> 
> This patch uses the struct mm pointer as a per-VM token. We also
> considered using the task structure pointer and the vfio_iommu structure
> pointer. However, the task structure is per-thread, which means it cannot
> serve per-VM PASID allocation tracking, while the vfio_iommu structure is
> visible only within vfio. Therefore, the mm structure pointer is selected.
> This patch adds a structure vfio_mm. A vfio_mm is created when the first
> vfio container is opened by a VM. Conversely, the vfio_mm is freed when
> the last vfio container is released. Each VM is assigned a PASID quota, so
> that it is not able to request PASIDs beyond its quota. This patch adds a
> default quota of 1000. This quota could be tuned by the administrator.
> Making the PASID quota tunable will be added in another patch in this
> series.
> 
> Previous discussions:
> https://patchwork.kernel.org/patch/11209429/
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
>  include/linux/vfio.h            |  20 +++++++
>  include/uapi/linux/vfio.h       |  41 +++++++++++++
>  4 files changed, 295 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index c848262..d13b483 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/sched/mm.h>
>  
>  #define DRIVER_VERSION	"0.3"
>  #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> @@ -46,6 +47,8 @@ static struct vfio {
>  	struct mutex			group_lock;
>  	struct cdev			group_cdev;
>  	dev_t				group_devt;
> +	struct list_head		vfio_mm_list;
> +	struct mutex			vfio_mm_lock;
>  	wait_queue_head_t		release_q;
>  } vfio;
>  
> @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
>  EXPORT_SYMBOL(vfio_unregister_notifier);
>  
>  /**
> + * VFIO_MM objects - create, release, get, put, search
> + * Caller of the function should have held vfio.vfio_mm_lock.
> + */
> +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> +{
> +	struct vfio_mm *vmm;
> +	struct vfio_mm_token *token;
> +	int ret = 0;
> +
> +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> +	if (!vmm)
> +		return ERR_PTR(-ENOMEM);
> +
> +	/* Per mm IOASID set used for quota control and group operations */
> +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> +	if (ret) {
> +		kfree(vmm);
> +		return ERR_PTR(ret);
> +	}
> +
> +	kref_init(&vmm->kref);
> +	token = &vmm->token;
> +	token->val = mm;
> +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> +	mutex_init(&vmm->pasid_lock);
> +
> +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> +
> +	return vmm;
> +}
> +
> +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> +{
> +	/* destroy the ioasid set */
> +	ioasid_free_set(vmm->ioasid_sid, true);
> +	mutex_unlock(&vfio.vfio_mm_lock);
> +	kfree(vmm);
> +}
> +
> +/* called with vfio.vfio_mm_lock held */
> +static void vfio_mm_release(struct kref *kref)
> +{
> +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> +
> +	list_del(&vmm->vfio_next);
> +	vfio_mm_unlock_and_free(vmm);
> +}
> +
> +void vfio_mm_put(struct vfio_mm *vmm)
> +{
> +	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_put);
> +
> +/* Assume vfio_mm_lock or vfio_mm reference is held */
> +static void vfio_mm_get(struct vfio_mm *vmm)
> +{
> +	kref_get(&vmm->kref);
> +}
> +
> +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> +{
> +	struct mm_struct *mm = get_task_mm(task);
> +	struct vfio_mm *vmm;
> +	unsigned long long val = (unsigned long long) mm;
> +
> +	mutex_lock(&vfio.vfio_mm_lock);
> +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> +		if (vmm->token.val == val) {
> +			vfio_mm_get(vmm);
> +			goto out;
> +		}
> +	}
> +
> +	vmm = vfio_create_mm(mm);
> +	if (IS_ERR(vmm))
> +		vmm = NULL;
> +out:
> +	mutex_unlock(&vfio.vfio_mm_lock);
> +	mmput(mm);
> +	return vmm;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> +
> +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> +{
> +	ioasid_t pasid;
> +	int ret = -ENOSPC;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +
> +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> +	if (pasid == INVALID_IOASID) {
> +		ret = -ENOSPC;
> +		goto out_unlock;
> +	}
> +
> +	ret = pasid;
> +out_unlock:
> +	mutex_unlock(&vmm->pasid_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> +
> +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> +{
> +	void *pdata;
> +	int ret = 0;
> +
> +	mutex_lock(&vmm->pasid_lock);
> +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> +	if (IS_ERR(pdata)) {
> +		ret = PTR_ERR(pdata);
> +		goto out_unlock;
> +	}
> +	ioasid_free(pasid);
> +
> +out_unlock:
> +	mutex_unlock(&vmm->pasid_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> +
> +/**
>   * Module/class support
>   */
>  static char *vfio_devnode(struct device *dev, umode_t *mode)
> @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
>  	idr_init(&vfio.group_idr);
>  	mutex_init(&vfio.group_lock);
>  	mutex_init(&vfio.iommu_drivers_lock);
> +	mutex_init(&vfio.vfio_mm_lock);
>  	INIT_LIST_HEAD(&vfio.group_list);
>  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
>  	init_waitqueue_head(&vfio.release_q);
>  
>  	ret = misc_register(&vfio_dev);

Is vfio.c the right place for any of the above?  It seems like it could
all be in a separate vfio_pasid module, similar to our virqfd module.

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index a177bf2..331ceee 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -70,6 +70,7 @@ struct vfio_iommu {
>  	unsigned int		dma_avail;
>  	bool			v2;
>  	bool			nesting;
> +	struct vfio_mm		*vmm;
>  };
>  
>  struct vfio_domain {
> @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  static void *vfio_iommu_type1_open(unsigned long arg)
>  {
>  	struct vfio_iommu *iommu;
> +	struct vfio_mm *vmm = NULL;
>  
>  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
>  	if (!iommu)
> @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	iommu->dma_avail = dma_entry_limit;
>  	mutex_init(&iommu->lock);
>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> +	vmm = vfio_mm_get_from_task(current);
> +	if (!vmm)
> +		pr_err("Failed to get vfio_mm track\n");

Doesn't this presume everyone is instantly running PASID capable hosts?
Looks like a noisy support regression to me.

> +	iommu->vmm = vmm;
>  
>  	return iommu;
>  }
> @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  	}
>  
>  	vfio_iommu_iova_free(&iommu->iova_list);
> +	if (iommu->vmm)
> +		vfio_mm_put(iommu->vmm);
>  
>  	kfree(iommu);
>  }
> @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> +{
> +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> +		  flags & VFIO_IOMMU_PASID_FREE));
> +}
> +
> +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> +					 int min,
> +					 int max)
> +{
> +	struct vfio_mm *vmm = iommu->vmm;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}

Non-iommu backed mdevs are excluded from this?  Is this a matter of
wiring the call out through the mdev parent device, or is this just
not possible?

> +	if (vmm)
> +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> +	else
> +		ret = -EINVAL;
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> +				       unsigned int pasid)
> +{
> +	struct vfio_mm *vmm = iommu->vmm;
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EFAULT;
> +		goto out_unlock;
> +	}

So if a container had an iommu backed device when the pasid was
allocated, but it was removed, now they can't free it?  Why do we need
the check above?

> +
> +	if (vmm)
> +		ret = vfio_mm_pasid_free(vmm, pasid);
> +	else
> +		ret = -EINVAL;
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
>  			-EFAULT : 0;
> +
> +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> +		struct vfio_iommu_type1_pasid_request req;
> +		unsigned long offset;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> +				    flags);
> +
> +		if (copy_from_user(&req, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (req.argsz < minsz ||
> +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> +			return -EINVAL;
> +
> +		if (copy_from_user((void *)&req + minsz,
> +				   (void __user *)arg + minsz,
> +				   sizeof(req) - minsz))
> +			return -EFAULT;

Huh?  Why do we have argsz if we're going to assume this is here?
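
One way to honour argsz, shown as a minimal userspace sketch rather than a
proposed replacement for the hunk above: compute the size each flag
combination requires and reject requests whose argsz does not cover it before
copying the payload. The struct mirrors the layout proposed in this patch;
memcpy() stands in for copy_from_user(), and every name below
(pasid_request_size, pasid_request_copy_in, offsetofend_sketch) is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PASID_ALLOC		(1u << 0)
#define PASID_FREE		(1u << 1)
#define PASID_REQUEST_MASK	(PASID_ALLOC | PASID_FREE)

/* Mirrors the request layout proposed in this patch. */
struct pasid_request {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct { uint32_t min, max, result; } alloc_pasid;
		uint32_t free_pasid;
	};
};

/* Local stand-in for the kernel's offsetofend(). */
#define offsetofend_sketch(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Bytes the caller must supply for these flags, or 0 if they are invalid. */
static size_t pasid_request_size(uint32_t flags)
{
	if (flags & ~PASID_REQUEST_MASK)
		return 0;
	switch (flags) {
	case PASID_ALLOC:
		return offsetofend_sketch(struct pasid_request, alloc_pasid);
	case PASID_FREE:
		return offsetofend_sketch(struct pasid_request, free_pasid);
	default:
		return 0;	/* neither or both operations requested */
	}
}

/* 'user' is assumed to point at least argsz readable bytes. */
static int pasid_request_copy_in(struct pasid_request *req, const void *user)
{
	const size_t minsz = offsetofend_sketch(struct pasid_request, flags);
	size_t need;

	assert(req && user);
	memset(req, 0, sizeof(*req));
	memcpy(req, user, minsz);	/* stands in for copy_from_user() */
	if (req->argsz < minsz)
		return -EINVAL;

	need = pasid_request_size(req->flags);
	if (need == 0 || req->argsz < need)
		return -EINVAL;

	/* Only now is it known that the payload is covered by argsz. */
	memcpy((char *)req + minsz, (const char *)user + minsz, need - minsz);
	return 0;
}
```

The same per-flag size would also bound the copy_to_user() of the result on
the alloc path.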

> +
> +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> +		case VFIO_IOMMU_PASID_ALLOC:
> +		{
> +			int ret = 0, result;
> +
> +			result = vfio_iommu_type1_pasid_alloc(iommu,
> +							req.alloc_pasid.min,
> +							req.alloc_pasid.max);
> +			if (result > 0) {
> +				offset = offsetof(
> +					struct vfio_iommu_type1_pasid_request,
> +					alloc_pasid.result);
> +				ret = copy_to_user(
> +					      (void __user *) (arg + offset),
> +					      &result, sizeof(result));

Again assuming argsz supports this.

> +			} else {
> +				pr_debug("%s: PASID alloc failed\n", __func__);

rate limit?

> +				ret = -EFAULT;
> +			}
> +			return ret;
> +		}
> +		case VFIO_IOMMU_PASID_FREE:
> +			return vfio_iommu_type1_pasid_free(iommu,
> +							   req.free_pasid);
> +		default:
> +			return -EINVAL;
> +		}
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e42a711..75f9f7f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
>  extern void vfio_unregister_iommu_driver(
>  				const struct vfio_iommu_driver_ops *ops);
>  
> +#define VFIO_DEFAULT_PASID_QUOTA	1000
> +struct vfio_mm_token {
> +	unsigned long long val;
> +};
> +
> +struct vfio_mm {
> +	struct kref			kref;
> +	struct vfio_mm_token		token;
> +	int				ioasid_sid;
> +	/* protect @pasid_quota field and pasid allocation/free */
> +	struct mutex			pasid_lock;
> +	int				pasid_quota;
> +	struct list_head		vfio_next;
> +};
> +
> +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> +extern void vfio_mm_put(struct vfio_mm *vmm);
> +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> +
>  /*
>   * External user API
>   */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a1..298ac80 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
>  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>  
> +/*
> + * PASID (Process Address Space ID) is a PCIe concept which
> + * has been extended to support DMA isolation in fine-grain.
> + * With device assigned to user space (e.g. VMs), PASID alloc
> + * and free need to be system wide. This structure defines
> + * the info for pasid alloc/free between user space and kernel
> + * space.
> + *
> + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> + */
> +struct vfio_iommu_type1_pasid_request {
> +	__u32	argsz;
> +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> +	__u32	flags;
> +	union {
> +		struct {
> +			__u32 min;
> +			__u32 max;
> +			__u32 result;
> +		} alloc_pasid;
> +		__u32 free_pasid;
> +	};

We seem to be using __u8 data[] lately where the struct at data is
defined by the flags.  Should we do that here?
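
For illustration, a userspace sketch of what the data[] convention could look
like here. The names (pasid_request_hdr, pasid_alloc_data,
pasid_alloc_payload, PASID_ALLOC_SKETCH) are hypothetical, and this is not
the layout any existing VFIO ioctl uses; it only shows a fixed header
followed by a payload whose type is selected by flags and whose presence is
checked against argsz.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PASID_ALLOC_SKETCH	(1u << 0)

/* Fixed header; the payload type at data[] is selected by flags. */
struct pasid_request_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint8_t  data[];
};

/* Payload used when flags == PASID_ALLOC_SKETCH. */
struct pasid_alloc_data {
	uint32_t min;
	uint32_t max;
	uint32_t result;
};

/* Extract the alloc payload after checking that argsz covers it. */
static int pasid_alloc_payload(const void *arg, struct pasid_alloc_data *out)
{
	struct pasid_request_hdr hdr;

	assert(arg && out);
	memcpy(&hdr, arg, sizeof(hdr));	/* header only: 8 bytes */
	if (hdr.flags != PASID_ALLOC_SKETCH)
		return -EINVAL;
	if (hdr.argsz < sizeof(hdr) + sizeof(*out))
		return -EINVAL;
	memcpy(out, (const char *)arg + offsetof(struct pasid_request_hdr, data),
	       sizeof(*out));
	return 0;
}
```

New operations would then add a payload struct and a flag without growing a
union in the uapi header.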

> +};
> +
> +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
> +					 VFIO_IOMMU_PASID_FREE)
> +
> +/**
> + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> + *				struct vfio_iommu_type1_pasid_request)
> + *
> + * Availability of this feature depends on PASID support in the device,
> + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> + * is available after VFIO_SET_IOMMU.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)

So a user needs to try to allocate a PASID in order to test for the
support?  Should we have a PROBE flag?

> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*



* Re: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-03-30 11:44           ` Tian, Kevin
@ 2020-04-02 17:58             ` Alex Williamson
  2020-04-03  8:15               ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 17:58 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Mon, 30 Mar 2020 11:44:08 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, March 30, 2020 5:27 PM
> >   
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Monday, March 30, 2020 5:20 PM
> > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
> > >  
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Monday, March 30, 2020 4:53 PM
> > > >  
> > > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > > Sent: Monday, March 30, 2020 4:41 PM
> > > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
> > > > >  
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > > >
> > > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > > >
> > > > > > > This patch adds a module option to make the PASID quota tunable by
> > > > > > > the administrator.
> > > > > > >
> > > > > > > TODO: think more about how to make the tuning per-process.
> > > > > >
> > > > > > Previous discussions:
> > > > > > https://patchwork.kernel.org/patch/11209429/
> > > > > >
> > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > ---
> > > > > >  drivers/vfio/vfio.c             | 8 +++++++-
> > > > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
> > > > > >  include/linux/vfio.h            | 3 ++-
> > > > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > > > > > index d13b483..020a792 100644
> > > > > > > --- a/drivers/vfio/vfio.c
> > > > > > > +++ b/drivers/vfio/vfio.c
> > > > > > > @@ -2217,13 +2217,19 @@ struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > > > >
> > > > > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
> > > > > >  {
> > > > > >  	ioasid_t pasid;
> > > > > >  	int ret = -ENOSPC;
> > > > > >
> > > > > >  	mutex_lock(&vmm->pasid_lock);
> > > > > >
> > > > > > +	/* update quota as it is tunable by admin */
> > > > > > +	if (vmm->pasid_quota != quota) {
> > > > > > +		vmm->pasid_quota = quota;
> > > > > > +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > > > +	}
> > > > > > +  
> > > > >
> > > > > > It's a bit weird to have the quota adjusted in the alloc path, since the
> > > > > > latter might be initiated by non-privileged users. Why not do the
> > > > > > simple math in vfio_create_mm to set the quota when the ioasid set is
> > > > > > created? Even if you allow per-process quota setting in the future,
> > > > > > that should come from a separate privileged path instead of through
> > > > > > alloc.
> > > >
> > > > > The reason is that a kernel parameter modification raises no event that
> > > > > can be used to adjust the quota, so I chose to adjust it in the
> > > > > pasid_alloc path. If that's not good, how about adding one more ioctl to
> > > > > let userspace trigger a quota adjustment event? Then even though a
> > > > > non-privileged user could trigger the quota adjustment, the quota itself
> > > > > is still controlled by the privileged user. What is your opinion?
> > > >  
> > >
> > > > why do you need an event to adjust? As I said, you can set the quota when
> > > > the set is created in vfio_create_mm...
> > 
> > > oh, it's to support runtime adjustments. I guess it may be helpful to make
> > > the per-VM quota tunable even while the VM is running. If the quota is only
> > > set in vfio_create_mm(), it cannot be adjusted at runtime.
> >   
> 
> ok, I didn't notice that the module parameter was granted write permission.
> However, there is a further problem. We cannot support PASID reclaim now.
> What if the admin sets a quota smaller than the previous value while some
> IOASID sets already exceed the new quota? I'm not sure how to fail a runtime
> module parameter change in that situation. Possibly a normal sysfs node
> better suits the runtime change requirement...

Yep, making this runtime adjustable seems a bit unpredictable and racy,
and it's not clear to me how an admin is going to jump in at just the
right time for a user and adjust the limit.  I'd probably go for a
simple non-runtime adjustable module option.  It's a safety net at this
point anyway afaict.  Thanks,

Alex
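
A small userspace model of the non-runtime-adjustable variant discussed
above, assuming the quota is sampled exactly once when the per-mm set is
created. In the kernel this would presumably be a module parameter exposed
read-only, for example via module_param() with 0444 permissions; here
pasid_quota_param is a plain variable standing in for it, and
pasid_set_sketch and its helpers are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a module parameter that is read-only after load. */
static int pasid_quota_param = 1000;

/* Hypothetical per-mm set: the quota is fixed for the set's lifetime. */
struct pasid_set_sketch {
	int quota;
	int used;
};

static void pasid_set_create(struct pasid_set_sketch *set)
{
	assert(set);
	set->quota = pasid_quota_param;	/* sampled once, at creation */
	set->used = 0;
}

/* Returns the number of PASIDs now in use, or -ENOSPC past the quota. */
static int pasid_set_alloc(struct pasid_set_sketch *set)
{
	assert(set);
	if (set->used >= set->quota)
		return -ENOSPC;
	return ++set->used;
}
```

Because the alloc path never rereads the parameter, a set can never find
itself above a quota that shrank underneath it, which is the case the thread
had no good way to fail.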



* Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-03-22 12:32 ` [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace Liu, Yi L
  2020-03-30  9:43   ` Tian, Kevin
  2020-04-01  9:41   ` Auger Eric
@ 2020-04-02 18:01   ` Alex Williamson
  2020-04-03  8:17     ` Liu, Yi L
  2 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 18:01 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, iommu, kvm,
	linux-kernel, hao.wu

On Sun, 22 Mar 2020 05:32:00 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@intel.com>
> 
> This patch reports PASID alloc/free availability to userspace (e.g. QEMU)
> so that userspace can do a pre-check before using this feature.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  8 ++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e40afc0..ddd1ffe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> +					 struct vfio_info_cap *caps)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +
> +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	nesting_cap = container_of(header,
> +				struct vfio_iommu_type1_info_cap_nesting,
> +				header);
> +
> +	nesting_cap->nesting_capabilities = 0;
> +	if (iommu->nesting) {
> +		/* nesting iommu type supports PASID requests (alloc/free) */
> +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> +	}
> +
> +	return 0;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		if (ret)
>  			return ret;
>  
> +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> +		if (ret)
> +			return ret;
> +
>  		if (caps.size) {
>  			info.flags |= VFIO_IOMMU_INFO_CAPS;
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 298ac80..8837219 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
>  	struct	vfio_iova_range iova_ranges[];
>  };
>  
> +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> +
> +struct vfio_iommu_type1_info_cap_nesting {
> +	struct	vfio_info_cap_header header;
> +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
> +	__u32	nesting_capabilities;
> +};
> +
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
>  
>  /**

I think this answers my PROBE question on patch 1/8.  Should the
quota/usage be exposed to the user here?  Thanks,

Alex
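
For reference, a sketch of the pre-check userspace could perform with this
capability: walk the capability chain returned by VFIO_IOMMU_GET_INFO and
test the PASID request bit. The header layout (id, version, next, with next
being an offset from the start of the info buffer and 0 terminating the
chain) follows struct vfio_info_cap_header; the nesting capability ID and
flag are the values proposed in this v1 patch and may change. The structs are
redefined locally so the example is self-contained, and the function names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CAP_NESTING_ID	2		/* VFIO_IOMMU_TYPE1_INFO_CAP_NESTING in this patch */
#define PASID_REQS	(1u << 0)	/* VFIO_IOMMU_PASID_REQS in this patch */

/* Local mirror of struct vfio_info_cap_header. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next capability, 0 ends the chain */
};

/* Local mirror of the nesting capability proposed in this patch. */
struct cap_nesting {
	struct cap_header header;
	uint32_t nesting_capabilities;
};

/* Returns the offset of the capability with the given id, or -1. */
static long find_cap_offset(const uint8_t *info, size_t len,
			    uint32_t first, uint16_t id)
{
	uint32_t off = first;
	size_t hops = 0;

	assert(info);
	while (off) {
		struct cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return -1;	/* header would run past the buffer */
		if (hops++ >= len / sizeof(hdr))
			return -1;	/* more hops than headers fit: a loop */
		memcpy(&hdr, info + off, sizeof(hdr));
		if (hdr.id == id)
			return (long)off;
		off = hdr.next;
	}
	return -1;
}

/* Returns 1 if the nesting capability advertises PASID alloc/free. */
static int supports_pasid_requests(const uint8_t *info, size_t len,
				   uint32_t first)
{
	struct cap_nesting cap;
	long off = find_cap_offset(info, len, first, CAP_NESTING_ID);

	if (off < 0 || len - (size_t)off < sizeof(cap))
		return 0;
	memcpy(&cap, info + off, sizeof(cap));
	return (cap.nesting_capabilities & PASID_REQS) != 0;
}
```

In real use, info would be the buffer filled by VFIO_IOMMU_GET_INFO and first
would be the cap_offset field, valid only when VFIO_IOMMU_INFO_CAPS is set in
the returned flags.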



* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
                     ` (2 preceding siblings ...)
  2020-04-01  8:51   ` Auger Eric
@ 2020-04-02 19:20   ` Alex Williamson
  2020-04-03 11:59     ` Liu, Yi L
  3 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 19:20 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, iommu, kvm,
	linux-kernel, hao.wu

On Sun, 22 Mar 2020 05:32:02 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> capability to userspace. Thus applications like QEMU could support
> vIOMMU with hardware's nesting translation capability for pass-through
> devices. Before setting up nesting translation for pass-through devices,
> QEMU and other applications need to learn the supported 1st-lvl/stage-1
> translation structure format like page table format.
> 
> Take vSVA (virtual Shared Virtual Addressing) as an example, to support
> vSVA for pass-through devices, QEMU setup nesting translation for pass-
> through devices. The guest page table are configured to host as 1st-lvl/
> stage-1 page table. Therefore, guest format should be compatible with
> host side.
> 
> This patch reports the supported 1st-lvl/stage-1 page table format on the
> current platform to userspace. QEMU and other alike applications should
> use this format info when trying to setup IOMMU nesting translation on
> host IOMMU.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  1 +
>  2 files changed, 57 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 9aa2a67..82a9e0b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> +					 u32 *stage1_format)
> +{
> +	struct vfio_domain *domain;
> +	u32 format = 0, tmp_format = 0;
> +	int ret;
> +
> +	mutex_lock(&iommu->lock);
> +	if (list_empty(&iommu->domain_list)) {
> +		mutex_unlock(&iommu->lock);
> +		return -EINVAL;
> +	}
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (iommu_domain_get_attr(domain->domain,
> +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> +			ret = -EINVAL;
> +			format = 0;
> +			goto out_unlock;
> +		}
> +		/*
> +		 * format is always non-zero (the first format is
> +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> +		 * the reason of potential different backed IOMMU
> +		 * formats, here we expect to have identical formats
> +		 * in the domain list, no mixed formats support.
> +		 * return -EINVAL to fail the attempt of setup
> +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> +		 * are detected.
> +		 */
> +		if (tmp_format && tmp_format != format) {
> +			ret = -EINVAL;
> +			format = 0;
> +			goto out_unlock;
> +		}
> +
> +		tmp_format = format;
> +	}
> +	ret = 0;
> +
> +out_unlock:
> +	if (format)
> +		*stage1_format = format;
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  					 struct vfio_info_cap *caps)
>  {
>  	struct vfio_info_cap_header *header;
>  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> +	u32 formats = 0;
> +	int ret;
> +
> +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> +	if (ret) {
> +		pr_warn("Failed to get stage-1 format\n");
> +		return ret;

Looks like this generates a warning and causes the iommu_get_info ioctl
to fail if the hardware doesn't support the pasid format attribute, or
the domain list is empty.  This breaks users on existing hardware.
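
[Editor's sketch] One way to read the concern above: a missing stage-1
format could mean "do not advertise the nesting capability" rather than
"fail the whole GET_INFO call".  The userspace model below illustrates
only that control flow.  Every name in it (model_caps,
model_get_stage1_format, model_get_info) is hypothetical and is not
taken from the patch or from the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the info capability chain. */
struct model_caps {
	int      has_nesting_cap;
	uint32_t stage1_format;
};

/*
 * Hypothetical probe: returns 0 and fills *format when the platform
 * reports a stage-1 format, or -EINVAL when it does not (no domain
 * attached yet, or hardware without the attribute).
 */
static int model_get_stage1_format(int hw_format, uint32_t *format)
{
	if (hw_format <= 0)
		return -EINVAL;
	*format = (uint32_t)hw_format;
	return 0;
}

/*
 * Build the capability chain.  A missing stage-1 format only means the
 * nesting capability is not advertised; it is not treated as an error.
 */
static int model_get_info(int hw_format, struct model_caps *caps)
{
	uint32_t format = 0;

	assert(caps);
	caps->has_nesting_cap = 0;
	caps->stage1_format = 0;

	if (model_get_stage1_format(hw_format, &format))
		return 0;	/* skip the capability, keep GET_INFO working */

	caps->has_nesting_cap = 1;
	caps->stage1_format = format;
	return 0;
}
```

In this model, a caller on hardware without the attribute still gets a
successful return and simply sees no nesting capability.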

> +	}
>  
>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  		/* nesting iommu type supports PASID requests (alloc/free) */
>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
>  	}
> +	nesting_cap->stage1_formats = formats;
>  
>  	return 0;
>  }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ed9881d..ebeaf3e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -763,6 +763,7 @@ struct vfio_iommu_type1_info_cap_nesting {
>  	struct	vfio_info_cap_header header;
>  #define VFIO_IOMMU_PASID_REQS	(1 << 0)
>  	__u32	nesting_capabilities;
> +	__u32	stage1_formats;
>  };
>  
>  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-03-22 12:32 ` [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Liu, Yi L
  2020-03-22 18:10   ` kbuild test robot
  2020-03-30 12:46   ` Tian, Kevin
@ 2020-04-02 19:57   ` Alex Williamson
  2020-04-03 13:30     ` Liu, Yi L
  2020-04-11  5:52     ` Liu, Yi L
  2 siblings, 2 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 19:57 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, iommu, kvm,
	linux-kernel, hao.wu

On Sun, 22 Mar 2020 05:32:03 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@intel.com>
> 
> VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by hardware
> IOMMUs that have nesting DMA translation (a.k.a dual stage address
> translation). For such hardware IOMMUs, there are two stages/levels of
> address translation, and software may let userspace/VM to own the first-
> level/stage-1 translation structures. Example of such usage is vSVA (
> virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> translation structures and bind the structures to host, then hardware
> IOMMU would utilize nesting translation when doing DMA translation for
> the devices behind such hardware IOMMU.
> 
> This patch adds vfio support for binding guest translation (a.k.a stage 1)
> structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only bind
> guest page table is needed, it also requires to expose interface to guest
> for iommu cache invalidation when guest modified the first-level/stage-1
> translation structures since hardware needs to be notified to flush stale
> iotlbs. This would be introduced in next patch.
> 
> In this patch, guest page table bind and unbind are done by using flags
> VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> struct iommu_gpasid_bind_data. Before binding guest page table to host,
> VM should have got a PASID allocated by host via VFIO_IOMMU_PASID_REQUEST.
> 
> Bind guest translation structures (here is guest page table) to host
> are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       |  46 ++++++++++++
>  2 files changed, 204 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 82a9e0b..a877747 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -130,6 +130,33 @@ struct vfio_regions {
>  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
>  					(!list_empty(&iommu->domain_list))
>  
> +struct domain_capsule {
> +	struct iommu_domain *domain;
> +	void *data;
> +};
> +
> +/* iommu->lock must be held */
> +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> +		      int (*fn)(struct device *dev, void *data),
> +		      void *data)
> +{
> +	struct domain_capsule dc = {.data = data};
> +	struct vfio_domain *d;
> +	struct vfio_group *g;
> +	int ret = 0;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		dc.domain = d->domain;
> +		list_for_each_entry(g, &d->group_list, next) {
> +			ret = iommu_group_for_each_dev(g->iommu_group,
> +						       &dc, fn);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	return ret;
> +}
> +
>  static int put_pfn(unsigned long pfn, int prot);
>  
>  /*
> @@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
>  	return 0;
>  }
>  
> +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *gbind_data =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +}
> +
> +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_gpasid_bind_data *gbind_data =
> +		(struct iommu_gpasid_bind_data *) dc->data;
> +
> +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> +					gbind_data->hpasid);
> +}
> +
> +/**
> + * Unbind specific gpasid, caller of this function requires hold
> + * vfio_iommu->lock
> + */
> +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	return vfio_iommu_for_each_dev(iommu,
> +				vfio_unbind_gpasid_fn, gbind_data);
> +}
> +
> +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_for_each_dev(iommu,
> +			vfio_bind_gpasid_fn, gbind_data);
> +	/*
> +	 * If bind failed, it may not be a total failure. Some devices
> +	 * within the iommu group may have bind successfully. Although
> +	 * we don't enable pasid capability for non-singleton iommu
> +	 * groups, a unbind operation would be helpful to ensure no
> +	 * partial binding for an iommu group.

Where was the non-singleton group restriction done?  I missed that.

> +	 */
> +	if (ret)
> +		/*
> +		 * Undo all binds that already succeeded, no need to
> +		 * check the return value here since some device within
> +		 * the group has no successful bind when coming to this
> +		 * place switch.
> +		 */
> +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);

However, the for_each_dev function stops when the callback function
returns an error.  Are we just assuming that we stop at the same device
we faulted on the first time, and that we traverse the same set of
devices the second time?  It seems strange to me that unbind should be
able to fail.
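
[Editor's sketch] The rollback question can be made concrete with a
small userspace model: record how far the bind loop got, then undo only
the devices bound in this call, using an unbind that cannot fail.  This
is an illustration under assumptions, not the patch's code.  model_dev,
model_bind, model_unbind and model_bind_all are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_dev {
	int fail_bind;	/* set to make the hypothetical bind fail */
	int bound;
};

static int model_bind(struct model_dev *d)
{
	if (d->fail_bind)
		return -EIO;
	d->bound = 1;
	return 0;
}

/* Unbind is modelled as infallible, so rollback needs no error path. */
static void model_unbind(struct model_dev *d)
{
	d->bound = 0;
}

static int model_bind_all(struct model_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	assert(devs || n == 0);
	for (i = 0; i < n; i++) {
		ret = model_bind(&devs[i]);
		if (ret)
			break;
	}
	if (!ret)
		return 0;

	/* Roll back exactly devices [0, i): the ones bound in this call. */
	while (i > 0)
		model_unbind(&devs[--i]);
	return ret;
}
```

Because the rollback walks an explicit index range, it does not depend
on a second traversal stopping at the same device as the first one.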

> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> +				struct iommu_gpasid_bind_data *gbind_data)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&iommu->lock);
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> +		ret = -EINVAL;
> +		goto out_unlock;
> +	}
> +
> +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);

How is a user supposed to respond to their unbind failing?

> +
> +out_unlock:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		default:
>  			return -EINVAL;
>  		}
> +
> +	} else if (cmd == VFIO_IOMMU_BIND) {
> +		struct vfio_iommu_type1_bind bind;
> +		u32 version;
> +		int data_size;
> +		void *gbind_data;
> +		int ret;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> +
> +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (bind.argsz < minsz)
> +			return -EINVAL;
> +
> +		/* Get the version of struct iommu_gpasid_bind_data */
> +		if (copy_from_user(&version,
> +			(void __user *) (arg + minsz),
> +					sizeof(version)))
> +			return -EFAULT;

Why are we copying things from beyond the size we've validated that the
user has provided, again?

> +
> +		data_size = iommu_uapi_get_data_size(
> +				IOMMU_UAPI_BIND_GPASID, version);
> +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> +		if (!gbind_data)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(gbind_data,
> +			 (void __user *) (arg + minsz), data_size)) {
> +			kfree(gbind_data);
> +			return -EFAULT;
> +		}

And again.  argsz isn't just for minsz.

> +
> +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> +							   gbind_data);
> +			break;
> +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> +							     gbind_data);
> +			break;
> +		default:
> +			ret = -EINVAL;
> +			break;
> +		}
> +		kfree(gbind_data);
> +		return ret;
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ebeaf3e..2235bc6 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -14,6 +14,7 @@
>  
>  #include <linux/types.h>
>  #include <linux/ioctl.h>
> +#include <linux/iommu.h>
>  
>  #define VFIO_API_VERSION	0
>  
> @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
>   */
>  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
>  
> +/**
> + * Supported flags:
> + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
> + *			nesting type IOMMUs. In @data field It takes struct
> + *			iommu_gpasid_bind_data.
> + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
> + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.

This must require iommu_gpasid_bind_data in the data field as well,
right?

> + *
> + */
> +struct vfio_iommu_type1_bind {
> +	__u32		argsz;
> +	__u32		flags;
> +#define VFIO_IOMMU_BIND_GUEST_PGTBL	(1 << 0)
> +#define VFIO_IOMMU_UNBIND_GUEST_PGTBL	(1 << 1)
> +	__u8		data[];
> +};
> +
> +#define VFIO_IOMMU_BIND_MASK	(VFIO_IOMMU_BIND_GUEST_PGTBL | \
> +					VFIO_IOMMU_UNBIND_GUEST_PGTBL)
> +
> +/**
> + * VFIO_IOMMU_BIND - _IOW(VFIO_TYPE, VFIO_BASE + 23,
> + *				struct vfio_iommu_type1_bind)
> + *
> + * Manage address spaces of devices in this container. Initially a TYPE1
> + * container can only have one address space, managed with
> + * VFIO_IOMMU_MAP/UNMAP_DMA.
> + *
> + * An IOMMU of type VFIO_TYPE1_NESTING_IOMMU can be managed by both MAP/UNMAP
> + * and BIND ioctls at the same time. MAP/UNMAP acts on the stage-2 (host) page
> + * tables, and BIND manages the stage-1 (guest) page tables. Other types of
> + * IOMMU may allow MAP/UNMAP and BIND to coexist, where MAP/UNMAP controls
> + * the traffics only require single stage translation while BIND controls the
> + * traffics require nesting translation. But this depends on the underlying
> + * IOMMU architecture and isn't guaranteed. Example of this is the guest SVA
> + * traffics, such traffics need nesting translation to gain gVA->gPA and then
> + * gPA->hPA translation.
> + *
> + * Availability of this feature depends on the device, its bus, the underlying
> + * IOMMU and the CPU architecture.
> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +#define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-03-22 12:32 ` [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE Liu, Yi L
  2020-03-30 12:58   ` Tian, Kevin
  2020-03-31  7:56   ` Christoph Hellwig
@ 2020-04-02 20:24   ` Alex Williamson
  2020-04-03  6:39     ` Tian, Kevin
  2 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 20:24 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, iommu, kvm,
	linux-kernel, hao.wu

On Sun, 22 Mar 2020 05:32:04 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@linux.intel.com>
> 
> For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest "owns" the
> first-level/stage-1 translation structures, the host IOMMU driver has no
> knowledge of first-level/stage-1 structure cache updates unless the guest
> invalidation requests are trapped and propagated to the host.
> 
> This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to propagate guest
> first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU cache
> correctness.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> as the host IOMMU iotlb correctness are ensured.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
>  2 files changed, 71 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index a877747..937ec3f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -2423,6 +2423,15 @@ static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
>  	return ret;
>  }
>  
> +static int vfio_cache_inv_fn(struct device *dev, void *data)
> +{
> +	struct domain_capsule *dc = (struct domain_capsule *)data;
> +	struct iommu_cache_invalidate_info *cache_inv_info =
> +		(struct iommu_cache_invalidate_info *) dc->data;
> +
> +	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  				   unsigned int cmd, unsigned long arg)
>  {
> @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  		}
>  		kfree(gbind_data);
>  		return ret;
> +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> +		struct vfio_iommu_type1_cache_invalidate cache_inv;
> +		u32 version;
> +		int info_size;
> +		void *cache_info;
> +		int ret;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
> +				    flags);

This breaks backward compatibility as soon as struct
iommu_cache_invalidate_info changes size by its defined versioning
scheme, i.e. a field gets added, the version is bumped, and all existing
userspace breaks.  Our minsz should be offsetofend of the version field;
interpret the version to get the size, then re-evaluate argsz.
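
[Editor's sketch] The sequence described above (read up to the version
field, map the version to a payload size, then check argsz again) can
be modelled in userspace as follows.  The header layout, the size
table, and every model_* name are hypothetical; the buflen parameter
merely stands in for how much user memory copy_from_user() could read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct model_hdr {
	uint32_t argsz;
	uint32_t flags;
};

/* Hypothetical payload size per version; index 0 is an invalid version. */
static const uint32_t model_payload_size[] = { 0, 16, 24 };

static int model_payload_len(const uint8_t *buf, size_t buflen, uint32_t *len)
{
	/* Smallest valid request: the header plus the version field. */
	const size_t minsz = sizeof(struct model_hdr) + sizeof(uint32_t);
	struct model_hdr hdr;
	uint32_t version, size;

	assert(buf && len);
	if (buflen < minsz)
		return -EFAULT;		/* models a failed copy_from_user() */
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.argsz < minsz)
		return -EINVAL;

	memcpy(&version, buf + sizeof(hdr), sizeof(version));
	if (version == 0 ||
	    version >= sizeof(model_payload_size) / sizeof(model_payload_size[0]))
		return -EINVAL;

	size = model_payload_size[version];
	if (hdr.argsz < sizeof(hdr) + size)
		return -EINVAL;		/* user did not supply the whole payload */

	*len = size;
	return 0;
}
```

With this shape, a larger payload introduced by a newer version is only
read when the caller's argsz says it was actually provided.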

> +
> +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (cache_inv.argsz < minsz || cache_inv.flags)
> +			return -EINVAL;
> +
> +		/* Get the version of struct iommu_cache_invalidate_info */
> +		if (copy_from_user(&version,
> +			(void __user *) (arg + minsz), sizeof(version)))
> +			return -EFAULT;
> +
> +		info_size = iommu_uapi_get_data_size(
> +					IOMMU_UAPI_CACHE_INVAL, version);
> +
> +		cache_info = kzalloc(info_size, GFP_KERNEL);
> +		if (!cache_info)
> +			return -ENOMEM;
> +
> +		if (copy_from_user(cache_info,
> +			(void __user *) (arg + minsz), info_size)) {
> +			kfree(cache_info);
> +			return -EFAULT;
> +		}
> +
> +		mutex_lock(&iommu->lock);
> +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> +					    cache_info);

How does a user respond when their cache invalidate fails?  Isn't this
also another case where our for_each_dev can fail at an arbitrary point,
leaving us with no idea whether each device even had the opportunity to
perform the invalidation request?  I don't see how we have any chance
to maintain coherency after this faults.

> +		mutex_unlock(&iommu->lock);
> +		kfree(cache_info);
> +		return ret;
>  	}
>  
>  	return -ENOTTY;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2235bc6..62ca791 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
>   */
>  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
>  
> +/**
> + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> + *			struct vfio_iommu_type1_cache_invalidate)
> + *
> + * Propagate guest IOMMU cache invalidation to the host. The cache
> + * invalidation information is conveyed by @cache_info, the content
> + * format would be structures defined in uapi/linux/iommu.h. User
> + * should be aware of that the struct  iommu_cache_invalidate_info
> + * has a @version field, vfio needs to parse this field before getting
> + * data from userspace.
> + *
> + * Availability of this IOCTL is after VFIO_SET_IOMMU.

Is this a necessary qualifier?  A user can try to call this ioctl at
any point, it only makes sense in certain configurations, but it should
always "do the right thing" relative to the container iommu config.

Also, I don't see anything in these last few patches testing the
operating IOMMU model.  What happens when a user calls them when not
using the nesting IOMMU?

Is this ioctl and the previous BIND ioctl only valid when configured
for the nesting IOMMU type?

> + *
> + * returns: 0 on success, -errno on failure.
> + */
> +struct vfio_iommu_type1_cache_invalidate {
> +	__u32   argsz;
> +	__u32   flags;
> +	struct	iommu_cache_invalidate_info cache_info;
> +};
> +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)

The future extension capabilities of this ioctl worry me, I wonder if
we should do another data[] with flag defining that data as CACHE_INFO.

> +
>  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
>  
>  /*


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs
  2020-03-22 12:32 ` [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu, Yi L
  2020-03-30 13:18   ` Tian, Kevin
@ 2020-04-02 20:33   ` Alex Williamson
  2020-04-03 13:39     ` Liu, Yi L
  1 sibling, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-02 20:33 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, kevin.tian, jacob.jun.pan, joro, ashok.raj,
	jun.j.tian, yi.y.sun, jean-philippe, peterx, iommu, kvm,
	linux-kernel, hao.wu

On Sun, 22 Mar 2020 05:32:05 -0700
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> From: Liu Yi L <yi.l.liu@intel.com>
> 
> Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> are used to achieve flexible device sharing across domains (e.g. VMs).
> Also there are hardware assisted mediated pass-through solutions from
> platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-domain
> concept, which means mdevs are protected by an iommu domain which is
> aux-domain of its physical device. Details can be found in the KVM
> presentation from Kevin Tian. IOMMU-backed equals to IOMMU-capable.
> 
> https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> 
> This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> out the physical device of an IOMMU-backed mdev and then invoking IOMMU
> requests to IOMMU layer with the physical device and the mdev's aux
> domain info.
> 
> With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> on IOMMU-backed mdevs.
> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> CC: Jun Tian <jun.j.tian@intel.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 937ec3f..d473665 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -132,6 +132,7 @@ struct vfio_regions {
>  
>  struct domain_capsule {
>  	struct iommu_domain *domain;
> +	struct vfio_group *group;
>  	void *data;
>  };
>  
> @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
>  	list_for_each_entry(d, &iommu->domain_list, next) {
>  		dc.domain = d->domain;
>  		list_for_each_entry(g, &d->group_list, next) {
> +			dc.group = g;
>  			ret = iommu_group_for_each_dev(g->iommu_group,
>  						       &dc, fn);
>  			if (ret)
> @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev, void *data)
>  	struct iommu_gpasid_bind_data *gbind_data =
>  		(struct iommu_gpasid_bind_data *) dc->data;
>  
> -	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> +	if (dc->group->mdev_group)
> +		return iommu_sva_bind_gpasid(dc->domain,
> +			vfio_mdev_get_iommu_device(dev), gbind_data);

But we can't assume an mdev device is IOMMU backed, so this can be
called with a NULL dev, which it appears will pretty quickly segfault
in intel_svm_bind_gpasid.
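
[Editor's sketch] A minimal userspace model of the missing check:
resolve the device the IOMMU request should target, and fail early when
an mdev has no backing IOMMU device.  model_device, model_target and
model_bind_fn are hypothetical names, and the -ENODEV return value is
an assumption rather than anything the thread specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_device {
	struct model_device *iommu_dev;	/* NULL when the mdev is not IOMMU backed */
};

/* Pick the device the IOMMU layer should see for this request. */
static struct model_device *model_target(struct model_device *dev, int is_mdev)
{
	if (!is_mdev)
		return dev;
	return dev->iommu_dev;
}

static int model_bind_fn(struct model_device *dev, int is_mdev)
{
	struct model_device *target;

	assert(dev);
	target = model_target(dev, is_mdev);
	if (!target)
		return -ENODEV;	/* error code is an assumption */
	return 0;		/* a real callback would bind 'target' here */
}
```

The same resolve-then-check step would apply equally to the unbind and
cache invalidate callbacks flagged further down in this reply.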

> +	else
> +		return iommu_sva_bind_gpasid(dc->domain,
> +						dev, gbind_data);
>  }
>  
>  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
>  	struct iommu_gpasid_bind_data *gbind_data =
>  		(struct iommu_gpasid_bind_data *) dc->data;
>  
> -	return iommu_sva_unbind_gpasid(dc->domain, dev,
> +	if (dc->group->mdev_group)
> +		return iommu_sva_unbind_gpasid(dc->domain,
> +					vfio_mdev_get_iommu_device(dev),
>  					gbind_data->hpasid);

Same

> +	else
> +		return iommu_sva_unbind_gpasid(dc->domain, dev,
> +						gbind_data->hpasid);
>  }
>  
>  /**
> @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
>  	struct iommu_cache_invalidate_info *cache_inv_info =
>  		(struct iommu_cache_invalidate_info *) dc->data;
>  
> -	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> +	if (dc->group->mdev_group)
> +		return iommu_cache_invalidate(dc->domain,
> +			vfio_mdev_get_iommu_device(dev), cache_inv_info);

And again

> +	else
> +		return iommu_cache_invalidate(dc->domain,
> +						dev, cache_inv_info);
>  }
>  
>  static long vfio_iommu_type1_ioctl(void *iommu_data,


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-02 17:50   ` Alex Williamson
@ 2020-04-03  5:58     ` Tian, Kevin
  2020-04-03 15:14       ` Alex Williamson
  2020-04-03 13:12     ` Liu, Yi L
  1 sibling, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-03  5:58 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 1:50 AM
> 
> On Sun, 22 Mar 2020 05:31:58 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> 
> Providing an allocation interface only allows for collaborative usage
> of PASIDs though.  Do we have any ability to enforce PASID usage or can
> a user spoof other PASIDs on the same BDF?

A user can access only the PASIDs allocated to itself, i.e. the specific IOASID
set tied to its mm_struct.

Thanks
Kevin

> 
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h            |  20 +++++++
> >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/vfio.h>
> >  #include <linux/wait.h>
> >  #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> >  #define DRIVER_VERSION	"0.3"
> >  #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> >  	struct mutex			group_lock;
> >  	struct cdev			group_cdev;
> >  	dev_t				group_devt;
> > +	struct list_head		vfio_mm_list;
> > +	struct mutex			vfio_mm_lock;
> >  	wait_queue_head_t		release_q;
> >  } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
> >  EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> >  /**
> > + * VFIO_MM objects - create, release, get, put, search
> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > +	struct vfio_mm *vmm;
> > +	struct vfio_mm_token *token;
> > +	int ret = 0;
> > +
> > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > +	if (!vmm)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* Per mm IOASID set used for quota control and group operations */
> > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > +	if (ret) {
> > +		kfree(vmm);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	kref_init(&vmm->kref);
> > +	token = &vmm->token;
> > +	token->val = mm;
> > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > +	mutex_init(&vmm->pasid_lock);
> > +
> > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > +
> > +	return vmm;
> > +}
> > +
> > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > +{
> > +	/* destroy the ioasid set */
> > +	ioasid_free_set(vmm->ioasid_sid, true);
> > +	mutex_unlock(&vfio.vfio_mm_lock);
> > +	kfree(vmm);
> > +}
> > +
> > +/* called with vfio.vfio_mm_lock held */
> > +static void vfio_mm_release(struct kref *kref)
> > +{
> > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > +	list_del(&vmm->vfio_next);
> > +	vfio_mm_unlock_and_free(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm)
> > +{
> > +	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > +
> > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > +static void vfio_mm_get(struct vfio_mm *vmm)
> > +{
> > +	kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = get_task_mm(task);
> > +	struct vfio_mm *vmm;
> > +	unsigned long long val = (unsigned long long) mm;
> > +
> > +	mutex_lock(&vfio.vfio_mm_lock);
> > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > +		if (vmm->token.val == val) {
> > +			vfio_mm_get(vmm);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	vmm = vfio_create_mm(mm);
> > +	if (IS_ERR(vmm))
> > +		vmm = NULL;
> > +out:
> > +	mutex_unlock(&vfio.vfio_mm_lock);
> > +	mmput(mm);
> > +	return vmm;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +{
> > +	ioasid_t pasid;
> > +	int ret = -ENOSPC;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +
> > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > +	if (pasid == INVALID_IOASID) {
> > +		ret = -ENOSPC;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = pasid;
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > +
> > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > +{
> > +	void *pdata;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > +	if (IS_ERR(pdata)) {
> > +		ret = PTR_ERR(pdata);
> > +		goto out_unlock;
> > +	}
> > +	ioasid_free(pasid);
> > +
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > +
> > +/**
> >   * Module/class support
> >   */
> >  static char *vfio_devnode(struct device *dev, umode_t *mode)
> > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> >  	idr_init(&vfio.group_idr);
> >  	mutex_init(&vfio.group_lock);
> >  	mutex_init(&vfio.iommu_drivers_lock);
> > +	mutex_init(&vfio.vfio_mm_lock);
> >  	INIT_LIST_HEAD(&vfio.group_list);
> >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> >  	init_waitqueue_head(&vfio.release_q);
> >
> >  	ret = misc_register(&vfio_dev);
> 
> Is vfio.c the right place for any of the above?  It seems like it could
> all be in a separate vfio_pasid module, similar to our virqfd module.
> 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index a177bf2..331ceee 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -70,6 +70,7 @@ struct vfio_iommu {
> >  	unsigned int		dma_avail;
> >  	bool			v2;
> >  	bool			nesting;
> > +	struct vfio_mm		*vmm;
> >  };
> >
> >  struct vfio_domain {
> > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  static void *vfio_iommu_type1_open(unsigned long arg)
> >  {
> >  	struct vfio_iommu *iommu;
> > +	struct vfio_mm *vmm = NULL;
> >
> >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >  	if (!iommu)
> > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  	iommu->dma_avail = dma_entry_limit;
> >  	mutex_init(&iommu->lock);
> >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > +	vmm = vfio_mm_get_from_task(current);
> > +	if (!vmm)
> > +		pr_err("Failed to get vfio_mm track\n");
> 
> Doesn't this presume everyone is instantly running PASID capable hosts?
> Looks like a noisy support regression to me.
> 
> > +	iommu->vmm = vmm;
> >
> >  	return iommu;
> >  }
> > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >  	}
> >
> >  	vfio_iommu_iova_free(&iommu->iova_list);
> > +	if (iommu->vmm)
> > +		vfio_mm_put(iommu->vmm);
> >
> >  	kfree(iommu);
> >  }
> > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > +{
> > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > +		  flags & VFIO_IOMMU_PASID_FREE));
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > +					 int min,
> > +					 int max)
> > +{
> > +	struct vfio_mm *vmm = iommu->vmm;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EFAULT;
> > +		goto out_unlock;
> > +	}
> 
> Non-iommu backed mdevs are excluded from this?  Is this a matter of
> wiring the call out through the mdev parent device, or is this just
> possible?
> 
> > +	if (vmm)
> > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > +	else
> > +		ret = -EINVAL;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > +				       unsigned int pasid)
> > +{
> > +	struct vfio_mm *vmm = iommu->vmm;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EFAULT;
> > +		goto out_unlock;
> > +	}
> 
> So if a container had an iommu backed device when the pasid was
> allocated, but it was removed, now they can't free it?  Why do we need
> the check above?
> 
> > +
> > +	if (vmm)
> > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > +	else
> > +		ret = -EINVAL;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >
> >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >  			-EFAULT : 0;
> > +
> > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > +		struct vfio_iommu_type1_pasid_request req;
> > +		unsigned long offset;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > +				    flags);
> > +
> > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (req.argsz < minsz ||
> > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > +			return -EINVAL;
> > +
> > +		if (copy_from_user((void *)&req + minsz,
> > +				   (void __user *)arg + minsz,
> > +				   sizeof(req) - minsz))
> > +			return -EFAULT;
> 
> Huh?  Why do we have argsz if we're going to assume this is here?
> 
> > +
> > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > +		case VFIO_IOMMU_PASID_ALLOC:
> > +		{
> > +			int ret = 0, result;
> > +
> > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > +							req.alloc_pasid.min,
> > +							req.alloc_pasid.max);
> > +			if (result > 0) {
> > +				offset = offsetof(
> > +					struct vfio_iommu_type1_pasid_request,
> > +					alloc_pasid.result);
> > +				ret = copy_to_user(
> > +					      (void __user *) (arg + offset),
> > +					      &result, sizeof(result));
> 
> Again assuming argsz supports this.
> 
> > +			} else {
> > +				pr_debug("%s: PASID alloc failed\n", __func__);
> 
> rate limit?
> 
> > +				ret = -EFAULT;
> > +			}
> > +			return ret;
> > +		}
> > +		case VFIO_IOMMU_PASID_FREE:
> > +			return vfio_iommu_type1_pasid_free(iommu,
> > +							   req.free_pasid);
> > +		default:
> > +			return -EINVAL;
> > +		}
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711..75f9f7f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> >  extern void vfio_unregister_iommu_driver(
> >  				const struct vfio_iommu_driver_ops *ops);
> >
> > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > +struct vfio_mm_token {
> > +	unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > +	struct kref			kref;
> > +	struct vfio_mm_token		token;
> > +	int				ioasid_sid;
> > +	/* protect @pasid_quota field and pasid allocation/free */
> > +	struct mutex			pasid_lock;
> > +	int				pasid_quota;
> > +	struct list_head		vfio_next;
> > +};
> > +
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > +
> >  /*
> >   * External user API
> >   */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..298ac80 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/*
> > + * PASID (Process Address Space ID) is a PCIe concept which
> > + * has been extended to support DMA isolation in fine-grain.
> > + * With device assigned to user space (e.g. VMs), PASID alloc
> > + * and free need to be system wide. This structure defines
> > + * the info for pasid alloc/free between user space and kernel
> > + * space.
> > + *
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > +	__u32	argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > +	__u32	flags;
> > +	union {
> > +		struct {
> > +			__u32 min;
> > +			__u32 max;
> > +			__u32 result;
> > +		} alloc_pasid;
> > +		__u32 free_pasid;
> > +	};
> 
> We seem to be using __u8 data[] lately where the struct at data is
> defined by the flags.  should we do that here?
> 
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
> > +					 VFIO_IOMMU_PASID_FREE)
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + *				struct vfio_iommu_type1_pasid_request)
> > + *
> > + * Availability of this feature depends on PASID support in the device,
> > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > + * is available after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
> 
> So a user needs to try to allocate a PASID in order to test for the
> support?  Should we have a PROBE flag?
> 
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-02 20:24   ` Alex Williamson
@ 2020-04-03  6:39     ` Tian, Kevin
  2020-04-03 15:31       ` Jacob Pan
  2020-04-03 15:34       ` Alex Williamson
  0 siblings, 2 replies; 110+ messages in thread
From: Tian, Kevin @ 2020-04-03  6:39 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 4:24 AM
> 
> On Sun, 22 Mar 2020 05:32:04 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@linux.intel.com>
> >
> > For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest "owns" the
> > first-level/stage-1 translation structures, the host IOMMU driver has no
> > knowledge of first-level/stage-1 structure cache updates unless the guest
> > invalidation requests are trapped and propagated to the host.
> >
> > This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to propagate guest
> > first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU cache
> > correctness.
> >
> > With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> > as the host IOMMU iotlb correctness is ensured.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
> >  2 files changed, 71 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index a877747..937ec3f 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2423,6 +2423,15 @@ static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_cache_inv_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_cache_invalidate_info *cache_inv_info =
> > +		(struct iommu_cache_invalidate_info *) dc->data;
> > +
> > +	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		}
> >  		kfree(gbind_data);
> >  		return ret;
> > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > +		struct vfio_iommu_type1_cache_invalidate cache_inv;
> > +		u32 version;
> > +		int info_size;
> > +		void *cache_info;
> > +		int ret;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
> > +				    flags);
> 
> This breaks backward compatibility as soon as struct
> iommu_cache_invalidate_info changes size by its defined versioning
> scheme.  ie. a field gets added, the version is bumped, all existing
> userspace breaks.  Our minsz is offsetofend to the version field,
> interpret the version to size, then reevaluate argsz.

By the way, the versioning scheme has been challenged by Christoph
Hellwig. After some discussion, we need your guidance on how to move
forward. Jacob summarized the available options here:
	https://lkml.org/lkml/2020/4/2/876
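
For reference, the sizing order Alex describes above (minsz up to the
version field, version mapped to a size, argsz then re-checked) could
look roughly like the user-space sketch below. This is only a sketch
under assumptions: struct inv_hdr, the version-to-size table and both
function names are made up for illustration and are not the iommu uAPI.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: fixed vfio header, then the payload's version field. */
struct inv_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint32_t version;	/* first field of the versioned payload */
};

/* Hypothetical version-to-size table (sizes are invented for the sketch). */
static int payload_size_for_version(uint32_t version)
{
	switch (version) {
	case 1:
		return 64;
	case 2:
		return 96;
	default:
		return -EINVAL;
	}
}

/*
 * Returns the number of payload bytes to copy, or a negative errno.
 * Step 1: argsz must at least cover everything up to and including version.
 * Step 2: translate the version into a payload size.
 * Step 3: re-evaluate argsz against header + payload size.
 */
int inv_payload_size(const struct inv_hdr *hdr)
{
	size_t hdrsz = offsetof(struct inv_hdr, version);
	size_t minsz = hdrsz + sizeof(hdr->version);
	int size;

	if (hdr->argsz < minsz || hdr->flags)
		return -EINVAL;

	size = payload_size_for_version(hdr->version);
	if (size < 0)
		return size;

	if (hdr->argsz < hdrsz + (size_t)size)
		return -EINVAL;

	return size;
}
```

With this ordering, a version 1 caller keeps working after a version 2
layout is added, because its argsz is only compared against the
version 1 size.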

> 
> > +
> > +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (cache_inv.argsz < minsz || cache_inv.flags)
> > +			return -EINVAL;
> > +
> > +		/* Get the version of struct iommu_cache_invalidate_info */
> > +		if (copy_from_user(&version,
> > +			(void __user *) (arg + minsz), sizeof(version)))
> > +			return -EFAULT;
> > +
> > +		info_size = iommu_uapi_get_data_size(
> > +					IOMMU_UAPI_CACHE_INVAL, version);
> > +
> > +		cache_info = kzalloc(info_size, GFP_KERNEL);
> > +		if (!cache_info)
> > +			return -ENOMEM;
> > +
> > +		if (copy_from_user(cache_info,
> > +			(void __user *) (arg + minsz), info_size)) {
> > +			kfree(cache_info);
> > +			return -EFAULT;
> > +		}
> > +
> > +		mutex_lock(&iommu->lock);
> > +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> > +					    cache_info);
> 
> How does a user respond when their cache invalidate fails?  Isn't this
> also another case where our for_each_dev can fail at an arbitrary point
> leaving us with no idea whether each device even had the opportunity to
> perform the invalidation request.  I don't see how we have any chance
> to maintain coherency after this faults.

Then can we keep it simple and support singleton groups only?

> 
> > +		mutex_unlock(&iommu->lock);
> > +		kfree(cache_info);
> > +		return ret;
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 2235bc6..62ca791 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
> >   */
> >  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> >
> > +/**
> > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> > + *			struct vfio_iommu_type1_cache_invalidate)
> > + *
> > + * Propagate guest IOMMU cache invalidation to the host. The cache
> > + * invalidation information is conveyed by @cache_info, the content
> > + * format would be structures defined in uapi/linux/iommu.h. User
> > + * should be aware of that the struct  iommu_cache_invalidate_info
> > + * has a @version field, vfio needs to parse this field before getting
> > + * data from userspace.
> > + *
> > + * Availability of this IOCTL is after VFIO_SET_IOMMU.
> 
> Is this a necessary qualifier?  A user can try to call this ioctl at
> any point, it only makes sense in certain configurations, but it should
> always "do the right thing" relative to the container iommu config.
> 
> Also, I don't see anything in these last few patches testing the
> operating IOMMU model, what happens when a user calls them when not
> using the nesting IOMMU?
> 
> Is this ioctl and the previous BIND ioctl only valid when configured
> for the nesting IOMMU type?

I think so. We should add the nesting check in those new ioctls.
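
A minimal sketch of such a check, assuming a trimmed-down container
state and an -EINVAL return (both are assumptions for illustration; the
struct, enum and function names below are hypothetical):

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced container state; only the field used here. */
struct container_state {
	bool nesting;	/* true when configured as VFIO_TYPE1_NESTING_IOMMU */
};

/* The ioctls that only make sense with the nesting IOMMU type. */
enum nested_op { OP_BIND, OP_UNBIND, OP_CACHE_INVALIDATE };

/* Returns 0 if the op may proceed, -EINVAL (assumed errno) otherwise. */
int nested_op_allowed(const struct container_state *c, enum nested_op op)
{
	(void)op;	/* all three ops share one precondition in this sketch */

	if (c == NULL || !c->nesting)
		return -EINVAL;

	return 0;
}
```

Each new ioctl branch would call this before touching any domain, so a
container that is not using the nesting type gets a clean error.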

> 
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +struct vfio_iommu_type1_cache_invalidate {
> > +	__u32   argsz;
> > +	__u32   flags;
> > +	struct	iommu_cache_invalidate_info cache_info;
> > +};
> > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> 
> The future extension capabilities of this ioctl worry me, I wonder if
> we should do another data[] with flag defining that data as CACHE_INFO.

Can you elaborate? Does that mean we would no longer rely on the iommu
driver to provide the version_to_size conversion, and would instead just
pass data[] to the iommu driver for further auditing?

> 
> > +
> >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> >
> >  /*

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
  2020-04-02 17:58             ` Alex Williamson
@ 2020-04-03  8:15               ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03  8:15 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 1:59 AM
> To: Tian, Kevin <kevin.tian@intel.com>
> Subject: Re: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
> 
> On Mon, 30 Mar 2020 11:44:08 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Monday, March 30, 2020 5:27 PM
> > >
> > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Monday, March 30, 2020 5:20 PM
> > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
> > > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Monday, March 30, 2020 4:53 PM
> > > > >
> > > > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > > > Sent: Monday, March 30, 2020 4:41 PM
> > > > > > To: Liu, Yi L <yi.l.liu@intel.com>; alex.williamson@redhat.com;
> > > > > > > Subject: RE: [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning
> > > > > >
> > > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > > Sent: Sunday, March 22, 2020 8:32 PM
> > > > > > >
> > > > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > > > >
> > > > > > > This patch adds a module option to make the PASID quota tunable by
> > > > > > > administrator.
> > > > > > >
> > > > > > > > TODO: needs to think more on how to make the tuning to be
> > > > > > > > per-process.
> > > > > > >
> > > > > > > Previous discussions:
> > > > > > > https://patchwork.kernel.org/patch/11209429/
> > > > > > >
> > > > > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > > > > ---
> > > > > > >  drivers/vfio/vfio.c             | 8 +++++++-
> > > > > > >  drivers/vfio/vfio_iommu_type1.c | 7 ++++++-
> > > > > > >  include/linux/vfio.h            | 3 ++-
> > > > > > >  3 files changed, 15 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > > > > > > index d13b483..020a792 100644
> > > > > > > --- a/drivers/vfio/vfio.c
> > > > > > > +++ b/drivers/vfio/vfio.c
> > > > > > > @@ -2217,13 +2217,19 @@ struct vfio_mm
> > > > > *vfio_mm_get_from_task(struct
> > > > > > > task_struct *task)
> > > > > > >  }
> > > > > > >  EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > > > > >
> > > > > > > -int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > > > > > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int quota, int min, int max)
> > > > > > >  {
> > > > > > >  	ioasid_t pasid;
> > > > > > >  	int ret = -ENOSPC;
> > > > > > >
> > > > > > >  	mutex_lock(&vmm->pasid_lock);
> > > > > > >
> > > > > > > +	/* update quota as it is tunable by admin */
> > > > > > > +	if (vmm->pasid_quota != quota) {
> > > > > > > +		vmm->pasid_quota = quota;
> > > > > > > +		ioasid_adjust_set(vmm->ioasid_sid, quota);
> > > > > > > +	}
> > > > > > > +
> > > > > >
> > > > > > > It's a bit weird to have quota adjusted in the alloc path, since the
> > > > > > > latter might be initiated by non-privileged users. Why not doing the
> > > > > > > simple math in vfio_create_mm to set the quota when the ioasid set is
> > > > > > > created? even in the future you may allow per-process quota setting,
> > > > > > > that should come from separate privileged path instead of thru alloc..
> > > > >
> > > > > The reason is the kernel parameter modification has no event which can
> > > > > be used to adjust the quota. So I chose to adjust it in pasid_alloc
> > > > > path. If it's not good, how about adding one more IOCTL to let user-
> > > > > space trigger a quota adjustment event? Then even non-privileged user
> > > > > could trigger quota adjustment, the quota is actually controlled by
> > > > > privileged user. How about your opinion?
> > > > >
> > > >
> > > > > why do you need an event to adjust? As I said, you can set the quota
> > > > > when the set is created in vfio_create_mm...
> > >
> > > oh, it's to support runtime adjustments. I guess it may be helpful to let
> > > per-VM quota tunable even the VM is running. If just set the quota in
> > > vfio_create_mm(), it is not able to adjust at runtime.
> > >
> >
> > ok, I didn't note the module parameter was granted with a write permission.
> > However there is a further problem. We cannot support PASID reclaim now.
> > What about the admin sets a quota smaller than previous value while some
> > IOASID sets already exceed the new quota? I'm not sure how to fail a runtime
> > module parameter change due to that situation. possibly a normal sysfs
> > node better suites the runtime change requirement...
> 
> Yep, making this runtime adjustable seems a bit unpredictable and racy,
> and it's not clear to me how a user is going to jump in at just the
> right time for a user and adjust the limit.  I'd probably go for a
> simple non-runtime adjustable module option.  It's a safety net at this
> point anyway afaict.  Thanks,

Thanks, I can make the changes.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-04-02 18:01   ` Alex Williamson
@ 2020-04-03  8:17     ` Liu, Yi L
  2020-04-03 17:28       ` Alex Williamson
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03  8:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 2:01 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
> 
> On Sun, 22 Mar 2020 05:32:00 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > This patch reports PASID alloc/free availability to userspace (e.g.
> > QEMU) thus userspace could do a pre-check before utilizing this feature.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  8 ++++++++
> >  2 files changed, 36 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index e40afc0..ddd1ffe 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > +					 struct vfio_info_cap *caps)
> > +{
> > +	struct vfio_info_cap_header *header;
> > +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > +
> > +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > +	if (IS_ERR(header))
> > +		return PTR_ERR(header);
> > +
> > +	nesting_cap = container_of(header,
> > +				struct vfio_iommu_type1_info_cap_nesting,
> > +				header);
> > +
> > +	nesting_cap->nesting_capabilities = 0;
> > +	if (iommu->nesting) {
> > +		/* nesting iommu type supports PASID requests (alloc/free) */
> > +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		if (ret)
> >  			return ret;
> >
> > +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> > +		if (ret)
> > +			return ret;
> > +
> >  		if (caps.size) {
> >  			info.flags |= VFIO_IOMMU_INFO_CAPS;
> >
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 298ac80..8837219 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
> >  	struct	vfio_iova_range iova_ranges[];
> >  };
> >
> > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> > +
> > +struct vfio_iommu_type1_info_cap_nesting {
> > +	struct	vfio_info_cap_header header;
> > +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
> > +	__u32	nesting_capabilities;
> > +};
> > +
> >  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> >
> >  /**
> 
> I think this answers my PROBE question on patch 1/.

Yep.

> Should the quota/usage be exposed to the user here?  Thanks,

Do you mean reporting the quota available to this user in this cap info
as well? As for usage, do you mean alloc and free, or something else?

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-01 13:01       ` Auger Eric
@ 2020-04-03  8:23         ` Jean-Philippe Brucker
  2020-04-07  9:43           ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-03  8:23 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, alex.williamson, Tian, Kevin, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
> >>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> >>>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> >>> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >>>  		/* nesting iommu type supports PASID requests (alloc/free) */
> >>>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> >> What is the meaning for ARM?
> > 
> > I think it's just a software capability exposed to userspace, on
> > userspace side, it has a choice to use it or not. :-) The reason
> > define it and report it in cap nesting is that I'd like to make
> > the pasid alloc/free be available just for IOMMU with type
> > VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not
> > good for ARM. We can find a proper way to report the availability.
> 
> Well it is more a question for jean-Philippe. Do we have a system wide
> PASID allocation on ARM?

We don't, the PASID spaces are per-VM on Arm, so this function should
consult the IOMMU driver before setting flags. As you said on patch 3,
nested doesn't necessarily imply PASID support. The SMMUv2 does not
support PASID but does support nesting stages 1 and 2 for the IOVA space.
SMMUv3 support of PASID depends on HW capabilities. So I think this needs
to be finer grained:

Does the container support:
* VFIO_IOMMU_PASID_REQUEST?
  -> Yes for VT-d 3
  -> No for Arm SMMU
* VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
  -> Yes for VT-d 3
  -> Sometimes for SMMUv2
  -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
     PASID tables being in GPA space.)
* VFIO_IOMMU_BIND_PASID_TABLE?
  -> No for VT-d
  -> Sometimes for SMMUv3

Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
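
As a rough illustration of the finer-grained reporting listed above, the
container could expose one bit per ioctl family rather than a single
"PASID requests" flag. This is only a sketch: the flag names, the
iommu_nesting_hw_sketch structure and the helper are hypothetical and are
not part of this series or of the existing VFIO uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits, one per ioctl family (not real uAPI). */
#define NESTING_CAP_PASID_REQUEST     (1u << 0)
#define NESTING_CAP_BIND_GUEST_PGTBL  (1u << 1)
#define NESTING_CAP_BIND_PASID_TABLE  (1u << 2)
#define NESTING_CAP_CACHE_INVALIDATE  (1u << 3)

/* Hypothetical summary of what the IOMMU driver reports for a container. */
struct iommu_nesting_hw_sketch {
	bool system_wide_pasid;  /* host allocates PASIDs, e.g. VT-d scalable mode */
	bool bind_guest_pgtbl;   /* host can bind individual guest page tables */
	bool bind_pasid_table;   /* host can bind a whole guest PASID table */
};

/* Build the capability mask from what the driver reports. */
static uint32_t nesting_caps_sketch(const struct iommu_nesting_hw_sketch *hw)
{
	uint32_t caps = 0;

	if (hw->system_wide_pasid)
		caps |= NESTING_CAP_PASID_REQUEST;
	if (hw->bind_guest_pgtbl)
		caps |= NESTING_CAP_BIND_GUEST_PGTBL;
	if (hw->bind_pasid_table)
		caps |= NESTING_CAP_BIND_PASID_TABLE;

	/* Any bind support implies cache invalidation support. */
	if (caps & (NESTING_CAP_BIND_GUEST_PGTBL | NESTING_CAP_BIND_PASID_TABLE))
		caps |= NESTING_CAP_CACHE_INVALIDATE;

	return caps;
}
```

With this shape, userspace tests each bit separately instead of inferring
PASID support from the nesting type alone.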


> >>> +	nesting_cap->stage1_formats = formats;
> >> as spotted by Kevin, since a single format is supported, rename
> > 
> > ok, I was believing it may be possible on ARM or so. :-) will
> > rename it.

Yes, I don't think a u32 is going to cut it for Arm :( We need to describe
all sorts of capabilities for page and PASID tables (granules, GPA size,
ASID/PASID size, HW access/dirty, etc etc.) Just saying "Arm stage-1
format" wouldn't mean much. I guess we could have a secondary vendor
capability for these?

Thanks,
Jean


* Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-02  8:05         ` Liu, Yi L
@ 2020-04-03  8:34           ` Jean-Philippe Brucker
  2020-04-07 10:33             ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-03  8:34 UTC (permalink / raw)
  To: Yi L Liu
  Cc: Tian, Kevin, alex.williamson, eric.auger, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Thu, Apr 02, 2020 at 08:05:29AM +0000, Liu, Yi L wrote:
> > > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > >  		default:
> > > > >  			return -EINVAL;
> > > > >  		}
> > > > > +
> > > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > >
> > > > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> > >
> > > Emm, it's up to the flags to indicate bind what. It was proposed to
> > > cover the three cases below:
> > > a) BIND/UNBIND_GPASID
> > > b) BIND/UNBIND_GPASID_TABLE
> > > c) BIND/UNBIND_PROCESS
> > > <only a) is covered in this patch>
> > > So it's called VFIO_IOMMU_BIND.
> > 
> > but aren't they all about PASID related binding?
> 
> yeah, I can rename it. :-)

I don't know if anyone intends to implement it, but SMMUv2 supports
nesting translation without any PASID support. For that case the name
VFIO_IOMMU_BIND_GUEST_PGTBL without "PASID" anywhere makes more sense.
Ideally we'd also use a neutral name for the IOMMU API instead of
bind_gpasid(), but that's easier to change later.

Thanks,
Jean



* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-02 13:52   ` Jean-Philippe Brucker
@ 2020-04-03 11:56     ` Liu, Yi L
  2020-04-03 12:39       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03 11:56 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: alex.williamson, eric.auger, Tian, Kevin, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Jean,

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Thursday, April 2, 2020 9:53 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> Hi Yi,
> 
> On Sun, Mar 22, 2020 at 05:31:58AM -0700, Liu, Yi L wrote:
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access.
> 
> It's worth noting that this applies to Intel VT-d with scalable mode, not
> IOMMUs that use one PASID space per VM

Oh yes, will add it.

> 
> > It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> >
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h            |  20 +++++++
> >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/vfio.h>
> >  #include <linux/wait.h>
> >  #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> >  #define DRIVER_VERSION	"0.3"
> >  #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> >  	struct mutex			group_lock;
> >  	struct cdev			group_cdev;
> >  	dev_t				group_devt;
> > +	struct list_head		vfio_mm_list;
> > +	struct mutex			vfio_mm_lock;
> >  	wait_queue_head_t		release_q;
> >  } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
> >  EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> >  /**
> > + * VFIO_MM objects - create, release, get, put, search
> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > +	struct vfio_mm *vmm;
> > +	struct vfio_mm_token *token;
> > +	int ret = 0;
> > +
> > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > +	if (!vmm)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* Per mm IOASID set used for quota control and group operations */
> > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> 
> Hmm, either we need to change the token of ioasid_alloc_set() to "void *",
> or pass an actual ioasid_set struct, but this cast doesn't look good :)
>
> As I commented on the IOASID series, I think we could embed a struct
> ioasid_set into vfio_mm, pass that struct to all other ioasid_* functions,
> and get rid of ioasid_sid.

I think changing to "void *" is better, as we need the token to ensure all
threads within a single VM share the same ioasid_set.

> > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > +	if (ret) {
> > +		kfree(vmm);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	kref_init(&vmm->kref);
> > +	token = &vmm->token;
> > +	token->val = mm;
> 
> Why the intermediate token struct?  Could we just store the mm_struct
> pointer within vfio_mm?

Hmm, here we only want to use the pointer as a token, instead of using
the structure behind the pointer. If we store the mm_struct pointer
directly, it leaves room for later code to use its content, which is
not good.

Regards,
Yi Liu



* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-02 19:20   ` Alex Williamson
@ 2020-04-03 11:59     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03 11:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 3:20 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> On Sun, 22 Mar 2020 05:32:02 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO exposes IOMMU nesting translation (a.k.a dual stage translation)
> > capability to userspace. Thus applications like QEMU could support
> > vIOMMU with hardware's nesting translation capability for pass-through
> > devices. Before setting up nesting translation for pass-through devices,
> > QEMU and other applications need to learn the supported 1st-lvl/stage-1
> > translation structure format like page table format.
> >
> > Take vSVA (virtual Shared Virtual Addressing) as an example, to support
> > vSVA for pass-through devices, QEMU setup nesting translation for pass-
> > through devices. The guest page table are configured to host as 1st-lvl/
> > stage-1 page table. Therefore, guest format should be compatible with
> > host side.
> >
> > This patch reports the supported 1st-lvl/stage-1 page table format on the
> > current platform to userspace. QEMU and other alike applications should
> > use this format info when trying to setup IOMMU nesting translation on
> > host IOMMU.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 56 +++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  1 +
> >  2 files changed, 57 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 9aa2a67..82a9e0b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -2234,11 +2234,66 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static int vfio_iommu_get_stage1_format(struct vfio_iommu *iommu,
> > +					 u32 *stage1_format)
> > +{
> > +	struct vfio_domain *domain;
> > +	u32 format = 0, tmp_format = 0;
> > +	int ret;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (list_empty(&iommu->domain_list)) {
> > +		mutex_unlock(&iommu->lock);
> > +		return -EINVAL;
> > +	}
> > +
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		if (iommu_domain_get_attr(domain->domain,
> > +			DOMAIN_ATTR_PASID_FORMAT, &format)) {
> > +			ret = -EINVAL;
> > +			format = 0;
> > +			goto out_unlock;
> > +		}
> > +		/*
> > +		 * format is always non-zero (the first format is
> > +		 * IOMMU_PASID_FORMAT_INTEL_VTD which is 1). For
> > +		 * the reason of potential different backed IOMMU
> > +		 * formats, here we expect to have identical formats
> > +		 * in the domain list, no mixed formats support.
> > +		 * return -EINVAL to fail the attempt of setup
> > +		 * VFIO_TYPE1_NESTING_IOMMU if non-identical formats
> > +		 * are detected.
> > +		 */
> > +		if (tmp_format && tmp_format != format) {
> > +			ret = -EINVAL;
> > +			format = 0;
> > +			goto out_unlock;
> > +		}
> > +
> > +		tmp_format = format;
> > +	}
> > +	ret = 0;
> > +
> > +out_unlock:
> > +	if (format)
> > +		*stage1_format = format;
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  					 struct vfio_info_cap *caps)
> >  {
> >  	struct vfio_info_cap_header *header;
> >  	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > +	u32 formats = 0;
> > +	int ret;
> > +
> > +	ret = vfio_iommu_get_stage1_format(iommu, &formats);
> > +	if (ret) {
> > +		pr_warn("Failed to get stage-1 format\n");
> > +		return ret;
> 
> Looks like this generates a warning and causes the iommu_get_info ioctl
> to fail if the hardware doesn't support the pasid format attribute, or
> the domain list is empty.  This breaks users on existing hardware.

Oops, yes, it should not fail anything, as it is just an extended feature.
Let me correct it.
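
One way the corrected behaviour could look, modelled in plain userspace C:
when the stage-1 format cannot be determined, the nesting capability is
simply left out and the info call still succeeds. The structure and
function names below are stand-ins made up for this sketch, not the code
of the next version.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Stand-in for the capability chain returned by the info ioctl. */
struct info_caps_sketch {
	int has_nesting_cap;
	uint32_t stage1_format;
};

/*
 * Stand-in for vfio_iommu_get_stage1_format(): fails when the hardware
 * does not expose a PASID format attribute.
 */
static int get_stage1_format_sketch(int hw_has_format, uint32_t *format)
{
	if (!hw_has_format)
		return -EINVAL;
	*format = 1; /* IOMMU_PASID_FORMAT_INTEL_VTD, per the patch comment */
	return 0;
}

/* Add the nesting capability only when the format lookup succeeds. */
static int add_nesting_cap_sketch(int hw_has_format,
				  struct info_caps_sketch *caps)
{
	uint32_t format = 0;

	if (get_stage1_format_sketch(hw_has_format, &format)) {
		/* Optional feature: omit the capability, keep the ioctl working. */
		caps->has_nesting_cap = 0;
		return 0;
	}

	caps->stage1_format = format;
	caps->has_nesting_cap = 1;
	return 0;
}
```

Existing users on hardware without the attribute then see the same result
as before the patch, just without the extra capability.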

Thanks,
Yi Liu



* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03 11:56     ` Liu, Yi L
@ 2020-04-03 12:39       ` Jean-Philippe Brucker
  2020-04-03 12:44         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-03 12:39 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: alex.williamson, eric.auger, Tian, Kevin, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Fri, Apr 03, 2020 at 11:56:09AM +0000, Liu, Yi L wrote:
> > >  /**
> > > + * VFIO_MM objects - create, release, get, put, search
> > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > + */
> > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > +{
> > > +	struct vfio_mm *vmm;
> > > +	struct vfio_mm_token *token;
> > > +	int ret = 0;
> > > +
> > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > +	if (!vmm)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	/* Per mm IOASID set used for quota control and group operations */
> > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > 
> > Hmm, either we need to change the token of ioasid_alloc_set() to "void *",
> > or pass an actual ioasid_set struct, but this cast doesn't look good :)
> >
> > As I commented on the IOASID series, I think we could embed a struct
> > ioasid_set into vfio_mm, pass that struct to all other ioasid_* functions,
> > and get rid of ioasid_sid.
> 
> I think changing to "void *" is better, as we need the token to ensure all
> threads within a single VM share the same ioasid_set.

Don't they share the same vfio_mm?

Thanks,
Jean
> 
> > > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > > +	if (ret) {
> > > +		kfree(vmm);
> > > +		return ERR_PTR(ret);
> > > +	}
> > > +
> > > +	kref_init(&vmm->kref);
> > > +	token = &vmm->token;
> > > +	token->val = mm;
> > 
> > Why the intermediate token struct?  Could we just store the mm_struct
> > pointer within vfio_mm?
> 
> Hmm, here we only want to use the pointer as a token, instead of using
> the structure behind the pointer. If we store the mm_struct pointer
> directly, it leaves room for later code to use its content, which is
> not good.
> 
> Regards,
> Yi Liu
> 


* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03 12:39       ` Jean-Philippe Brucker
@ 2020-04-03 12:44         ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03 12:44 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: alex.williamson, eric.auger, Tian, Kevin, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Friday, April 3, 2020 8:40 PM
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Fri, Apr 03, 2020 at 11:56:09AM +0000, Liu, Yi L wrote:
> > > >  /**
> > > > + * VFIO_MM objects - create, release, get, put, search
> > > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > > + */
> > > > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > > > +{
> > > > +	struct vfio_mm *vmm;
> > > > +	struct vfio_mm_token *token;
> > > > +	int ret = 0;
> > > > +
> > > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > > +	if (!vmm)
> > > > +		return ERR_PTR(-ENOMEM);
> > > > +
> > > > +	/* Per mm IOASID set used for quota control and group operations */
> > > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > >
> > > Hmm, either we need to change the token of ioasid_alloc_set() to
> > > "void *", or pass an actual ioasid_set struct, but this cast doesn't
> > > look good :)
> > >
> > > As I commented on the IOASID series, I think we could embed a struct
> > > ioasid_set into vfio_mm, pass that struct to all other ioasid_*
> > > functions, and get rid of ioasid_sid.
> >
> > I think changing to "void *" is better, as we need the token to ensure
> > all threads within a single VM share the same ioasid_set.
> 
> Don't they share the same vfio_mm?

That's right. Then both work well for me.
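
For reference, a minimal userspace sketch of the embedding option: the set
lives inside the per-VM object, the mm pointer is kept only as an opaque
token that is compared and never dereferenced, and the owner is recovered
from the set with the usual container_of arithmetic. All type and function
names here are invented for illustration; the real ioasid_set and vfio_mm
layouts are not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for an ioasid_set embedded in the per-VM object. */
struct ioasid_set_sketch {
	int quota;
};

/* Stand-in for vfio_mm: one per VM, shared by all of its threads. */
struct vfio_mm_sketch {
	const void *token;             /* mm pointer, used for comparison only */
	struct ioasid_set_sketch set;  /* embedded, so no separate set ID is needed */
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the owning per-VM object from a pointer to its embedded set. */
static struct vfio_mm_sketch *vmm_from_set_sketch(struct ioasid_set_sketch *set)
{
	return container_of_sketch(set, struct vfio_mm_sketch, set);
}

/* Two callers belong to the same VM when their tokens match. */
static int vmm_same_vm_sketch(const struct vfio_mm_sketch *vmm,
			      const void *mm_token)
{
	return vmm->token == mm_token;
}
```

Because every thread of a VM resolves to the same per-VM object, they
share the embedded set without needing a cast of the mm pointer.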

Regards,
Yi Liu


* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-02 17:50   ` Alex Williamson
  2020-04-03  5:58     ` Tian, Kevin
@ 2020-04-03 13:12     ` Liu, Yi L
  2020-04-03 17:50       ` Alex Williamson
  1 sibling, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03 13:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 1:50 AM
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Sun, 22 Mar 2020 05:31:58 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > For a long time, devices have only one DMA address space from platform
> > IOMMU's point of view. This is true for both bare metal and directed-
> > access in virtualization environment. Reason is the source ID of DMA in
> > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > DMA isolation. However, this is changing with the latest advancement in
> > I/O technology area. More and more platform vendors are utilizing the PCIe
> > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > address spaces as identified by their individual PASIDs. For example,
> > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > let device access multiple process virtual address space by binding the
> > virtual address space with a PASID. Wherein the PASID is allocated in
> > software and programmed to device per device specific manner. Devices
> > which support PASID capability are called PASID-capable devices. If such
> > devices are passed through to VMs, guest software are also able to bind
> > guest process virtual address space on such devices. Therefore, the guest
> > software could reuse the bare metal software programming model, which
> > means guest software will also allocate PASID and program it to device
> > directly. This is a dangerous situation since it has potential PASID
> > conflicts and unauthorized address space access. It would be safer to
> > let host intercept in the guest software's PASID allocation. Thus PASID
> > are managed system-wide.
> 
> Providing an allocation interface only allows for collaborative usage
> of PASIDs though.  Do we have any ability to enforce PASID usage or can
> a user spoof other PASIDs on the same BDF?
> 
> > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > requests are intended to be invoked by QEMU or other applications which
> > are running in userspace, it is necessary to have a mechanism to prevent
> > single application from abusing available PASIDs in system. With such
> > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > has many assigned devices, then it should have more quota. However, it
> > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > possible that a VM with multiples assigned devices but requests less
> > PASIDs. Therefore per-VM quota would be better.
> >
> > This patch uses struct mm pointer as a per-VM token. We also considered
> > using task structure pointer and vfio_iommu structure pointer. However,
> > task structure is per-thread, which means it cannot achieve per-VM PASID
> > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > only within vfio. Therefore, structure mm pointer is selected. This patch
> > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > container is opened by a VM. On the reverse order, vfio_mm is free when
> > the last vfio container is released. Each VM is assigned with a PASID
> > quota, so that it is not able to request PASID beyond its quota. This
> > patch adds a default quota of 1000. This quota could be tuned by
> > administrator. Making PASID quota tunable will be added in another patch
> > in this series.
> >
> > Previous discussions:
> > https://patchwork.kernel.org/patch/11209429/
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h            |  20 +++++++
> >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> >  4 files changed, 295 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index c848262..d13b483 100644
> > --- a/drivers/vfio/vfio.c
> > +++ b/drivers/vfio/vfio.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/vfio.h>
> >  #include <linux/wait.h>
> >  #include <linux/sched/signal.h>
> > +#include <linux/sched/mm.h>
> >
> >  #define DRIVER_VERSION	"0.3"
> >  #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> > @@ -46,6 +47,8 @@ static struct vfio {
> >  	struct mutex			group_lock;
> >  	struct cdev			group_cdev;
> >  	dev_t				group_devt;
> > +	struct list_head		vfio_mm_list;
> > +	struct mutex			vfio_mm_lock;
> >  	wait_queue_head_t		release_q;
> >  } vfio;
> >
> > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum vfio_notify_type type,
> >  EXPORT_SYMBOL(vfio_unregister_notifier);
> >
> >  /**
> > + * VFIO_MM objects - create, release, get, put, search
> > + * Caller of the function should have held vfio.vfio_mm_lock.
> > + */
> > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > +{
> > +	struct vfio_mm *vmm;
> > +	struct vfio_mm_token *token;
> > +	int ret = 0;
> > +
> > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > +	if (!vmm)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/* Per mm IOASID set used for quota control and group operations */
> > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > +	if (ret) {
> > +		kfree(vmm);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	kref_init(&vmm->kref);
> > +	token = &vmm->token;
> > +	token->val = mm;
> > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > +	mutex_init(&vmm->pasid_lock);
> > +
> > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > +
> > +	return vmm;
> > +}
> > +
> > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > +{
> > +	/* destroy the ioasid set */
> > +	ioasid_free_set(vmm->ioasid_sid, true);
> > +	mutex_unlock(&vfio.vfio_mm_lock);
> > +	kfree(vmm);
> > +}
> > +
> > +/* called with vfio.vfio_mm_lock held */
> > +static void vfio_mm_release(struct kref *kref)
> > +{
> > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > +
> > +	list_del(&vmm->vfio_next);
> > +	vfio_mm_unlock_and_free(vmm);
> > +}
> > +
> > +void vfio_mm_put(struct vfio_mm *vmm)
> > +{
> > +	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > +
> > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > +static void vfio_mm_get(struct vfio_mm *vmm)
> > +{
> > +	kref_get(&vmm->kref);
> > +}
> > +
> > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > +{
> > +	struct mm_struct *mm = get_task_mm(task);
> > +	struct vfio_mm *vmm;
> > +	unsigned long long val = (unsigned long long) mm;
> > +
> > +	mutex_lock(&vfio.vfio_mm_lock);
> > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > +		if (vmm->token.val == val) {
> > +			vfio_mm_get(vmm);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	vmm = vfio_create_mm(mm);
> > +	if (IS_ERR(vmm))
> > +		vmm = NULL;
> > +out:
> > +	mutex_unlock(&vfio.vfio_mm_lock);
> > +	mmput(mm);
> > +	return vmm;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > +
> > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > +{
> > +	ioasid_t pasid;
> > +	int ret = -ENOSPC;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +
> > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > +	if (pasid == INVALID_IOASID) {
> > +		ret = -ENOSPC;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = pasid;
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > +
> > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > +{
> > +	void *pdata;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vmm->pasid_lock);
> > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > +	if (IS_ERR(pdata)) {
> > +		ret = PTR_ERR(pdata);
> > +		goto out_unlock;
> > +	}
> > +	ioasid_free(pasid);
> > +
> > +out_unlock:
> > +	mutex_unlock(&vmm->pasid_lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > +
> > +/**
> >   * Module/class support
> >   */
> >  static char *vfio_devnode(struct device *dev, umode_t *mode)
> > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> >  	idr_init(&vfio.group_idr);
> >  	mutex_init(&vfio.group_lock);
> >  	mutex_init(&vfio.iommu_drivers_lock);
> > +	mutex_init(&vfio.vfio_mm_lock);
> >  	INIT_LIST_HEAD(&vfio.group_list);
> >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> >  	init_waitqueue_head(&vfio.release_q);
> >
> >  	ret = misc_register(&vfio_dev);
> 
> Is vfio.c the right place for any of the above?  It seems like it could
> all be in a separate vfio_pasid module, similar to our virqfd module.

I think it could be a separate vfio_pasid module. Let me do that in the
next version if that's your preference. :-)

> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index a177bf2..331ceee 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -70,6 +70,7 @@ struct vfio_iommu {
> >  	unsigned int		dma_avail;
> >  	bool			v2;
> >  	bool			nesting;
> > +	struct vfio_mm		*vmm;
> >  };
> >
> >  struct vfio_domain {
> > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  static void *vfio_iommu_type1_open(unsigned long arg)
> >  {
> >  	struct vfio_iommu *iommu;
> > +	struct vfio_mm *vmm = NULL;
> >
> >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >  	if (!iommu)
> > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  	iommu->dma_avail = dma_entry_limit;
> >  	mutex_init(&iommu->lock);
> >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > +	vmm = vfio_mm_get_from_task(current);
> > +	if (!vmm)
> > +		pr_err("Failed to get vfio_mm track\n");
> 
> Doesn't this presume everyone is instantly running PASID capable hosts?
> Looks like a noisy support regression to me.

Right, it is. Kevin also questioned this part; I'll refine it to avoid
the noisy regression.

> > +	iommu->vmm = vmm;
> >
> >  	return iommu;
> >  }
> > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >  	}
> >
> >  	vfio_iommu_iova_free(&iommu->iova_list);
> > +	if (iommu->vmm)
> > +		vfio_mm_put(iommu->vmm);
> >
> >  	kfree(iommu);
> >  }
> > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct vfio_iommu *iommu,
> >  	return ret;
> >  }
> >
> > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > +{
> > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > +		  flags & VFIO_IOMMU_PASID_FREE));
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > +					 int min,
> > +					 int max)
> > +{
> > +	struct vfio_mm *vmm = iommu->vmm;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EFAULT;
> > +		goto out_unlock;
> > +	}
> 
> Non-iommu backed mdevs are excluded from this?  Is this a matter of
> wiring the call out through the mdev parent device, or is this just
> possible?

At the beginning, non-iommu backed mdevs were excluded. However,
combined with your following comment, I think this check should be
removed, as the PASID alloc/free capability should be available as
long as the container is backed by a PASID-capable iommu backend.
So I'll remove it, and the same goes for the free path.

> 
> > +	if (vmm)
> > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > +	else
> > +		ret = -EINVAL;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > +				       unsigned int pasid)
> > +{
> > +	struct vfio_mm *vmm = iommu->vmm;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EFAULT;
> > +		goto out_unlock;
> > +	}
> 
> So if a container had an iommu backed device when the pasid was
> allocated, but it was removed, now they can't free it?  Why do we need
> the check above?

It should be removed. Thanks for spotting it.

> > +
> > +	if (vmm)
> > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > +	else
> > +		ret = -EINVAL;
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >
> >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> >  			-EFAULT : 0;
> > +
> > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > +		struct vfio_iommu_type1_pasid_request req;
> > +		unsigned long offset;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > +				    flags);
> > +
> > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (req.argsz < minsz ||
> > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > +			return -EINVAL;
> > +
> > +		if (copy_from_user((void *)&req + minsz,
> > +				   (void __user *)arg + minsz,
> > +				   sizeof(req) - minsz))
> > +			return -EFAULT;
> 
> Huh?  Why do we have argsz if we're going to assume this is here?

Do you mean replacing sizeof(req) with argsz? If yes, I can do that.
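
To illustrate the argsz concern, here is a minimal userspace sketch (not
the patch itself) in which the payload is read only once argsz is known
to cover it. memcpy() stands in for copy_from_user(), and struct
pasid_request, PASID_ALLOC/PASID_FREE and fetch_pasid_request() are
hypothetical names that mirror the proposed layout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified mirror of the proposed request layout (illustration only). */
struct pasid_request {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct {
			uint32_t min;
			uint32_t max;
			uint32_t result;
		} alloc_pasid;
		uint32_t free_pasid;
	};
};

#define PASID_ALLOC (1u << 0)
#define PASID_FREE  (1u << 1)

#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* The fixed header is argsz followed by flags: 8 bytes. */
static_assert(offsetofend(struct pasid_request, flags) == 8,
	      "unexpected header layout");

/*
 * Hypothetical helper: copy a request out of a user buffer, reading the
 * payload only after checking that argsz covers it. memcpy() stands in
 * for copy_from_user().
 */
static int fetch_pasid_request(struct pasid_request *req, const void *ubuf)
{
	size_t minsz = offsetofend(struct pasid_request, flags);
	size_t need;

	memset(req, 0, sizeof(*req));
	memcpy(req, ubuf, minsz);
	if (req->argsz < minsz)
		return -EINVAL;

	/* The payload size depends on which operation was requested. */
	switch (req->flags) {
	case PASID_ALLOC:
		need = offsetofend(struct pasid_request, alloc_pasid);
		break;
	case PASID_FREE:
		need = offsetofend(struct pasid_request, free_pasid);
		break;
	default:
		return -EINVAL;
	}

	/* Refuse to read past what the caller said it provided. */
	if (req->argsz < need)
		return -EINVAL;

	memcpy((char *)req + minsz, (const char *)ubuf + minsz, need - minsz);
	return 0;
}
```

In this sketch a free request only needs argsz to cover 12 bytes, while
an alloc request needs 20; a caller that passes a shorter argsz gets
-EINVAL instead of having unvalidated memory read.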

> > +
> > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > +		case VFIO_IOMMU_PASID_ALLOC:
> > +		{
> > +			int ret = 0, result;
> > +
> > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > +							req.alloc_pasid.min,
> > +							req.alloc_pasid.max);
> > +			if (result > 0) {
> > +				offset = offsetof(
> > +					struct vfio_iommu_type1_pasid_request,
> > +					alloc_pasid.result);
> > +				ret = copy_to_user(
> > +					      (void __user *) (arg + offset),
> > +					      &result, sizeof(result));
> 
> Again assuming argsz supports this.

same as above.

> 
> > +			} else {
> > +				pr_debug("%s: PASID alloc failed\n", __func__);
> 
> rate limit?

I don't quite get it. Could you give more hints?

> > +				ret = -EFAULT;
> > +			}
> > +			return ret;
> > +		}
> > +		case VFIO_IOMMU_PASID_FREE:
> > +			return vfio_iommu_type1_pasid_free(iommu,
> > +							   req.free_pasid);
> > +		default:
> > +			return -EINVAL;
> > +		}
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index e42a711..75f9f7f1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> >  extern void vfio_unregister_iommu_driver(
> >  				const struct vfio_iommu_driver_ops *ops);
> >
> > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > +struct vfio_mm_token {
> > +	unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > +	struct kref			kref;
> > +	struct vfio_mm_token		token;
> > +	int				ioasid_sid;
> > +	/* protect @pasid_quota field and pasid allocation/free */
> > +	struct mutex			pasid_lock;
> > +	int				pasid_quota;
> > +	struct list_head		vfio_next;
> > +};
> > +
> > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > +
> >  /*
> >   * External user API
> >   */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a1..298ac80 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> >
> > +/*
> > + * PASID (Process Address Space ID) is a PCIe concept which
> > + * has been extended to support DMA isolation in fine-grain.
> > + * With device assigned to user space (e.g. VMs), PASID alloc
> > + * and free need to be system wide. This structure defines
> > + * the info for pasid alloc/free between user space and kernel
> > + * space.
> > + *
> > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > + */
> > +struct vfio_iommu_type1_pasid_request {
> > +	__u32	argsz;
> > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > +	__u32	flags;
> > +	union {
> > +		struct {
> > +			__u32 min;
> > +			__u32 max;
> > +			__u32 result;
> > +		} alloc_pasid;
> > +		__u32 free_pasid;
> > +	};
> 
> We seem to be using __u8 data[] lately where the struct at data is
> defined by the flags.  should we do that here?

Yeah, I can do that. BTW, do you want the structure in the later
patch to share the same definition as this one? As far as I can see,
the two structures would look similar, as both of them include
argsz, flags and data[] fields. The difference is the definition of
flags. What is your opinion?

struct vfio_iommu_type1_pasid_request {
	__u32	argsz;
#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
#define VFIO_IOMMU_PASID_FREE	(1 << 1)
	__u32	flags;
	__u8	data[];
};

struct vfio_iommu_type1_bind {
        __u32           argsz;
        __u32           flags;
#define VFIO_IOMMU_BIND_GUEST_PGTBL     (1 << 0)
#define VFIO_IOMMU_UNBIND_GUEST_PGTBL   (1 << 1)
        __u8            data[];
};

> 
> > +};
> > +
> > +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
> > +					 VFIO_IOMMU_PASID_FREE)
> > +
> > +/**
> > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > + *				struct vfio_iommu_type1_pasid_request)
> > + *
> > + * Availability of this feature depends on PASID support in the device,
> > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > + * is available after VFIO_SET_IOMMU.
> > + *
> > + * returns: 0 on success, -errno on failure.
> > + */
> > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
> 
> So a user needs to try to allocate a PASID in order to test for the
> support?  Should we have a PROBE flag?

Answered in a later patch. :-)

Regards,
Yi Liu



* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-02 19:57   ` Alex Williamson
@ 2020-04-03 13:30     ` Liu, Yi L
  2020-04-03 18:11       ` Alex Williamson
  2020-04-11  5:52     ` Liu, Yi L
  1 sibling, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03 13:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 3:57 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> 
> On Sun, 22 Mar 2020 05:32:03 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by hardware
> > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > translation). For such hardware IOMMUs, there are two stages/levels of
> > address translation, and software may let userspace/VM to own the first-
> > level/stage-1 translation structures. Example of such usage is vSVA (
> > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > translation structures and bind the structures to host, then hardware
> > IOMMU would utilize nesting translation when doing DMA translation fo
> > the devices behind such hardware IOMMU.
> >
> > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only bind
> > guest page table is needed, it also requires to expose interface to guest
> > for iommu cache invalidation when guest modified the first-level/stage-1
> > translation structures since hardware needs to be notified to flush stale
> > iotlbs. This would be introduced in next patch.
> >
> > In this patch, guest page table bind and unbind are done by using flags
> > VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL
> > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > VM should have got a PASID allocated by host via VFIO_IOMMU_PASID_REQUEST.
> >
> > Bind guest translation structures (here is guest page table) to host
> > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++++++++++++++++++++++
> >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> >  2 files changed, 204 insertions(+)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 82a9e0b..a877747 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -130,6 +130,33 @@ struct vfio_regions {
> >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> >  					(!list_empty(&iommu->domain_list))
> >
> > +struct domain_capsule {
> > +	struct iommu_domain *domain;
> > +	void *data;
> > +};
> > +
> > +/* iommu->lock must be held */
> > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > +		      int (*fn)(struct device *dev, void *data),
> > +		      void *data)
> > +{
> > +	struct domain_capsule dc = {.data = data};
> > +	struct vfio_domain *d;
> > +	struct vfio_group *g;
> > +	int ret = 0;
> > +
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		dc.domain = d->domain;
> > +		list_for_each_entry(g, &d->group_list, next) {
> > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > +						       &dc, fn);
> > +			if (ret)
> > +				break;
> > +		}
> > +	}
> > +	return ret;
> > +}
> > +
> >  static int put_pfn(unsigned long pfn, int prot);
> >
> >  /*
> > @@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> >  	return 0;
> >  }
> >
> > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *gbind_data =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +}
> > +
> > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > +{
> > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > +	struct iommu_gpasid_bind_data *gbind_data =
> > +		(struct iommu_gpasid_bind_data *) dc->data;
> > +
> > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +					gbind_data->hpasid);
> > +}
> > +
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	return vfio_iommu_for_each_dev(iommu,
> > +				vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_for_each_dev(iommu,
> > +			vfio_bind_gpasid_fn, gbind_data);
> > +	/*
> > +	 * If bind failed, it may not be a total failure. Some devices
> > +	 * within the iommu group may have bind successfully. Although
> > +	 * we don't enable pasid capability for non-singletion iommu
> > +	 * groups, a unbind operation would be helpful to ensure no
> > +	 * partial binding for an iommu group.
> 
> Where was the non-singleton group restriction done, I missed that.

Hmm, it was missed. Thanks for spotting it. How about adding this
check in vfio_iommu_for_each_dev()? If the loop hits a non-singleton
group, just skip it. The same applies to the cache_inv path.
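
A rough sketch of the "skip non-singleton groups" idea, using
hypothetical stand-in types (struct mock_dev, struct mock_group) rather
than the real iommu_group objects. Whether skipping silently is
preferable to returning an error is left open here.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real types. */
struct mock_dev {
	int id;
};

struct mock_group {
	struct mock_dev *devs;
	size_t ndevs;
};

typedef int (*dev_fn)(struct mock_dev *dev, void *data);

/*
 * Call fn on the device of every singleton group. Groups that hold more
 * than one device (or none) are skipped. Stops at the first error
 * returned by fn.
 */
static int for_each_singleton_dev(struct mock_group *groups, size_t ngroups,
				  dev_fn fn, void *data)
{
	size_t i;
	int ret;

	for (i = 0; i < ngroups; i++) {
		if (groups[i].ndevs != 1)
			continue;	/* not a singleton group */
		ret = fn(&groups[i].devs[0], data);
		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: sum the ids of the devices that were visited. */
static int sum_ids(struct mock_dev *dev, void *data)
{
	*(int *)data += dev->id;
	return 0;
}
```

With three groups of sizes 1, 2 and 1, only the first and third are
visited by the callback.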

> > +	 */
> > +	if (ret)
> > +		/*
> > +		 * Undo all binds that already succeeded, no need to
> > +		 * check the return value here since some device within
> > +		 * the group has no successful bind when coming to this
> > +		 * place switch.
> > +		 */
> > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> 
> However, the for_each_dev function stops when the callback function
> returns error, are we just assuming we stop at the same device as we
> faulted on the first time and that we traverse the same set of devices
> the second time?  It seems strange to me that unbind should be able to
> fail.

Unbind can fail if a user attempts to unbind a pasid which does not
belong to it, or a pasid which has never been bound. Otherwise, I don't
see a reason for it to fail.
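
One way to avoid relying on a second traversal stopping at the same
place is to undo exactly the binds made by the failing call. The
following is a userspace sketch under that assumption; struct mock_dev,
mock_bind(), mock_unbind() and bind_all_or_none() are hypothetical and
do not correspond to existing kernel functions.

```c
#include <stddef.h>

/* Hypothetical device model: bind fails for devices flagged as broken. */
struct mock_dev {
	int broken;
	int bound;
};

static int mock_bind(struct mock_dev *dev)
{
	if (dev->broken)
		return -1;
	dev->bound = 1;
	return 0;
}

static void mock_unbind(struct mock_dev *dev)
{
	dev->bound = 0;
}

/*
 * Bind every device. On failure, unbind only the devices bound by this
 * call, walking back from the point of failure, so the rollback cannot
 * touch a device that was never bound.
 */
static int bind_all_or_none(struct mock_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = mock_bind(&devs[i]);
		if (ret)
			break;
	}

	if (ret) {
		/* devs[0..i-1] were bound successfully; undo them. */
		while (i-- > 0)
			mock_unbind(&devs[i]);
	}
	return ret;
}
```

Because the rollback is driven by the index where the bind loop stopped,
it never issues an unbind for a device whose bind did not succeed.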

> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> 
> How is a user supposed to respond to their unbind failing?

It would only fail for a malicious unbind (e.g. unbinding a pasid that
is not yet bound, or one which doesn't belong to the current user).

> > +
> > +out_unlock:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  				   unsigned int cmd, unsigned long arg)
> >  {
> > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  		default:
> >  			return -EINVAL;
> >  		}
> > +
> > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > +		struct vfio_iommu_type1_bind bind;
> > +		u32 version;
> > +		int data_size;
> > +		void *gbind_data;
> > +		int ret;
> > +
> > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > +
> > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > +			return -EFAULT;
> > +
> > +		if (bind.argsz < minsz)
> > +			return -EINVAL;
> > +
> > +		/* Get the version of struct iommu_gpasid_bind_data */
> > +		if (copy_from_user(&version,
> > +			(void __user *) (arg + minsz),
> > +					sizeof(version)))
> > +			return -EFAULT;
> 
> Why are we coping things from beyond the size we've validated that the
> user has provided again?

Let me wait for the result in Jacob's thread below. It looks like we
need a decision from you and Joerg. If using argsz is good, then
I guess we don't need the version-to-size mapping, right? Actually,
the version-to-size mapping was added to ensure vfio copies data correctly.
https://lkml.org/lkml/2020/4/2/876
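
For reference, a sketch of how a version-to-size mapping and an argsz
check could be combined, so that the copy is bounded by both. The table
contents and the payload_size() helper are hypothetical; the real sizes
would come from the IOMMU UAPI.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical payload sizes indexed by data version. Index 0 is left
 * unused so that version 0 is rejected.
 */
static const size_t bind_data_size[] = {
	[1] = 64,
	[2] = 80,
};

#define NR_VERSIONS (sizeof(bind_data_size) / sizeof(bind_data_size[0]))

/*
 * Return the number of payload bytes to copy for the given version, or
 * a negative errno if the version is unknown or argsz does not cover
 * header (minsz) plus payload.
 */
static long payload_size(uint32_t argsz, size_t minsz, uint32_t version)
{
	size_t size;

	if (version == 0 || version >= NR_VERSIONS)
		return -EINVAL;

	size = bind_data_size[version];
	if (argsz < minsz + size)
		return -EINVAL;

	return (long)size;
}
```

A caller would also need to confirm that argsz covers the version field
itself before reading it; that step is omitted from the sketch.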

> > +
> > +		data_size = iommu_uapi_get_data_size(
> > +				IOMMU_UAPI_BIND_GPASID, version);
> > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > +		if (!gbind_data)
> > +			return -ENOMEM;
> > +
> > +		if (copy_from_user(gbind_data,
> > +			 (void __user *) (arg + minsz), data_size)) {
> > +			kfree(gbind_data);
> > +			return -EFAULT;
> > +		}
> 
> And again.  argsz isn't just for minsz.
>
> > +
> > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > +							   gbind_data);
> > +			break;
> > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > +							     gbind_data);
> > +			break;
> > +		default:
> > +			ret = -EINVAL;
> > +			break;
> > +		}
> > +		kfree(gbind_data);
> > +		return ret;
> >  	}
> >
> >  	return -ENOTTY;
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index ebeaf3e..2235bc6 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -14,6 +14,7 @@
> >
> >  #include <linux/types.h>
> >  #include <linux/ioctl.h>
> > +#include <linux/iommu.h>
> >
> >  #define VFIO_API_VERSION	0
> >
> > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> >   */
> >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
> >
> > +/**
> > + * Supported flags:
> > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
> > + *			nesting type IOMMUs. In @data field It takes struct
> > + *			iommu_gpasid_bind_data.
> > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
> > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> 
> This must require iommu_gpasid_bind_data in the data field as well,
> right?

yes.

Regards,
Yi Liu



* RE: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs
  2020-04-02 20:33   ` Alex Williamson
@ 2020-04-03 13:39     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-03 13:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 4:34 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed
> mdevs
> 
> On Sun, 22 Mar 2020 05:32:05 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > Recent years, mediated device pass-through framework (e.g. vfio-mdev)
> > are used to achieve flexible device sharing across domains (e.g. VMs).
> > Also there are hardware assisted mediated pass-through solutions from
> > platform vendors. e.g. Intel VT-d scalable mode which supports Intel
> > Scalable I/O Virtualization technology. Such mdevs are called IOMMU-
> > backed mdevs as there are IOMMU enforced DMA isolation for such mdevs.
> > In kernel, IOMMU-backed mdevs are exposed to IOMMU layer by aux-domain
> > concept, which means mdevs are protected by an iommu domain which is
> > aux-domain of its physical device. Details can be found in the KVM
> > presentation from Kevin Tian. IOMMU-backed equals to IOMMU-capable.
> >
> > https://events19.linuxfoundation.org/wp-content/uploads/2017/12/\
> > Hardware-Assisted-Mediated-Pass-Through-with-VFIO-Kevin-Tian-Intel.pdf
> >
> > This patch supports NESTING IOMMU for IOMMU-backed mdevs by figuring
> > out the physical device of an IOMMU-backed mdev and then invoking IOMMU
> > requests to IOMMU layer with the physical device and the mdev's aux
> > domain info.
> >
> > With this patch, vSVA (Virtual Shared Virtual Addressing) can be used
> > on IOMMU-backed mdevs.
> >
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > CC: Jun Tian <jun.j.tian@intel.com>
> > Cc: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Eric Auger <eric.auger@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++++++---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 937ec3f..d473665 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -132,6 +132,7 @@ struct vfio_regions {
> >
> >  struct domain_capsule {
> >  	struct iommu_domain *domain;
> > +	struct vfio_group *group;
> >  	void *data;
> >  };
> >
> > @@ -148,6 +149,7 @@ static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> >  	list_for_each_entry(d, &iommu->domain_list, next) {
> >  		dc.domain = d->domain;
> >  		list_for_each_entry(g, &d->group_list, next) {
> > +			dc.group = g;
> >  			ret = iommu_group_for_each_dev(g->iommu_group,
> >  						       &dc, fn);
> >  			if (ret)
> > @@ -2347,7 +2349,12 @@ static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> >  	struct iommu_gpasid_bind_data *gbind_data =
> >  		(struct iommu_gpasid_bind_data *) dc->data;
> >
> > -	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > +	if (dc->group->mdev_group)
> > +		return iommu_sva_bind_gpasid(dc->domain,
> > +			vfio_mdev_get_iommu_device(dev), gbind_data);
> 
> But we can't assume an mdev device is iommu backed, so this can call
> with NULL dev, which appears will pretty quickly segfault
> intel_svm_bind_gpasid.

I think a non-iommu backed mdev will not be in the
iommu->domain_list, right? But, yeah, from this function's point of
view, it is still necessary to do a check. How about adding a check
on the return value of vfio_mdev_get_iommu_device(dev)? If an
iommu_device is returned, the mdev should be iommu-backed. Does that
make sense?
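
A sketch of the proposed check, written against hypothetical stand-ins
(struct mock_device, struct mock_mdev, get_iommu_device(), do_bind())
so it can be compiled in userspace. The choice of -ENODEV is an
assumption for illustration, not something settled in this thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device. */
struct mock_device {
	const char *name;
};

/* Hypothetical mdev: iommu_device is NULL when not IOMMU-backed. */
struct mock_mdev {
	struct mock_device *iommu_device;
};

/* Stand-in for vfio_mdev_get_iommu_device(); may return NULL. */
static struct mock_device *get_iommu_device(struct mock_mdev *mdev)
{
	return mdev->iommu_device;
}

/* Stand-in for the IOMMU-layer bind call; dereferences its argument. */
static int do_bind(struct mock_device *dev)
{
	return dev->name ? 0 : -EINVAL;
}

static int bind_mdev(struct mock_mdev *mdev)
{
	struct mock_device *dev = get_iommu_device(mdev);

	/* Not IOMMU-backed: fail cleanly instead of passing NULL down. */
	if (!dev)
		return -ENODEV;

	return do_bind(dev);
}
```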

Regards,
Yi Liu

> 
> > +	else
> > +		return iommu_sva_bind_gpasid(dc->domain,
> > +						dev, gbind_data);
> >  }
> >
> >  static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > @@ -2356,8 +2363,13 @@ static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> >  	struct iommu_gpasid_bind_data *gbind_data =
> >  		(struct iommu_gpasid_bind_data *) dc->data;
> >
> > -	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +	if (dc->group->mdev_group)
> > +		return iommu_sva_unbind_gpasid(dc->domain,
> > +					vfio_mdev_get_iommu_device(dev),
> >  					gbind_data->hpasid);
> 
> Same
> 
> > +	else
> > +		return iommu_sva_unbind_gpasid(dc->domain, dev,
> > +						gbind_data->hpasid);
> >  }
> >
> >  /**
> > @@ -2429,7 +2441,12 @@ static int vfio_cache_inv_fn(struct device *dev, void *data)
> >  	struct iommu_cache_invalidate_info *cache_inv_info =
> >  		(struct iommu_cache_invalidate_info *) dc->data;
> >
> > -	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > +	if (dc->group->mdev_group)
> > +		return iommu_cache_invalidate(dc->domain,
> > +			vfio_mdev_get_iommu_device(dev), cache_inv_info);
> 
> And again
> 
> > +	else
> > +		return iommu_cache_invalidate(dc->domain,
> > +						dev, cache_inv_info);
> >  }
> >
> >  static long vfio_iommu_type1_ioctl(void *iommu_data,



* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03  5:58     ` Tian, Kevin
@ 2020-04-03 15:14       ` Alex Williamson
  2020-04-07  4:42         ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-03 15:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Fri, 3 Apr 2020 05:58:55 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 1:50 AM
> > 
> > On Sun, 22 Mar 2020 05:31:58 -0700
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model, which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.  
> > 
> > Providing an allocation interface only allows for collaborative usage
> > of PASIDs though.  Do we have any ability to enforce PASID usage or can
> > a user spoof other PASIDs on the same BDF?  
> 
> An user can access only PASIDs allocated to itself, i.e. the specific IOASID
> set tied to its mm_struct.

A user is only _supposed_ to access PASIDs allocated to itself.  AIUI
the mm_struct is used for managing the pool of IOASIDs from which the
user may allocate that PASID.  We also state that programming the PASID
into the device is device specific.  Therefore, are we simply trusting
the user to use a PASID that's been allocated to them when they program
the device?  If a user can program an arbitrary PASID into the device,
then what prevents them from attempting to access data from another
user via the device?   I think I've asked this question before, so if
there's a previous explanation or spec section I need to review, please
point me to it.  Thanks,

Alex



* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-03  6:39     ` Tian, Kevin
@ 2020-04-03 15:31       ` Jacob Pan
  2020-04-03 15:34       ` Alex Williamson
  1 sibling, 0 replies; 110+ messages in thread
From: Jacob Pan @ 2020-04-03 15:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, eric.auger, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao, jacob.jun.pan

On Fri, 3 Apr 2020 06:39:22 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 4:24 AM
> > 
> > On Sun, 22 Mar 2020 05:32:04 -0700
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > From: Liu Yi L <yi.l.liu@linux.intel.com>
> > >
> > > For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest "owns" the
> > > first-level/stage-1 translation structures, the host IOMMU driver
> > > has no knowledge of first-level/stage-1 structure cache updates
> > > unless the guest invalidation requests are trapped and propagated
> > > to the host.
> > >
> > > This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to propagate guest
> > > first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU cache
> > > correctness.
> > >
> > > With this patch, vSVA (Virtual Shared Virtual Addressing) can be
> > > used safely as the host IOMMU iotlb correctness are ensured.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
> > >  2 files changed, 71 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index a877747..937ec3f 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2423,6 +2423,15 @@ static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > >  	return ret;
> > >  }
> > >
> > > +static int vfio_cache_inv_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_cache_invalidate_info *cache_inv_info =
> > > +		(struct iommu_cache_invalidate_info *) dc->data;
> > > +
> > > +	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  		}
> > >  		kfree(gbind_data);
> > >  		return ret;
> > > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > > +		struct vfio_iommu_type1_cache_invalidate cache_inv;
> > > +		u32 version;
> > > +		int info_size;
> > > +		void *cache_info;
> > > +		int ret;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_cache_invalidate,
> > > +				    flags);
> > 
> > This breaks backward compatibility as soon as struct
> > iommu_cache_invalidate_info changes size by its defined versioning
> > scheme.  ie. a field gets added, the version is bumped, all existing
> > userspace breaks.  Our minsz is offsetofend to the version field,
> > interpret the version to size, then reevaluate argsz.  
> 
> btw the version scheme is challenged by Christoph Hellwig. After
> some discussions, we need your guidance how to move forward.
> Jacob summarized available options below:
> 	https://lkml.org/lkml/2020/4/2/876
> 
For this particular case, I don't quite get the difference between
minsz=flags and minsz=version. Our IOMMU version scheme will only
change size at the end, where the variable union is used for vendor
specific extensions. A version bump does not change the size (it only
re-purposes padding) from the start of the UAPI structure to the union,
i.e. version will __always__ be right after
struct vfio_iommu_type1_cache_invalidate.flags.
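
The layout claim can be written down as a compile-time check. The
structures below are simplified, hypothetical mirrors of the ones under
discussion (the vendor-specific union is left out), so this is only a
sketch of the argument, not the real UAPI definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified mirror of the invalidation info; union omitted. */
struct cache_invalidate_info {
	uint32_t version;
	uint8_t  cache;
	uint8_t  granularity;
	uint8_t  padding[2];
	/* a vendor-specific union would follow here */
};

/* Simplified mirror of the VFIO wrapper structure. */
struct type1_cache_invalidate {
	uint32_t argsz;
	uint32_t flags;
	struct cache_invalidate_info cache_info;
};

#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* version sits immediately after flags, with no padding in between. */
static_assert(offsetofend(struct type1_cache_invalidate, flags) ==
	      offsetof(struct type1_cache_invalidate, cache_info.version),
	      "version must directly follow flags");

/* Byte offset at which version can be read from the user buffer. */
static size_t version_offset(void)
{
	return offsetof(struct type1_cache_invalidate, cache_info.version);
}
```

Under these assumptions, reading version at arg + minsz with minsz
computed from flags lands on the same offset as computing minsz from
version's own position.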


> >   
> > > +
> > > > +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (cache_inv.argsz < minsz || cache_inv.flags)
> > > > +			return -EINVAL;
> > > > +
> > > > +		/* Get the version of struct iommu_cache_invalidate_info */
> > > > +		if (copy_from_user(&version,
> > > > +			(void __user *) (arg + minsz), sizeof(version)))
> > > > +			return -EFAULT;
> > > > +
> > > > +		info_size = iommu_uapi_get_data_size(
> > > > +					IOMMU_UAPI_CACHE_INVAL, version);
> > > > +
> > > > +		cache_info = kzalloc(info_size, GFP_KERNEL);
> > > > +		if (!cache_info)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		if (copy_from_user(cache_info,
> > > > +			(void __user *) (arg + minsz), info_size)) {
> > > > +			kfree(cache_info);
> > > > +			return -EFAULT;
> > > > +		}
> > > > +
> > > > +		mutex_lock(&iommu->lock);
> > > > +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> > > > +					    cache_info);
> > 
> > How does a user respond when their cache invalidate fails?  Isn't
> > this also another case where our for_each_dev can fail at an
> > arbitrary point leaving us with no idea whether each device even
> > had the opportunity to perform the invalidation request.  I don't
> > see how we have any chance to maintain coherency after this
> > faults.  
> 
> Then can we make it simple to support singleton group only? 
> 
> >   
> > > +		mutex_unlock(&iommu->lock);
> > > +		kfree(cache_info);
> > > +		return ret;
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 2235bc6..62ca791 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
> > >   */
> > >  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE
> > > + 23)
> > >
> > > +/**
> > > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> > > + *			struct
> > > vfio_iommu_type1_cache_invalidate)
> > > + *
> > > + * Propagate guest IOMMU cache invalidation to the host. The
> > > cache
> > > + * invalidation information is conveyed by @cache_info, the
> > > content
> > > + * format would be structures defined in uapi/linux/iommu.h. User
> > > + * should be aware of that the struct
> > > iommu_cache_invalidate_info
> > > + * has a @version field, vfio needs to parse this field before
> > > getting
> > > + * data from userspace.
> > > + *
> > > + * Availability of this IOCTL is after VFIO_SET_IOMMU.  
> > 
> > Is this a necessary qualifier?  A user can try to call this ioctl at
> > any point, it only makes sense in certain configurations, but it
> > should always "do the right thing" relative to the container iommu
> > config.
> > 
> > Also, I don't see anything in these last few patches testing the
> > operating IOMMU model, what happens when a user calls them when not
> > using the nesting IOMMU?
> > 
> > Is this ioctl and the previous BIND ioctl only valid when configured
> > for the nesting IOMMU type?  
> 
> I think so. We should add the nesting check in those new ioctls.
> 
I also added a nesting domain attribute check in the IOMMU driver, so
binding a guest PASID will fail if nesting mode is not supported. There
will be no invalidation without a bind.

> >   
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +struct vfio_iommu_type1_cache_invalidate {
> > > +	__u32   argsz;
> > > +	__u32   flags;
> > > +	struct	iommu_cache_invalidate_info cache_info;
> > > +};
> > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE,
> > > VFIO_BASE  
> > + 24)
> > 
> > The future extension capabilities of this ioctl worry me, I wonder
> > if we should do another data[] with flag defining that data as
> > CACHE_INFO.  
> 
> Can you elaborate? Does it mean with this way we don't rely on iommu
> driver to provide version_to_size conversion and instead we just pass
> data[] to iommu driver for further audit?
> 
I guess Alex meant we do something similar to:
struct vfio_irq_set {
	__u32	argsz;
	__u32	flags;
#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
	__u32	index;
	__u32	start;
	__u32	count;
	__u8	data[];
};

So we could do:
struct vfio_iommu_type1_cache_invalidate {
	__u32   argsz;
#define VFIO_INVL_DATA_NONE
#define VFIO_INVL_DATA_CACHE_INFO		(1 << 1)
	__u32   flags;
	__u8	data[];
};

We still need the version_to_size conversion, but only under
if (flags & VFIO_INVL_DATA_CACHE_INFO)
	get_size_from_version();
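
A rough userspace sketch of that flow, assuming a header with argsz,
flags and data[]. Everything named mock_* or MOCK_* is hypothetical,
and the version-to-size table is invented; it merely stands in for
iommu_uapi_get_data_size(). The point is that the flag selects how
data[] is interpreted, and argsz is re-checked once the size is known.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_INVL_DATA_CACHE_INFO	(1u << 1)	/* hypothetical flag */

struct mock_invalidate_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint8_t  data[];
};

/* Invented table: version 1 of the payload is 40 bytes long. */
int mock_size_from_version(uint32_t version)
{
	switch (version) {
	case 1:
		return 40;
	default:
		return -EINVAL;
	}
}

/*
 * Return the number of payload bytes to copy, or a negative errno.
 * "user" and "len" model the user buffer; memcpy() stands in for
 * copy_from_user().
 */
int mock_payload_size(const void *user, size_t len)
{
	struct mock_invalidate_hdr hdr;
	uint32_t version;
	int size;

	if (len < sizeof(hdr))
		return -EFAULT;
	memcpy(&hdr, user, sizeof(hdr));
	if (hdr.argsz < sizeof(hdr) || hdr.argsz > len)
		return -EINVAL;

	/* Exactly one known payload type must be selected. */
	if (hdr.flags != MOCK_INVL_DATA_CACHE_INFO)
		return -EINVAL;

	/* The payload starts with its version; argsz must cover it. */
	if (hdr.argsz < sizeof(hdr) + sizeof(version))
		return -EINVAL;
	memcpy(&version, (const uint8_t *)user + sizeof(hdr), sizeof(version));

	size = mock_size_from_version(version);
	if (size < 0)
		return size;

	/* Re-evaluate argsz now that the payload size is known. */
	if (hdr.argsz < sizeof(hdr) + (size_t)size)
		return -EINVAL;
	return size;
}
```

A second payload type would add another flag and another branch here
rather than a new ioctl.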

> >   
> > > +
> > >  /* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU
> > > --------  
> > */  
> > >
> > >  /*  
> 
> Thanks
> Kevin

[Jacob Pan]

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-03  6:39     ` Tian, Kevin
  2020-04-03 15:31       ` Jacob Pan
@ 2020-04-03 15:34       ` Alex Williamson
  2020-04-08  2:28         ` Liu, Yi L
  2020-04-16 10:40         ` Liu, Yi L
  1 sibling, 2 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-03 15:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Fri, 3 Apr 2020 06:39:22 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 4:24 AM
> > 
> > On Sun, 22 Mar 2020 05:32:04 -0700
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > From: Liu Yi L <yi.l.liu@linux.intel.com>
> > >
> > > For VFIO IOMMUs with the type VFIO_TYPE1_NESTING_IOMMU, guest  
> > "owns" the  
> > > first-level/stage-1 translation structures, the host IOMMU driver has no
> > > knowledge of first-level/stage-1 structure cache updates unless the guest
> > > invalidation requests are trapped and propagated to the host.
> > >
> > > This patch adds a new IOCTL VFIO_IOMMU_CACHE_INVALIDATE to  
> > propagate guest  
> > > first-level/stage-1 IOMMU cache invalidations to host to ensure IOMMU  
> > cache  
> > > correctness.
> > >
> > > With this patch, vSVA (Virtual Shared Virtual Addressing) can be used safely
> > > as the host IOMMU iotlb correctness are ensured.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@linux.intel.com>
> > > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 49  
> > +++++++++++++++++++++++++++++++++++++++++  
> > >  include/uapi/linux/vfio.h       | 22 ++++++++++++++++++
> > >  2 files changed, 71 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c  
> > b/drivers/vfio/vfio_iommu_type1.c  
> > > index a877747..937ec3f 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2423,6 +2423,15 @@ static long  
> > vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,  
> > >  	return ret;
> > >  }
> > >
> > > +static int vfio_cache_inv_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_cache_invalidate_info *cache_inv_info =
> > > +		(struct iommu_cache_invalidate_info *) dc->data;
> > > +
> > > +	return iommu_cache_invalidate(dc->domain, dev, cache_inv_info);
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2629,6 +2638,46 @@ static long vfio_iommu_type1_ioctl(void  
> > *iommu_data,  
> > >  		}
> > >  		kfree(gbind_data);
> > >  		return ret;
> > > +	} else if (cmd == VFIO_IOMMU_CACHE_INVALIDATE) {
> > > +		struct vfio_iommu_type1_cache_invalidate cache_inv;
> > > +		u32 version;
> > > +		int info_size;
> > > +		void *cache_info;
> > > +		int ret;
> > > +
> > > +		minsz = offsetofend(struct  
> > vfio_iommu_type1_cache_invalidate,  
> > > +				    flags);  
> > 
> > This breaks backward compatibility as soon as struct
> > iommu_cache_invalidate_info changes size by its defined versioning
> > scheme.  ie. a field gets added, the version is bumped, all existing
> > userspace breaks.  Our minsz is offsetofend to the version field,
> > interpret the version to size, then reevaluate argsz.  
> 
> btw the version scheme is challenged by Christoph Hellwig. After
> some discussions, we need your guidance how to move forward.
> Jacob summarized available options below:
> 	https://lkml.org/lkml/2020/4/2/876

Ok

> >   
> > > +
> > > +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (cache_inv.argsz < minsz || cache_inv.flags)
> > > +			return -EINVAL;
> > > +
> > > +		/* Get the version of struct iommu_cache_invalidate_info */
> > > +		if (copy_from_user(&version,
> > > +			(void __user *) (arg + minsz), sizeof(version)))
> > > +			return -EFAULT;
> > > +
> > > +		info_size = iommu_uapi_get_data_size(
> > > +					IOMMU_UAPI_CACHE_INVAL,  
> > version);  
> > > +
> > > +		cache_info = kzalloc(info_size, GFP_KERNEL);
> > > +		if (!cache_info)
> > > +			return -ENOMEM;
> > > +
> > > +		if (copy_from_user(cache_info,
> > > +			(void __user *) (arg + minsz), info_size)) {
> > > +			kfree(cache_info);
> > > +			return -EFAULT;
> > > +		}
> > > +
> > > +		mutex_lock(&iommu->lock);
> > > +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> > > +					    cache_info);  
> > 
> > How does a user respond when their cache invalidate fails?  Isn't this
> > also another case where our for_each_dev can fail at an arbitrary point
> > leaving us with no idea whether each device even had the opportunity to
> > perform the invalidation request.  I don't see how we have any chance
> > to maintain coherency after this faults.  
> 
> Then can we make it simple to support singleton group only? 

Are you suggesting a single group per container or a single device per
group?  Unless we have both, aren't we always going to have this issue?
OTOH, why should a cache invalidate fail?

> > > +		mutex_unlock(&iommu->lock);
> > > +		kfree(cache_info);
> > > +		return ret;
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 2235bc6..62ca791 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
> > >   */
> > >  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> > >
> > > +/**
> > > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> > > + *			struct vfio_iommu_type1_cache_invalidate)
> > > + *
> > > + * Propagate guest IOMMU cache invalidation to the host. The cache
> > > + * invalidation information is conveyed by @cache_info, the content
> > > + * format would be structures defined in uapi/linux/iommu.h. User
> > > + * should be aware of that the struct  iommu_cache_invalidate_info
> > > + * has a @version field, vfio needs to parse this field before getting
> > > + * data from userspace.
> > > + *
> > > + * Availability of this IOCTL is after VFIO_SET_IOMMU.  
> > 
> > Is this a necessary qualifier?  A user can try to call this ioctl at
> > any point, it only makes sense in certain configurations, but it should
> > always "do the right thing" relative to the container iommu config.
> > 
> > Also, I don't see anything in these last few patches testing the
> > operating IOMMU model, what happens when a user calls them when not
> > using the nesting IOMMU?
> > 
> > Is this ioctl and the previous BIND ioctl only valid when configured
> > for the nesting IOMMU type?  
> 
> I think so. We should add the nesting check in those new ioctls.
> 
> >   
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +struct vfio_iommu_type1_cache_invalidate {
> > > +	__u32   argsz;
> > > +	__u32   flags;
> > > +	struct	iommu_cache_invalidate_info cache_info;
> > > +};
> > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE  
> > + 24)
> > 
> > The future extension capabilities of this ioctl worry me, I wonder if
> > we should do another data[] with flag defining that data as CACHE_INFO.  
> 
> Can you elaborate? Does it mean with this way we don't rely on iommu
> driver to provide version_to_size conversion and instead we just pass
> data[] to iommu driver for further audit?

No, my concern is that this ioctl has a single function, strictly tied
to the iommu uapi.  If we replace cache_info with data[] then we can
define a flag to specify that data[] is struct
iommu_cache_invalidate_info, and if we need to, a different flag to
identify data[] as something else.  For example if we get stuck
expanding cache_info to meet new demands and develop a new uapi to
solve that, how would we expand this ioctl to support it rather than
also creating a new ioctl?  There's also a trade-off in making the ioctl
usage more difficult for the user.  I'd still expect the vfio layer to
check the flag and interpret data[] as indicated by the flag rather
than just passing a blob of opaque data to the iommu layer though.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-04-03  8:17     ` Liu, Yi L
@ 2020-04-03 17:28       ` Alex Williamson
  2020-04-04 11:36         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-03 17:28 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Fri, 3 Apr 2020 08:17:44 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson < alex.williamson@redhat.com >
> > Sent: Friday, April 3, 2020 2:01 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > Subject: Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to
> > userspace
> > 
> > On Sun, 22 Mar 2020 05:32:00 -0700
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > This patch reports PASID alloc/free availability to userspace (e.g.
> > > QEMU) thus userspace could do a pre-check before utilizing this feature.
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
> > >  include/uapi/linux/vfio.h       |  8 ++++++++
> > >  2 files changed, 36 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c index e40afc0..ddd1ffe 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct  
> > vfio_iommu *iommu,  
> > >  	return ret;
> > >  }
> > >
> > > +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > > +					 struct vfio_info_cap *caps)
> > > +{
> > > +	struct vfio_info_cap_header *header;
> > > +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > > +
> > > +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > > +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > > +	if (IS_ERR(header))
> > > +		return PTR_ERR(header);
> > > +
> > > +	nesting_cap = container_of(header,
> > > +				struct vfio_iommu_type1_info_cap_nesting,
> > > +				header);
> > > +
> > > +	nesting_cap->nesting_capabilities = 0;
> > > +	if (iommu->nesting) {
> > > +		/* nesting iommu type supports PASID requests (alloc/free) */
> > > +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)  { @@ -  
> > 2283,6 +2307,10 @@  
> > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  		if (ret)
> > >  			return ret;
> > >
> > > +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> > > +		if (ret)
> > > +			return ret;
> > > +
> > >  		if (caps.size) {
> > >  			info.flags |= VFIO_IOMMU_INFO_CAPS;
> > >
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 298ac80..8837219 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
> > >  	struct	vfio_iova_range iova_ranges[];
> > >  };
> > >
> > > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> > > +
> > > +struct vfio_iommu_type1_info_cap_nesting {
> > > +	struct	vfio_info_cap_header header;
> > > +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
> > > +	__u32	nesting_capabilities;
> > > +};
> > > +
> > >  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> > >
> > >  /**  
> > 
> > I think this answers my PROBE question on patch 1/.   
> yep.
> > Should the quota/usage be exposed to the user here?  Thanks,  
> 
> Do you mean report the quota available for this user in this cap info as well?

Yes.  Would it be useful?
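
As a rough illustration of what reporting the quota and its current
usage could look like, here is a userspace sketch. The structure and
helpers are hypothetical (nothing named mock_* exists in the series);
the idea is just that userspace can derive the remaining count from
quota and used, or be handed it directly.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-VM accounting that a capability could expose. */
struct mock_pasid_quota_info {
	uint32_t quota;	/* maximum PASIDs this VM may hold */
	uint32_t used;	/* PASIDs currently allocated */
};

/* Number of PASIDs that can still be allocated. */
uint32_t mock_pasid_remaining(const struct mock_pasid_quota_info *q)
{
	return q->used >= q->quota ? 0 : q->quota - q->used;
}

/* Account for one allocation; fail once the quota is exhausted. */
int mock_pasid_charge(struct mock_pasid_quota_info *q)
{
	if (q->used >= q->quota)
		return -ENOSPC;
	q->used++;
	return 0;
}
```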

> For usage, do you mean the alloc and free or others?

I mean how much of the quota is currently allocated, or
alternatively, how much remains.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03 13:12     ` Liu, Yi L
@ 2020-04-03 17:50       ` Alex Williamson
  2020-04-07  4:52         ` Tian, Kevin
  2020-04-08  0:52         ` Liu, Yi L
  0 siblings, 2 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-03 17:50 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Fri, 3 Apr 2020 13:12:50 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 1:50 AM
> > Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > 
> > On Sun, 22 Mar 2020 05:31:58 -0700
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > For a long time, devices have only one DMA address space from platform
> > > IOMMU's point of view. This is true for both bare metal and directed-
> > > access in virtualization environment. Reason is the source ID of DMA in
> > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > DMA isolation. However, this is changing with the latest advancement in
> > > I/O technology area. More and more platform vendors are utilizing the PCIe
> > > PASID TLP prefix in DMA requests, thus to give devices with multiple DMA
> > > address spaces as identified by their individual PASIDs. For example,
> > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > let device access multiple process virtual address space by binding the
> > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > software and programmed to device per device specific manner. Devices
> > > which support PASID capability are called PASID-capable devices. If such
> > > devices are passed through to VMs, guest software are also able to bind
> > > guest process virtual address space on such devices. Therefore, the guest
> > > software could reuse the bare metal software programming model, which
> > > means guest software will also allocate PASID and program it to device
> > > directly. This is a dangerous situation since it has potential PASID
> > > conflicts and unauthorized address space access. It would be safer to
> > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > are managed system-wide.  
> > 
> > Providing an allocation interface only allows for collaborative usage
> > of PASIDs though.  Do we have any ability to enforce PASID usage or can
> > a user spoof other PASIDs on the same BDF?
> >   
> > > This patch adds VFIO_IOMMU_PASID_REQUEST ioctl which aims to passdown
> > > PASID allocation/free request from the virtual IOMMU. Additionally, such
> > > requests are intended to be invoked by QEMU or other applications which
> > > are running in userspace, it is necessary to have a mechanism to prevent
> > > single application from abusing available PASIDs in system. With such
> > > consideration, this patch tracks the VFIO PASID allocation per-VM. There
> > > was a discussion to make quota to be per assigned devices. e.g. if a VM
> > > has many assigned devices, then it should have more quota. However, it
> > > is not sure how many PASIDs an assigned devices will use. e.g. it is
> > > possible that a VM with multiples assigned devices but requests less
> > > PASIDs. Therefore per-VM quota would be better.
> > >
> > > This patch uses struct mm pointer as a per-VM token. We also considered
> > > using task structure pointer and vfio_iommu structure pointer. However,
> > > task structure is per-thread, which means it cannot achieve per-VM PASID
> > > alloc tracking purpose. While for vfio_iommu structure, it is visible
> > > only within vfio. Therefore, structure mm pointer is selected. This patch
> > > adds a structure vfio_mm. A vfio_mm is created when the first vfio
> > > container is opened by a VM. On the reverse order, vfio_mm is free when
> > > the last vfio container is released. Each VM is assigned with a PASID
> > > quota, so that it is not able to request PASID beyond its quota. This
> > > patch adds a default quota of 1000. This quota could be tuned by
> > > administrator. Making PASID quota tunable will be added in another patch
> > > in this series.
> > >
> > > Previous discussions:
> > > https://patchwork.kernel.org/patch/11209429/
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio.c             | 130 ++++++++++++++++++++++++++++++++++++++++
> > >  drivers/vfio/vfio_iommu_type1.c | 104 ++++++++++++++++++++++++++++++++
> > >  include/linux/vfio.h            |  20 +++++++
> > >  include/uapi/linux/vfio.h       |  41 +++++++++++++
> > >  4 files changed, 295 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index c848262..d13b483 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -32,6 +32,7 @@
> > >  #include <linux/vfio.h>
> > >  #include <linux/wait.h>
> > >  #include <linux/sched/signal.h>
> > > +#include <linux/sched/mm.h>
> > >
> > >  #define DRIVER_VERSION	"0.3"
> > >  #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> > > @@ -46,6 +47,8 @@ static struct vfio {
> > >  	struct mutex			group_lock;
> > >  	struct cdev			group_cdev;
> > >  	dev_t				group_devt;
> > > +	struct list_head		vfio_mm_list;
> > > +	struct mutex			vfio_mm_lock;
> > >  	wait_queue_head_t		release_q;
> > >  } vfio;
> > >
> > > @@ -2129,6 +2132,131 @@ int vfio_unregister_notifier(struct device *dev, enum  
> > vfio_notify_type type,  
> > >  EXPORT_SYMBOL(vfio_unregister_notifier);
> > >
> > >  /**
> > > + * VFIO_MM objects - create, release, get, put, search
> > > + * Caller of the function should have held vfio.vfio_mm_lock.
> > > + */
> > > +static struct vfio_mm *vfio_create_mm(struct mm_struct *mm)
> > > +{
> > > +	struct vfio_mm *vmm;
> > > +	struct vfio_mm_token *token;
> > > +	int ret = 0;
> > > +
> > > +	vmm = kzalloc(sizeof(*vmm), GFP_KERNEL);
> > > +	if (!vmm)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	/* Per mm IOASID set used for quota control and group operations */
> > > +	ret = ioasid_alloc_set((struct ioasid_set *) mm,
> > > +			       VFIO_DEFAULT_PASID_QUOTA, &vmm->ioasid_sid);
> > > +	if (ret) {
> > > +		kfree(vmm);
> > > +		return ERR_PTR(ret);
> > > +	}
> > > +
> > > +	kref_init(&vmm->kref);
> > > +	token = &vmm->token;
> > > +	token->val = mm;
> > > +	vmm->pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > > +	mutex_init(&vmm->pasid_lock);
> > > +
> > > +	list_add(&vmm->vfio_next, &vfio.vfio_mm_list);
> > > +
> > > +	return vmm;
> > > +}
> > > +
> > > +static void vfio_mm_unlock_and_free(struct vfio_mm *vmm)
> > > +{
> > > +	/* destroy the ioasid set */
> > > +	ioasid_free_set(vmm->ioasid_sid, true);
> > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > +	kfree(vmm);
> > > +}
> > > +
> > > +/* called with vfio.vfio_mm_lock held */
> > > +static void vfio_mm_release(struct kref *kref)
> > > +{
> > > +	struct vfio_mm *vmm = container_of(kref, struct vfio_mm, kref);
> > > +
> > > +	list_del(&vmm->vfio_next);
> > > +	vfio_mm_unlock_and_free(vmm);
> > > +}
> > > +
> > > +void vfio_mm_put(struct vfio_mm *vmm)
> > > +{
> > > +	kref_put_mutex(&vmm->kref, vfio_mm_release, &vfio.vfio_mm_lock);
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_put);
> > > +
> > > +/* Assume vfio_mm_lock or vfio_mm reference is held */
> > > +static void vfio_mm_get(struct vfio_mm *vmm)
> > > +{
> > > +	kref_get(&vmm->kref);
> > > +}
> > > +
> > > +struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task)
> > > +{
> > > +	struct mm_struct *mm = get_task_mm(task);
> > > +	struct vfio_mm *vmm;
> > > +	unsigned long long val = (unsigned long long) mm;
> > > +
> > > +	mutex_lock(&vfio.vfio_mm_lock);
> > > +	list_for_each_entry(vmm, &vfio.vfio_mm_list, vfio_next) {
> > > +		if (vmm->token.val == val) {
> > > +			vfio_mm_get(vmm);
> > > +			goto out;
> > > +		}
> > > +	}
> > > +
> > > +	vmm = vfio_create_mm(mm);
> > > +	if (IS_ERR(vmm))
> > > +		vmm = NULL;
> > > +out:
> > > +	mutex_unlock(&vfio.vfio_mm_lock);
> > > +	mmput(mm);
> > > +	return vmm;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_get_from_task);
> > > +
> > > +int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max)
> > > +{
> > > +	ioasid_t pasid;
> > > +	int ret = -ENOSPC;
> > > +
> > > +	mutex_lock(&vmm->pasid_lock);
> > > +
> > > +	pasid = ioasid_alloc(vmm->ioasid_sid, min, max, NULL);
> > > +	if (pasid == INVALID_IOASID) {
> > > +		ret = -ENOSPC;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = pasid;
> > > +out_unlock:
> > > +	mutex_unlock(&vmm->pasid_lock);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_alloc);
> > > +
> > > +int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid)
> > > +{
> > > +	void *pdata;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&vmm->pasid_lock);
> > > +	pdata = ioasid_find(vmm->ioasid_sid, pasid, NULL);
> > > +	if (IS_ERR(pdata)) {
> > > +		ret = PTR_ERR(pdata);
> > > +		goto out_unlock;
> > > +	}
> > > +	ioasid_free(pasid);
> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&vmm->pasid_lock);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_mm_pasid_free);
> > > +
> > > +/**
> > >   * Module/class support
> > >   */
> > >  static char *vfio_devnode(struct device *dev, umode_t *mode)
> > > @@ -2151,8 +2279,10 @@ static int __init vfio_init(void)
> > >  	idr_init(&vfio.group_idr);
> > >  	mutex_init(&vfio.group_lock);
> > >  	mutex_init(&vfio.iommu_drivers_lock);
> > > +	mutex_init(&vfio.vfio_mm_lock);
> > >  	INIT_LIST_HEAD(&vfio.group_list);
> > >  	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
> > > +	INIT_LIST_HEAD(&vfio.vfio_mm_list);
> > >  	init_waitqueue_head(&vfio.release_q);
> > >
> > >  	ret = misc_register(&vfio_dev);  
> > 
> > Is vfio.c the right place for any of the above?  It seems like it could
> > all be in a separate vfio_pasid module, similar to our virqfd module.  
> 
> I think it could be a separate vfio_pasid module. let me make it in next
> version if it's your preference. :-)
> 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index a177bf2..331ceee 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -70,6 +70,7 @@ struct vfio_iommu {
> > >  	unsigned int		dma_avail;
> > >  	bool			v2;
> > >  	bool			nesting;
> > > +	struct vfio_mm		*vmm;
> > >  };
> > >
> > >  struct vfio_domain {
> > > @@ -2018,6 +2019,7 @@ static void vfio_iommu_type1_detach_group(void  
> > *iommu_data,  
> > >  static void *vfio_iommu_type1_open(unsigned long arg)
> > >  {
> > >  	struct vfio_iommu *iommu;
> > > +	struct vfio_mm *vmm = NULL;
> > >
> > >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > >  	if (!iommu)
> > > @@ -2043,6 +2045,10 @@ static void *vfio_iommu_type1_open(unsigned long  
> > arg)  
> > >  	iommu->dma_avail = dma_entry_limit;
> > >  	mutex_init(&iommu->lock);
> > >  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > > +	vmm = vfio_mm_get_from_task(current);
> > > +	if (!vmm)
> > > +		pr_err("Failed to get vfio_mm track\n");  
> > 
> > Doesn't this presume everyone is instantly running PASID capable hosts?
> > Looks like a noisy support regression to me.  
> 
> right, it is. Kevin also questioned this part, I'll refine it and avoid
> regression noisy.
> 
> > > +	iommu->vmm = vmm;
> > >
> > >  	return iommu;
> > >  }
> > > @@ -2084,6 +2090,8 @@ static void vfio_iommu_type1_release(void  
> > *iommu_data)  
> > >  	}
> > >
> > >  	vfio_iommu_iova_free(&iommu->iova_list);
> > > +	if (iommu->vmm)
> > > +		vfio_mm_put(iommu->vmm);
> > >
> > >  	kfree(iommu);
> > >  }
> > > @@ -2172,6 +2180,55 @@ static int vfio_iommu_iova_build_caps(struct  
> > vfio_iommu *iommu,  
> > >  	return ret;
> > >  }
> > >
> > > +static bool vfio_iommu_type1_pasid_req_valid(u32 flags)
> > > +{
> > > +	return !((flags & ~VFIO_PASID_REQUEST_MASK) ||
> > > +		 (flags & VFIO_IOMMU_PASID_ALLOC &&
> > > +		  flags & VFIO_IOMMU_PASID_FREE));
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_alloc(struct vfio_iommu *iommu,
> > > +					 int min,
> > > +					 int max)
> > > +{
> > > +	struct vfio_mm *vmm = iommu->vmm;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EFAULT;
> > > +		goto out_unlock;
> > > +	}  
> > 
> > Non-iommu backed mdevs are excluded from this?  Is this a matter of
> > wiring the call out through the mdev parent device, or is this just
> > possible?  
> 
> At the beginning, non-iommu backed mdevs are excluded. However,
> Combined with your succeeded comment. I think this check should be
> removed as the PASID alloc/free capability should be available as
> long as the container is backed by a pasid-capable iommu backend.
> So should remove it, and it is the same with the free path.
> 
> >   
> > > +	if (vmm)
> > > +		ret = vfio_mm_pasid_alloc(vmm, min, max);
> > > +	else
> > > +		ret = -EINVAL;
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > +				       unsigned int pasid)
> > > +{
> > > +	struct vfio_mm *vmm = iommu->vmm;
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EFAULT;
> > > +		goto out_unlock;
> > > +	}  
> > 
> > So if a container had an iommu backed device when the pasid was
> > allocated, but it was removed, now they can't free it?  Why do we need
> > the check above?  
> 
> should be removed. thanks for spotting it.
> 
> > > +
> > > +	if (vmm)
> > > +		ret = vfio_mm_pasid_free(vmm, pasid);
> > > +	else
> > > +		ret = -EINVAL;
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void  
> > *iommu_data,  
> > >
> > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > >  			-EFAULT : 0;
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > +		struct vfio_iommu_type1_pasid_request req;
> > > +		unsigned long offset;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > +				    flags);
> > > +
> > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (req.argsz < minsz ||
> > > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > +			return -EINVAL;
> > > +
> > > +		if (copy_from_user((void *)&req + minsz,
> > > +				   (void __user *)arg + minsz,
> > > +				   sizeof(req) - minsz))
> > > +			return -EFAULT;  
> > 
> > Huh?  Why do we have argsz if we're going to assume this is here?  
> 
> do you mean replacing sizeof(req) with argsz? if yes, I can do that.

No, I mean the user tells us how much data they've provided via argsz.
We set minsz to the end of flags and verify that argsz includes flags.
Then we proceed to ignore argsz when copying the remainder of the
structure, simply assuming the user has provided it.
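
To make the point concrete, here is a rough userspace sketch of the
ordering I have in mind. It is not the actual vfio code: memcpy() stands
in for copy_from_user(), and END_OF(), pasid_req_size() and
fetch_pasid_request() are made-up names for illustration. The only point
is that argsz is checked against the flag-specific size before anything
past flags is copied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as the structure proposed in the patch under review. */
struct pasid_request {
	uint32_t argsz;
	uint32_t flags;
	union {
		struct {
			uint32_t min;
			uint32_t max;
			uint32_t result;
		} alloc_pasid;
		uint32_t free_pasid;
	};
};

#define PASID_ALLOC	(1u << 0)
#define PASID_FREE	(1u << 1)

/* Userspace equivalent of the offsetofend() pattern. */
#define END_OF(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Hypothetical helper: bytes the caller must supply for a given flag. */
static size_t pasid_req_size(uint32_t flags)
{
	switch (flags) {
	case PASID_ALLOC:
		return END_OF(struct pasid_request, alloc_pasid);
	case PASID_FREE:
		return END_OF(struct pasid_request, free_pasid);
	default:
		return 0;	/* zero or several operation bits set */
	}
}

/* memcpy() stands in for copy_from_user() in this sketch. */
static int fetch_pasid_request(struct pasid_request *req, const void *ubuf)
{
	size_t minsz = END_OF(struct pasid_request, flags);
	size_t need;

	memset(req, 0, sizeof(*req));
	memcpy(req, ubuf, minsz);	/* header only: argsz and flags */

	if (req->argsz < minsz)
		return -EINVAL;

	/* The user must have declared room for the payload we read. */
	need = pasid_req_size(req->flags);
	if (!need || req->argsz < need)
		return -EINVAL;

	memcpy((char *)req + minsz, (const char *)ubuf + minsz, need - minsz);
	return 0;
}
```

With this ordering a request that sets ALLOC but declares an argsz of
only 8 bytes is rejected before the min/max fields are ever touched.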
 
> > > +
> > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > +		{
> > > +			int ret = 0, result;
> > > +
> > > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > > +							req.alloc_pasid.min,
> > > +							req.alloc_pasid.max);
> > > +			if (result > 0) {
> > > +				offset = offsetof(
> > > +					struct vfio_iommu_type1_pasid_request,
> > > +					alloc_pasid.result);
> > > +				ret = copy_to_user(
> > > +					      (void __user *) (arg + offset),
> > > +					      &result, sizeof(result));  
> > 
> > Again assuming argsz supports this.  
> 
> same as above.
> 
> >   
> > > +			} else {
> > > +				pr_debug("%s: PASID alloc failed\n", __func__);  
> > 
> > rate limit?  
> 
> not quite get. could you give more hints?

A user can spam the host logs simply by allocating their quota of
PASIDs and then trying to allocate more, or by specifying min/max such
that they cannot allocate the requested PASID.  If this logging is
necessary for debugging, it should be ratelimited to avoid a DoS on the
host.
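
For reference, the idea behind ratelimiting can be sketched in plain
userspace C as below. This is only an illustration of the windowed
"burst per interval" scheme, with a made-up struct and function name; in
the kernel you would not open-code this, the printk helpers such as
pr_debug_ratelimited() already exist for the purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative state only; not the kernel's ratelimit structure. */
struct ratelimit {
	uint64_t interval;	/* window length, in caller-defined time units */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages emitted in the current window */
	unsigned int missed;	/* messages suppressed so far */
};

/* Returns true if the caller may log at time @now, false to suppress. */
static bool ratelimit_ok(struct ratelimit *rl, uint64_t now)
{
	assert(rl);

	/* Open a fresh window once the previous one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

A user hammering the failing allocation path then produces at most
"burst" lines per interval, however many ioctls are issued.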

> > > +				ret = -EFAULT;
> > > +			}
> > > +			return ret;
> > > +		}
> > > +		case VFIO_IOMMU_PASID_FREE:
> > > +			return vfio_iommu_type1_pasid_free(iommu,
> > > +							   req.free_pasid);
> > > +		default:
> > > +			return -EINVAL;
> > > +		}
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index e42a711..75f9f7f1 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -89,6 +89,26 @@ extern int vfio_register_iommu_driver(const struct  
> > vfio_iommu_driver_ops *ops);  
> > >  extern void vfio_unregister_iommu_driver(
> > >  				const struct vfio_iommu_driver_ops *ops);
> > >
> > > +#define VFIO_DEFAULT_PASID_QUOTA	1000
> > > +struct vfio_mm_token {
> > > +	unsigned long long val;
> > > +};
> > > +
> > > +struct vfio_mm {
> > > +	struct kref			kref;
> > > +	struct vfio_mm_token		token;
> > > +	int				ioasid_sid;
> > > +	/* protect @pasid_quota field and pasid allocation/free */
> > > +	struct mutex			pasid_lock;
> > > +	int				pasid_quota;
> > > +	struct list_head		vfio_next;
> > > +};
> > > +
> > > +extern struct vfio_mm *vfio_mm_get_from_task(struct task_struct *task);
> > > +extern void vfio_mm_put(struct vfio_mm *vmm);
> > > +extern int vfio_mm_pasid_alloc(struct vfio_mm *vmm, int min, int max);
> > > +extern int vfio_mm_pasid_free(struct vfio_mm *vmm, ioasid_t pasid);
> > > +
> > >  /*
> > >   * External user API
> > >   */
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 9e843a1..298ac80 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > >
> > > +/*
> > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > + * has been extended to support DMA isolation in fine-grain.
> > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > + * and free need to be system wide. This structure defines
> > > + * the info for pasid alloc/free between user space and kernel
> > > + * space.
> > > + *
> > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > + */
> > > +struct vfio_iommu_type1_pasid_request {
> > > +	__u32	argsz;
> > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > +	__u32	flags;
> > > +	union {
> > > +		struct {
> > > +			__u32 min;
> > > +			__u32 max;
> > > +			__u32 result;
> > > +		} alloc_pasid;
> > > +		__u32 free_pasid;
> > > +	};  
> > 
> > We seem to be using __u8 data[] lately where the struct at data is
> > defined by the flags.  should we do that here?  
> 
> yeah, I can do that. BTW. Do you want to let the structure in the
> lately patch share the same structure with this one? As I can foresee,
> the two structures would look like similar as both of them include
> argsz, flags and data[] fields. The difference is the definition of
> flags. what about your opinion?
> 
> struct vfio_iommu_type1_pasid_request {
> 	__u32	argsz;
> #define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> #define VFIO_IOMMU_PASID_FREE	(1 << 1)
> 	__u32	flags;
> 	__u8	data[];
> };
> 
> struct vfio_iommu_type1_bind {
>         __u32           argsz;
>         __u32           flags;
> #define VFIO_IOMMU_BIND_GUEST_PGTBL     (1 << 0)
> #define VFIO_IOMMU_UNBIND_GUEST_PGTBL   (1 << 1)
>         __u8            data[];
> };


Yes, I was even wondering the same for the cache invalidate ioctl, or
whether this is going too far for a general purpose "everything related
to PASIDs" ioctl.  We need to factor usability into the equation too.
I'd be interested in opinions from others here too.  Clearly I don't
like single use, throw-away ioctls, but I can find myself on either
side of the argument that allocation, binding, and invalidating are all
within the domain of PASIDs and could fall within a single ioctl or
they each represent different facets of managing PASIDs and should have
separate ioctls.  Thanks,

Alex
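
As a usability data point, this is roughly what the userspace side of an
argsz/flags/data[] request could look like. It is a sketch under
assumptions: the struct mirrors the layout quoted above, but pasid_op,
pasid_alloc_args and pasid_op_new() are hypothetical names, and no ioctl
is issued here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Generic header; the meaning of data[] is selected by flags. */
struct pasid_op {
	uint32_t argsz;
	uint32_t flags;
	uint8_t  data[];
};

/* Hypothetical payload for an "alloc" operation. */
struct pasid_alloc_args {
	uint32_t min;
	uint32_t max;
	uint32_t result;
};

/*
 * Build a request whose argsz covers the header plus the payload, so
 * the kernel can tell exactly how many bytes the caller provided.
 * Returns NULL on allocation failure or size overflow; the caller
 * releases the request with free().
 */
static struct pasid_op *pasid_op_new(uint32_t flags, const void *payload,
				     size_t len)
{
	struct pasid_op *op;

	if (len > UINT32_MAX - sizeof(*op))
		return NULL;

	op = calloc(1, sizeof(*op) + len);
	if (!op)
		return NULL;

	op->argsz = (uint32_t)(sizeof(*op) + len);
	op->flags = flags;
	if (len)
		memcpy(op->data, payload, len);
	return op;
}
```

The cost is visible here: every caller has to size and fill an untyped
buffer by hand, which is the usability concern to weigh against having
fewer ioctls.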


> > > +};
> > > +
> > > +#define VFIO_PASID_REQUEST_MASK	(VFIO_IOMMU_PASID_ALLOC | \
> > > +					 VFIO_IOMMU_PASID_FREE)
> > > +
> > > +/**
> > > + * VFIO_IOMMU_PASID_REQUEST - _IOWR(VFIO_TYPE, VFIO_BASE + 22,
> > > + *				struct vfio_iommu_type1_pasid_request)
> > > + *
> > > + * Availability of this feature depends on PASID support in the device,
> > > + * its bus, the underlying IOMMU and the CPU architecture. In VFIO, it
> > > + * is available after VFIO_SET_IOMMU.
> > > + *
> > > + * returns: 0 on success, -errno on failure.
> > > + */
> > > +#define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)  
> > 
> > So a user needs to try to allocate a PASID in order to test for the
> > support?  Should we have a PROBE flag?  
> 
> answered in a later patch. :-)
> 
> Regards,
> Yi Liu
> 



* Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-03 13:30     ` Liu, Yi L
@ 2020-04-03 18:11       ` Alex Williamson
  2020-04-04 10:28         ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Alex Williamson @ 2020-04-03 18:11 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Fri, 3 Apr 2020 13:30:49 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 3:57 AM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > 
> > On Sun, 22 Mar 2020 05:32:03 -0700
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > From: Liu Yi L <yi.l.liu@intel.com>
> > >
> > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by hardware
> > > IOMMUs that have nesting DMA translation (a.k.a dual stage address
> > > translation). For such hardware IOMMUs, there are two stages/levels of
> > > address translation, and software may let userspace/VM to own the first-
> > > level/stage-1 translation structures. Example of such usage is vSVA (
> > > virtual Shared Virtual Addressing). VM owns the first-level/stage-1
> > > translation structures and bind the structures to host, then hardware
> > > > IOMMU would utilize nesting translation when doing DMA translation for
> > > the devices behind such hardware IOMMU.
> > >
> > > This patch adds vfio support for binding guest translation (a.k.a stage 1)
> > > structure to host iommu. And for VFIO_TYPE1_NESTING_IOMMU, not only bind
> > > guest page table is needed, it also requires to expose interface to guest
> > > for iommu cache invalidation when guest modified the first-level/stage-1
> > > translation structures since hardware needs to be notified to flush stale
> > > iotlbs. This would be introduced in next patch.
> > >
> > > In this patch, guest page table bind and unbind are done by using flags
> > > VFIO_IOMMU_BIND_GUEST_PGTBL and VFIO_IOMMU_UNBIND_GUEST_PGTBL  
> > under IOCTL  
> > > VFIO_IOMMU_BIND, the bind/unbind data are conveyed by
> > > struct iommu_gpasid_bind_data. Before binding guest page table to host,
> > > VM should have got a PASID allocated by host via VFIO_IOMMU_PASID_REQUEST.
> > >
> > > Bind guest translation structures (here is guest page table) to host
> > > are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > >
> > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > Cc: Eric Auger <eric.auger@redhat.com>
> > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 158  
> > ++++++++++++++++++++++++++++++++++++++++  
> > >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> > >  2 files changed, 204 insertions(+)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index 82a9e0b..a877747 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > >  					(!list_empty(&iommu->domain_list))
> > >
> > > +struct domain_capsule {
> > > +	struct iommu_domain *domain;
> > > +	void *data;
> > > +};
> > > +
> > > +/* iommu->lock must be held */
> > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > +		      int (*fn)(struct device *dev, void *data),
> > > +		      void *data)
> > > +{
> > > +	struct domain_capsule dc = {.data = data};
> > > +	struct vfio_domain *d;
> > > +	struct vfio_group *g;
> > > +	int ret = 0;
> > > +
> > > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > > +		dc.domain = d->domain;
> > > +		list_for_each_entry(g, &d->group_list, next) {
> > > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > > +						       &dc, fn);
> > > +			if (ret)
> > > +				break;
> > > +		}
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > >  static int put_pfn(unsigned long pfn, int prot);
> > >
> > >  /*
> > > @@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct  
> > vfio_iommu *iommu,  
> > >  	return 0;
> > >  }
> > >
> > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > > +}
> > > +
> > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > > +{
> > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > +
> > > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > +					gbind_data->hpasid);
> > > +}
> > > +
> > > +/**
> > > + * Unbind specific gpasid, caller of this function requires hold
> > > + * vfio_iommu->lock
> > > + */
> > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	return vfio_iommu_for_each_dev(iommu,
> > > +				vfio_unbind_gpasid_fn, gbind_data);
> > > +}
> > > +
> > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_for_each_dev(iommu,
> > > +			vfio_bind_gpasid_fn, gbind_data);
> > > +	/*
> > > +	 * If bind failed, it may not be a total failure. Some devices
> > > +	 * within the iommu group may have bind successfully. Although
> > > +	 * we don't enable pasid capability for non-singletion iommu
> > > +	 * groups, a unbind operation would be helpful to ensure no
> > > +	 * partial binding for an iommu group.  
> > 
> > Where was the non-singleton group restriction done, I missed that.  
> 
> Hmm, it's missed. thanks for spotting it. How about adding this
> check in the vfio_iommu_for_each_dev()? If looped a non-singleton
> group, just skip it. It applies to the cache_inv path all the
> same.

I don't really understand the singleton issue, which is why I was
surprised to see this since I didn't see a discussion previously.
Skipping a non-singleton group seems like unpredictable behavior to the
user though.

> > > +	 */
> > > +	if (ret)
> > > +		/*
> > > +		 * Undo all binds that already succeeded, no need to
> > > +		 * check the return value here since some device within
> > > +		 * the group has no successful bind when coming to this
> > > +		 * place switch.
> > > +		 */
> > > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);  
> > 
> > However, the for_each_dev function stops when the callback function
> > returns error, are we just assuming we stop at the same device as we
> > faulted on the first time and that we traverse the same set of devices
> > the second time?  It seems strange to me that unbind should be able to
> > fail.  
> 
> unbind can fail if a user attempts to unbind a pasid which is not belonged
> to it or a pasid which hasn't ever been bound. Otherwise, I didn't see a
> reason to fail.

Even if so, this doesn't address the first part of the question.  If
our for_each_dev() callback returns error then the loop stops and we
can't be sure we've triggered it everywhere that it needs to be
triggered.  There are also aspects of whether it's an error to unbind
something that is not bound because the result is still that the pasid
is unbound, right?
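
To illustrate the kind of unwind I'd expect, here is a small userspace
sketch. The device type and the bind/unbind helpers are fakes invented
for the example, not the iommu API. The bind loop remembers how far it
got, and the rollback walks exactly those devices, with an unbind that
cannot fail, so it never stops early.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Fake device used only for this sketch. */
struct fake_dev {
	bool bound;
	bool fail_bind;		/* test knob: make bind fail on this device */
};

static int fake_bind(struct fake_dev *d)
{
	if (d->fail_bind)
		return -EIO;
	d->bound = true;
	return 0;
}

/* Unbind cannot fail: unbinding an unbound device is a no-op. */
static void fake_unbind(struct fake_dev *d)
{
	d->bound = false;
}

/* Bind every device, or leave none bound if any bind fails. */
static int bind_all(struct fake_dev *devs, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = fake_bind(&devs[i]);
		if (ret)
			break;
	}
	if (!ret)
		return 0;

	/* Roll back only the devices bound before the failure, newest first. */
	while (i-- > 0)
		fake_unbind(&devs[i]);

	return ret;
}
```

The rollback here does not depend on a second iteration happening to
stop at the same place as the first one.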

> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > +				struct iommu_gpasid_bind_data *gbind_data)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > +		ret = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);  
> > 
> > How is a user supposed to respond to their unbind failing?  
> 
> If it's a malicious unbind (e.g. unbind a not yet bound pasid or unbind
> a pasid which doesn't belong to current user).

And if it's not a malicious unbind?  To me this is similar semantics to
free() failing.  Is there any remedy other than to abort?  Thanks,

Alex

> > > +
> > > +out_unlock:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +
> > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >  				   unsigned int cmd, unsigned long arg)
> > >  {
> > > @@ -2471,6 +2580,55 @@ static long vfio_iommu_type1_ioctl(void  
> > *iommu_data,  
> > >  		default:
> > >  			return -EINVAL;
> > >  		}
> > > +
> > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > +		struct vfio_iommu_type1_bind bind;
> > > +		u32 version;
> > > +		int data_size;
> > > +		void *gbind_data;
> > > +		int ret;
> > > +
> > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > > +
> > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > +			return -EFAULT;
> > > +
> > > +		if (bind.argsz < minsz)
> > > +			return -EINVAL;
> > > +
> > > +		/* Get the version of struct iommu_gpasid_bind_data */
> > > +		if (copy_from_user(&version,
> > > +			(void __user *) (arg + minsz),
> > > +					sizeof(version)))
> > > +			return -EFAULT;  
> > 
> > > Why are we copying things from beyond the size we've validated that the
> > user has provided again?  
> 
> let me wait for the result in Jacob's thread below. looks like need
> > to have a decision from you and Joerg. If using argsz is good, then
> I guess we don't need the version-to-size mapping. right? Actually,
> the version-to-size mapping is added to ensure vfio copy data correctly.
> https://lkml.org/lkml/2020/4/2/876
> 
> > > +
> > > +		data_size = iommu_uapi_get_data_size(
> > > +				IOMMU_UAPI_BIND_GPASID, version);
> > > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > > +		if (!gbind_data)
> > > +			return -ENOMEM;
> > > +
> > > +		if (copy_from_user(gbind_data,
> > > +			 (void __user *) (arg + minsz), data_size)) {
> > > +			kfree(gbind_data);
> > > +			return -EFAULT;
> > > +		}  
> > 
> > And again.  argsz isn't just for minsz.
> >  
> > > +
> > > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > > +							   gbind_data);
> > > +			break;
> > > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > > +							     gbind_data);
> > > +			break;
> > > +		default:
> > > +			ret = -EINVAL;
> > > +			break;
> > > +		}
> > > +		kfree(gbind_data);
> > > +		return ret;
> > >  	}
> > >
> > >  	return -ENOTTY;
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index ebeaf3e..2235bc6 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -14,6 +14,7 @@
> > >
> > >  #include <linux/types.h>
> > >  #include <linux/ioctl.h>
> > > +#include <linux/iommu.h>
> > >
> > >  #define VFIO_API_VERSION	0
> > >
> > > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> > >   */
> > >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
> > >
> > > +/**
> > > + * Supported flags:
> > > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
> > > + *			nesting type IOMMUs. In @data field It takes struct
> > > + *			iommu_gpasid_bind_data.
> > > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table  
> > operation  
> > > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.  
> > 
> > This must require iommu_gpasid_bind_data in the data field as well,
> > right?  
> 
> yes.
> 
> Regards,
> Yi Liu
> 



* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-03 18:11       ` Alex Williamson
@ 2020-04-04 10:28         ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-04 10:28 UTC (permalink / raw)
  To: Alex Williamson, jean-philippe
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 4, 2020 2:11 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> On Fri, 3 Apr 2020 13:30:49 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 3:57 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > >
> > > On Sun, 22 Mar 2020 05:32:03 -0700
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
> > > > hardware IOMMUs that have nesting DMA translation (a.k.a dual
> > > > stage address translation). For such hardware IOMMUs, there are
> > > > two stages/levels of address translation, and software may let
> > > > userspace/VM to own the first-
> > > > level/stage-1 translation structures. Example of such usage is
> > > > vSVA ( virtual Shared Virtual Addressing). VM owns the
> > > > first-level/stage-1 translation structures and bind the structures
> > > > to host, then hardware IOMMU would utilize nesting translation
> > > > > when doing DMA translation for the devices behind such hardware IOMMU.
> > > >
> > > > This patch adds vfio support for binding guest translation (a.k.a
> > > > stage 1) structure to host iommu. And for
> > > > VFIO_TYPE1_NESTING_IOMMU, not only bind guest page table is
> > > > needed, it also requires to expose interface to guest for iommu
> > > > cache invalidation when guest modified the first-level/stage-1
> > > > > translation structures since hardware needs to be notified to
> > > > > flush stale iotlbs. This would be introduced in next patch.
> > > >
> > > > In this patch, guest page table bind and unbind are done by using
> > > > flags VFIO_IOMMU_BIND_GUEST_PGTBL and
> > > > > VFIO_IOMMU_UNBIND_GUEST_PGTBL under IOCTL VFIO_IOMMU_BIND, the
> > > > > bind/unbind data are conveyed by struct iommu_gpasid_bind_data.
> > > > > Before binding guest page table to host, VM should have got a
> > > > > PASID allocated by host via VFIO_IOMMU_PASID_REQUEST.
> > > >
> > > > Bind guest translation structures (here is guest page table) to
> > > > host are the first step to setup vSVA (Virtual Shared Virtual Addressing).
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.com>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 158 ++++++++++++++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h       |  46 ++++++++++++
> > > >  2 files changed, 204 insertions(+)
> > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index 82a9e0b..a877747 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -130,6 +130,33 @@ struct vfio_regions {
> > > >  #define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> > > >  					(!list_empty(&iommu->domain_list))
> > > >
> > > > +struct domain_capsule {
> > > > +	struct iommu_domain *domain;
> > > > +	void *data;
> > > > +};
> > > > +
> > > > +/* iommu->lock must be held */
> > > > +static int vfio_iommu_for_each_dev(struct vfio_iommu *iommu,
> > > > +		      int (*fn)(struct device *dev, void *data),
> > > > +		      void *data)
> > > > +{
> > > > +	struct domain_capsule dc = {.data = data};
> > > > +	struct vfio_domain *d;
> > > > +	struct vfio_group *g;
> > > > +	int ret = 0;
> > > > +
> > > > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > > > +		dc.domain = d->domain;
> > > > +		list_for_each_entry(g, &d->group_list, next) {
> > > > +			ret = iommu_group_for_each_dev(g->iommu_group,
> > > > +						       &dc, fn);
> > > > +			if (ret)
> > > > +				break;
> > > > +		}
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static int put_pfn(unsigned long pfn, int prot);
> > > >
> > > >  /*
> > > > > @@ -2314,6 +2341,88 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > > >  	return 0;
> > > >  }
> > > >
> > > > > +static int vfio_bind_gpasid_fn(struct device *dev, void *data)
> > > > > +{
> > > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > > > +
> > > > > +	return iommu_sva_bind_gpasid(dc->domain, dev, gbind_data);
> > > > > +}
> > > > +
> > > > +static int vfio_unbind_gpasid_fn(struct device *dev, void *data)
> > > > +{
> > > > +	struct domain_capsule *dc = (struct domain_capsule *)data;
> > > > +	struct iommu_gpasid_bind_data *gbind_data =
> > > > +		(struct iommu_gpasid_bind_data *) dc->data;
> > > > +
> > > > +	return iommu_sva_unbind_gpasid(dc->domain, dev,
> > > > +					gbind_data->hpasid);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Unbind specific gpasid, caller of this function requires hold
> > > > + * vfio_iommu->lock
> > > > + */
> > > > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > > > +				struct iommu_gpasid_bind_data *gbind_data) {
> > > > +	return vfio_iommu_for_each_dev(iommu,
> > > > +				vfio_unbind_gpasid_fn, gbind_data); }
> > > > +
> > > > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > > > +				struct iommu_gpasid_bind_data *gbind_data) {
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = vfio_iommu_for_each_dev(iommu,
> > > > +			vfio_bind_gpasid_fn, gbind_data);
> > > > +	/*
> > > > +	 * If bind failed, it may not be a total failure. Some devices
> > > > +	 * within the iommu group may have bind successfully. Although
> > > > +	 * we don't enable pasid capability for non-singletion iommu
> > > > +	 * groups, a unbind operation would be helpful to ensure no
> > > > +	 * partial binding for an iommu group.
> > >
> > > Where was the non-singleton group restriction done, I missed that.
> >
> > Hmm, it's missed. thanks for spotting it. How about adding this check
> > in the vfio_iommu_for_each_dev()? If looped a non-singleton group,
> > just skip it. It applies to the cache_inv path all the same.
> 
> I don't really understand the singleton issue, which is why I was
> surprised to see this since I didn't see a discussion previously.
> Skipping a non-singleton group seems like unpredictable behavior to
> the user though.

There is a discussion of SVA availability at the link below. It
concluded that SVA should only be supported for singleton groups. I
think binding guest page tables needs to follow the same rule.
https://patchwork.kernel.org/patch/10213877/

> > > > +	 */
> > > > +	if (ret)
> > > > +		/*
> > > > +		 * Undo all binds that already succeeded, no need to
> > > > +		 * check the return value here since some device within
> > > > +		 * the group has no successful bind when coming to this
> > > > +		 * place switch.
> > > > +		 */
> > > > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > >
> > > However, the for_each_dev function stops when the callback function
> > > returns error, are we just assuming we stop at the same device as we
> > > faulted on the first time and that we traverse the same set of
> > > devices the second time?  It seems strange to me that unbind should
> > > be able to fail.
> >
> > unbind can fail if a user attempts to unbind a pasid which is not
> > belonged to it or a pasid which hasn't ever been bound. Otherwise, I
> > didn't see a reason to fail.
> 
> Even if so, this doesn't address the first part of the question.  If
> our for_each_dev() callback returns error then the loop stops and we
> can't be sure we've triggered it everywhere that it needs to be
> triggered.

Hmm, let me step back a little. In the code, the bind attempt uses
for_each_dev() to loop over the devices, and on failure it uses
for_each_dev() again to unbind. Is your question whether the second
for_each_dev() can undo the binds correctly, given that it has no idea
where the bind phase failed? That is actually why I added the comment
that there is no need to check the return value of
vfio_iommu_type1_do_guest_unbind().

> There are also aspects of whether it's an error to unbind something
> that is not bound because the result is still that the pasid is
> unbound, right?

Agreed. As you mention in the comment below, there is no need to fail
unbind unless the user is trying to unbind a pasid which doesn't belong
to it.
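
A rough sketch of the semantics I understand from this, in userspace C.
The table, the owner field and the -EPERM return code are only my
assumptions for illustration, not what the iommu layer does today.
Unbinding a pasid the caller owns always succeeds, even if it was never
bound, and only a pasid owned by someone else is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping: owner 0 means the pasid is unallocated. */
struct pasid_slot {
	int owner;
	bool bound;
};

/*
 * Unbind @pasid on behalf of @caller. The error code is an assumption
 * made for this sketch.
 */
static int unbind_pasid(struct pasid_slot *tbl, size_t n, uint32_t pasid,
			int caller)
{
	/* Reject pasids outside the table or owned by another user. */
	if (pasid >= n || tbl[pasid].owner != caller)
		return -EPERM;

	/* Already unbound is fine: the end state is what was asked for. */
	tbl[pasid].bound = false;
	return 0;
}
```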

> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +static long vfio_iommu_type1_unbind_gpasid(struct vfio_iommu *iommu,
> > > > +				struct iommu_gpasid_bind_data *gbind_data) {
> > > > +	int ret = 0;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > > > +		ret = -EINVAL;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > > +	ret = vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> > >
> > > How is a user supposed to respond to their unbind failing?
> >
> > If it's a malicious unbind (e.g. unbind a not yet bound pasid or
> > unbind a pasid which doesn't belong to current user).
> 
> And if it's not a malicious unbind?  To me this is similar semantics to
> free() failing.  Is there any remedy other than to abort?  Thanks,

Got it. So if the user is trying to unbind a pasid which doesn't belong
to it, should the kernel return an error to the user or just abort?

Regards,
Yi Liu

> Alex
> 
> > > > +
> > > > +out_unlock:
> > > > +	mutex_unlock(&iommu->lock);
> > > > +	return ret;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)  { @@ -
> 2471,6
> > > > +2580,55 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > > >  		default:
> > > >  			return -EINVAL;
> > > >  		}
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > > +		struct vfio_iommu_type1_bind bind;
> > > > +		u32 version;
> > > > +		int data_size;
> > > > +		void *gbind_data;
> > > > +		int ret;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_bind, flags);
> > > > +
> > > > +		if (copy_from_user(&bind, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (bind.argsz < minsz)
> > > > +			return -EINVAL;
> > > > +
> > > > +		/* Get the version of struct iommu_gpasid_bind_data */
> > > > +		if (copy_from_user(&version,
> > > > +			(void __user *) (arg + minsz),
> > > > +					sizeof(version)))
> > > > +			return -EFAULT;
> > >
> > > > Why are we copying things from beyond the size we've validated that
> > > the user has provided again?
> >
> > let me wait for the result in Jacob's thread below. looks like need to
> > > have a decision from you and Joerg. If using argsz is good, then I
> > guess we don't need the version-to-size mapping. right? Actually, the
> > version-to-size mapping is added to ensure vfio copy data correctly.
> > https://lkml.org/lkml/2020/4/2/876
> >
> > > > +
> > > > +		data_size = iommu_uapi_get_data_size(
> > > > +				IOMMU_UAPI_BIND_GPASID, version);
> > > > +		gbind_data = kzalloc(data_size, GFP_KERNEL);
> > > > +		if (!gbind_data)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		if (copy_from_user(gbind_data,
> > > > +			 (void __user *) (arg + minsz), data_size)) {
> > > > +			kfree(gbind_data);
> > > > +			return -EFAULT;
> > > > +		}
> > >
> > > And again.  argsz isn't just for minsz.
> > >
> > > > +
> > > > +		switch (bind.flags & VFIO_IOMMU_BIND_MASK) {
> > > > +		case VFIO_IOMMU_BIND_GUEST_PGTBL:
> > > > +			ret = vfio_iommu_type1_bind_gpasid(iommu,
> > > > +							   gbind_data);
> > > > +			break;
> > > > +		case VFIO_IOMMU_UNBIND_GUEST_PGTBL:
> > > > +			ret = vfio_iommu_type1_unbind_gpasid(iommu,
> > > > +							     gbind_data);
> > > > +			break;
> > > > +		default:
> > > > +			ret = -EINVAL;
> > > > +			break;
> > > > +		}
> > > > +		kfree(gbind_data);
> > > > +		return ret;
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index ebeaf3e..2235bc6 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -14,6 +14,7 @@
> > > >
> > > >  #include <linux/types.h>
> > > >  #include <linux/ioctl.h>
> > > > +#include <linux/iommu.h>
> > > >
> > > >  #define VFIO_API_VERSION	0
> > > >
> > > > @@ -853,6 +854,51 @@ struct vfio_iommu_type1_pasid_request {
> > > >   */
> > > >  #define VFIO_IOMMU_PASID_REQUEST	_IO(VFIO_TYPE, VFIO_BASE + 22)
> > > >
> > > > +/**
> > > > + * Supported flags:
> > > > + *	- VFIO_IOMMU_BIND_GUEST_PGTBL: bind guest page tables to host for
> > > > + *			nesting type IOMMUs. In @data field It takes struct
> > > > + *			iommu_gpasid_bind_data.
> > > > + *	- VFIO_IOMMU_UNBIND_GUEST_PGTBL: undo a bind guest page table operation
> > > > + *			invoked by VFIO_IOMMU_BIND_GUEST_PGTBL.
> > >
> > > This must require iommu_gpasid_bind_data in the data field as well,
> > > right?
> >
> > yes.
> >
> > Regards,
> > Yi Liu
> >


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace
  2020-04-03 17:28       ` Alex Williamson
@ 2020-04-04 11:36         ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-04 11:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 4, 2020 1:28 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Cc: eric.auger@redhat.com; Tian, Kevin <kevin.tian@intel.com>;
> jacob.jun.pan@linux.intel.com; joro@8bytes.org; Raj, Ashok <ashok.raj@intel.com>;
> Tian, Jun J <jun.j.tian@intel.com>; Sun, Yi Y <yi.y.sun@intel.com>; jean-
> philippe@linaro.org; peterx@redhat.com; iommu@lists.linux-foundation.org;
> kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Wu, Hao <hao.wu@intel.com>
> Subject: Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to
> userspace
> 
> On Fri, 3 Apr 2020 08:17:44 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 2:01 AM
> > > To: Liu, Yi L <yi.l.liu@intel.com>
> > > Subject: Re: [PATCH v1 3/8] vfio/type1: Report PASID alloc/free
> > > support to userspace
> > >
> > > On Sun, 22 Mar 2020 05:32:00 -0700
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > This patch reports PASID alloc/free availability to userspace (e.g.
> > > > QEMU) thus userspace could do a pre-check before utilizing this feature.
> > > >
> > > > Cc: Kevin Tian <kevin.tian@intel.com>
> > > > CC: Jacob Pan <jacob.jun.pan@linux.intel.com>
> > > > Cc: Alex Williamson <alex.williamson@redhat.com>
> > > > Cc: Eric Auger <eric.auger@redhat.com>
> > > > Cc: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > > Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> > > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 28 ++++++++++++++++++++++++++++
> > > >  include/uapi/linux/vfio.h       |  8 ++++++++
> > > >  2 files changed, 36 insertions(+)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index e40afc0..ddd1ffe 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -2234,6 +2234,30 @@ static int vfio_iommu_type1_pasid_free(struct vfio_iommu *iommu,
> > > >  	return ret;
> > > >  }
> > > >
> > > > +static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > > > +					 struct vfio_info_cap *caps)
> > > > +{
> > > > +	struct vfio_info_cap_header *header;
> > > > +	struct vfio_iommu_type1_info_cap_nesting *nesting_cap;
> > > > +
> > > > +	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > > > +				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > > > +	if (IS_ERR(header))
> > > > +		return PTR_ERR(header);
> > > > +
> > > > +	nesting_cap = container_of(header,
> > > > +				struct vfio_iommu_type1_info_cap_nesting,
> > > > +				header);
> > > > +
> > > > +	nesting_cap->nesting_capabilities = 0;
> > > > +	if (iommu->nesting) {
> > > > +		/* nesting iommu type supports PASID requests (alloc/free) */
> > > > +		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2283,6 +2307,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  		if (ret)
> > > >  			return ret;
> > > >
> > > > +		ret = vfio_iommu_info_add_nesting_cap(iommu, &caps);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +
> > > >  		if (caps.size) {
> > > >  			info.flags |= VFIO_IOMMU_INFO_CAPS;
> > > >
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 298ac80..8837219 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -748,6 +748,14 @@ struct vfio_iommu_type1_info_cap_iova_range {
> > > >  	struct	vfio_iova_range iova_ranges[];
> > > >  };
> > > >
> > > > +#define VFIO_IOMMU_TYPE1_INFO_CAP_NESTING  2
> > > > +
> > > > +struct vfio_iommu_type1_info_cap_nesting {
> > > > +	struct	vfio_info_cap_header header;
> > > > +#define VFIO_IOMMU_PASID_REQS	(1 << 0)
> > > > +	__u32	nesting_capabilities;
> > > > +};
> > > > +
> > > >  #define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
> > > >
> > > >  /**
> > >
> > > I think this answers my PROBE question on patch 1/.
> > yep.
> > > Should the quota/usage be exposed to the user here?  Thanks,
> >
> > Do you mean report the quota available for this user in this cap info as well?
> 
> Yes.  Would it be useful?

I think so.

> > For usage, do you mean the alloc and free or others?
> 
> I mean how much of the quota is currently allocated, or alternatively, how
> much remains.  Thanks,

ok, got it, maybe report the remaining quota. thanks.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03 15:14       ` Alex Williamson
@ 2020-04-07  4:42         ` Tian, Kevin
  2020-04-07 15:14           ` Alex Williamson
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-07  4:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

> From: Alex Williamson
> Sent: Friday, April 3, 2020 11:14 PM
> 
> On Fri, 3 Apr 2020 05:58:55 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 1:50 AM
> > >
> > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
> > > > For a long time, devices have only one DMA address space from
> platform
> > > > IOMMU's point of view. This is true for both bare metal and directed-
> > > > access in virtualization environment. Reason is the source ID of DMA in
> > > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > > DMA isolation. However, this is changing with the latest advancement in
> > > > I/O technology area. More and more platform vendors are utilizing the
> > > PCIe
> > > > PASID TLP prefix in DMA requests, thus to give devices with multiple
> DMA
> > > > address spaces as identified by their individual PASIDs. For example,
> > > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > > let device access multiple process virtual address space by binding the
> > > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > > software and programmed to device per device specific manner.
> Devices
> > > > which support PASID capability are called PASID-capable devices. If such
> > > > devices are passed through to VMs, guest software are also able to bind
> > > > guest process virtual address space on such devices. Therefore, the
> guest
> > > > software could reuse the bare metal software programming model,
> which
> > > > means guest software will also allocate PASID and program it to device
> > > > directly. This is a dangerous situation since it has potential PASID
> > > > conflicts and unauthorized address space access. It would be safer to
> > > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > > are managed system-wide.
> > >
> > > Providing an allocation interface only allows for collaborative usage
> > > of PASIDs though.  Do we have any ability to enforce PASID usage or can
> > > a user spoof other PASIDs on the same BDF?
> >
> > A user can access only PASIDs allocated to itself, i.e. the specific IOASID
> > set tied to its mm_struct.
> 
> A user is only _supposed_ to access PASIDs allocated to itself.  AIUI
> the mm_struct is used for managing the pool of IOASIDs from which the
> user may allocate that PASID.  We also state that programming the PASID
> into the device is device specific.  Therefore, are we simply trusting
> the user to use a PASID that's been allocated to them when they program
> the device?  If a user can program an arbitrary PASID into the device,
> then what prevents them from attempting to access data from another
> user via the device?   I think I've asked this question before, so if
> there's a previous explanation or spec section I need to review, please
> point me to it.  Thanks,
> 

There are two scenarios:

(1) for PF/VF, the iommu driver maintains an individual PASID table per
BDF. Although the PASID namespace is global, the per-BDF PASID table
contains valid entries only for those PASIDs which are allocated to the
mm_struct. The user is free to program an arbitrary PASID into the assigned
device, but using an invalid PASID simply hits an iommu fault.

(2) for mdev, multiple mdev instances share the same PASID table of
the parent BDF. However, PASID programming is a privileged operation
in multiplexing usage, and thus must be mediated by the mdev device driver.
The mediation logic will guarantee that only allocated PASIDs are
forwarded to the device.
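
As a rough illustration of the mediation in (2), here is a minimal
userspace sketch. All names (struct tenant, mediate_pasid_write(), etc.)
are hypothetical and not taken from any mdev driver; the point is only
that a PASID is forwarded to the shared parent device if and only if it
was allocated to the tenant doing the programming.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PASIDS 8	/* tiny table, for illustration only */

/* Hypothetical per-tenant record of PASIDs obtained via the alloc interface. */
struct tenant {
	uint32_t pasids[MAX_PASIDS];
	unsigned int count;
};

/* Record a PASID as allocated to this tenant. */
static int tenant_add_pasid(struct tenant *t, uint32_t pasid)
{
	if (t->count >= MAX_PASIDS)
		return -ENOSPC;
	t->pasids[t->count++] = pasid;
	return 0;
}

static bool tenant_owns_pasid(const struct tenant *t, uint32_t pasid)
{
	unsigned int i;

	for (i = 0; i < t->count; i++)
		if (t->pasids[i] == pasid)
			return true;
	return false;
}

/*
 * Mediation hook (assumed): called when a tenant tries to program a PASID
 * into the shared parent device. *hw_pasid stands in for the device
 * register and is written only when the tenant owns the PASID.
 */
static int mediate_pasid_write(const struct tenant *t, uint32_t pasid,
			       uint32_t *hw_pasid)
{
	assert(t && hw_pasid);
	if (!tenant_owns_pasid(t, pasid))
		return -EPERM;
	*hw_pasid = pasid;
	return 0;
}
```

With this shape, a tenant that tries to program a PASID belonging to
another tenant gets an error and the device state is left untouched.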

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03 17:50       ` Alex Williamson
@ 2020-04-07  4:52         ` Tian, Kevin
  2020-04-08  0:52         ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Tian, Kevin @ 2020-04-07  4:52 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: jean-philippe, Raj, Ashok, kvm, Tian, Jun J, iommu, linux-kernel,
	Sun, Yi Y, Wu, Hao

> From: Alex Williamson
> Sent: Saturday, April 4, 2020 1:50 AM
[...]
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 9e843a1..298ac80 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -794,6 +794,47 @@ struct vfio_iommu_type1_dma_unmap {
> > > >  #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
> > > >  #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
> > > >
> > > > +/*
> > > > + * PASID (Process Address Space ID) is a PCIe concept which
> > > > + * has been extended to support DMA isolation in fine-grain.
> > > > + * With device assigned to user space (e.g. VMs), PASID alloc
> > > > + * and free need to be system wide. This structure defines
> > > > + * the info for pasid alloc/free between user space and kernel
> > > > + * space.
> > > > + *
> > > > + * @flag=VFIO_IOMMU_PASID_ALLOC, refer to the @alloc_pasid
> > > > + * @flag=VFIO_IOMMU_PASID_FREE, refer to @free_pasid
> > > > + */
> > > > +struct vfio_iommu_type1_pasid_request {
> > > > +	__u32	argsz;
> > > > +#define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > > > +#define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > > > +	__u32	flags;
> > > > +	union {
> > > > +		struct {
> > > > +			__u32 min;
> > > > +			__u32 max;
> > > > +			__u32 result;
> > > > +		} alloc_pasid;
> > > > +		__u32 free_pasid;
> > > > +	};
> > >
> > > We seem to be using __u8 data[] lately where the struct at data is
> > > defined by the flags.  should we do that here?
> >
> > yeah, I can do that. BTW, do you want the structure in the
> > later patch to share the same layout as this one? As far as I can foresee,
> > the two structures would look similar, as both of them include
> > argsz, flags and data[] fields. The difference is the definition of
> > flags. What is your opinion?
> >
> > struct vfio_iommu_type1_pasid_request {
> > 	__u32	argsz;
> > #define VFIO_IOMMU_PASID_ALLOC	(1 << 0)
> > #define VFIO_IOMMU_PASID_FREE	(1 << 1)
> > 	__u32	flags;
> > 	__u8	data[];
> > };
> >
> > struct vfio_iommu_type1_bind {
> >         __u32           argsz;
> >         __u32           flags;
> > #define VFIO_IOMMU_BIND_GUEST_PGTBL     (1 << 0)
> > #define VFIO_IOMMU_UNBIND_GUEST_PGTBL   (1 << 1)
> >         __u8            data[];
> > };
> 
> 
> Yes, I was even wondering the same for the cache invalidate ioctl, or
> whether this is going too far for a general purpose "everything related
> to PASIDs" ioctl.  We need to factor usability into the equation too.
> I'd be interested in opinions from others here too.  Clearly I don't
> like single use, throw-away ioctls, but I can find myself on either
> side of the argument that allocation, binding, and invalidating are all
> within the domain of PASIDs and could fall within a single ioctl or
> they each represent different facets of managing PASIDs and should have
> separate ioctls.  Thanks,
> 

Looking at uapi/linux/iommu.h:

* Invalidations by %IOMMU_INV_GRANU_DOMAIN don't take any argument other than
 * @version and @cache.

Although intel-iommu handles only PASID-related invalidation now, I
suppose other vendors (or future usages?) may allow non-pasid
based invalidation too based on above comment. 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-03  8:23         ` Jean-Philippe Brucker
@ 2020-04-07  9:43           ` Liu, Yi L
  2020-04-08  1:02             ` Liu, Yi L
  2020-04-08 10:27             ` Auger Eric
  0 siblings, 2 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-07  9:43 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric
  Cc: alex.williamson, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok,
	Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm, linux-kernel, Wu,
	Hao

Hi Jean,

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Friday, April 3, 2020 4:23 PM
> To: Auger Eric <eric.auger@redhat.com>
> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
> > >>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > >>>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > >>> @@ -2254,6 +2309,7 @@ static int vfio_iommu_info_add_nesting_cap(struct vfio_iommu *iommu,
> > >>>  		/* nesting iommu type supports PASID requests (alloc/free) */
> > >>>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> > >> What is the meaning for ARM?
> > >
> > > I think it's just a software capability exposed to userspace, on
> > > userspace side, it has a choice to use it or not. :-) The reason
> > > define it and report it in cap nesting is that I'd like to make the
> > > pasid alloc/free be available just for IOMMU with type
> > > VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not good
> > > for ARM. We can find a proper way to report the availability.
> >
> > Well it is more a question for jean-Philippe. Do we have a system wide
> > PASID allocation on ARM?
> 
> We don't, the PASID spaces are per-VM on Arm, so this function should consult the
> IOMMU driver before setting flags. As you said on patch 3, nested doesn't
> necessarily imply PASID support. The SMMUv2 does not support PASID but does
> support nesting stages 1 and 2 for the IOVA space.
> SMMUv3 support of PASID depends on HW capabilities. So I think this needs to be
> finer grained:
> 
> Does the container support:
> * VFIO_IOMMU_PASID_REQUEST?
>   -> Yes for VT-d 3
>   -> No for Arm SMMU
> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
>   -> Yes for VT-d 3
>   -> Sometimes for SMMUv2
>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
>      PASID tables being in GPA space.)
> * VFIO_IOMMU_BIND_PASID_TABLE?
>   -> No for VT-d
>   -> Sometimes for SMMUv3
> 
> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.

good summary. do you expect to see any 

> 
> > >>> +	nesting_cap->stage1_formats = formats;
> > >> as spotted by Kevin, since a single format is supported, rename
> > >
> > > ok, I was believing it may be possible on ARM or so. :-) will rename
> > > it.
> 
> Yes I don't think an u32 is going to cut it for Arm :( We need to describe all sorts of
> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size, HW
> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean much. I
> guess we could have a secondary vendor capability for these?

Actually, I'm wondering if we can define some formats that each stand for a
set of capabilities, e.g. VTD_STAGE1_FORMAT_V1, which may indicate the 1st
level page table related caps (aw, a/d, SRE, EA, etc.). The vIOMMU can then
parse the capabilities.
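
To make the idea concrete, a minimal sketch of such a format-to-caps
lookup could look like the code below. Everything here is an assumption
for illustration: the S1_* names, the bit values and the set of caps
attached to the V1 format are made up and are not part of any uapi.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits; names and values are illustrative only. */
#define S1_CAP_AW_48	(1u << 0)	/* "aw": 48-bit width chosen arbitrarily */
#define S1_CAP_AD	(1u << 1)	/* "a/d": access/dirty support */
#define S1_CAP_SRE	(1u << 2)	/* "SRE" as named in the mail */
#define S1_CAP_EA	(1u << 3)	/* "EA" as named in the mail */

/* Hypothetical format IDs as they might be reported in the nesting cap. */
enum stage1_format {
	S1_FORMAT_NONE = 0,
	S1_FORMAT_VTD_V1 = 1,
};

struct stage1_format_desc {
	enum stage1_format format;
	uint32_t caps;
};

/* One format ID stands for a fixed set of capabilities (assumed set). */
static const struct stage1_format_desc stage1_formats[] = {
	{ S1_FORMAT_VTD_V1, S1_CAP_AW_48 | S1_CAP_AD | S1_CAP_SRE | S1_CAP_EA },
};

/* What a vIOMMU might do: expand the reported format into its caps. */
static int stage1_format_to_caps(enum stage1_format format, uint32_t *caps)
{
	size_t i;

	assert(caps);
	for (i = 0; i < sizeof(stage1_formats) / sizeof(stage1_formats[0]); i++) {
		if (stage1_formats[i].format == format) {
			*caps = stage1_formats[i].caps;
			return 0;
		}
	}
	return -ENOENT;	/* unknown format: caller must not guess */
}
```

The trade-off is that any new combination of caps needs a new format ID,
which is probably the concern Jean raised for Arm.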

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-03  8:34           ` Jean-Philippe Brucker
@ 2020-04-07 10:33             ` Liu, Yi L
  2020-04-09  8:28               ` Jean-Philippe Brucker
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-07 10:33 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, alex.williamson, eric.auger, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Jean,

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Friday, April 3, 2020 4:35 PM
> Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> On Thu, Apr 02, 2020 at 08:05:29AM +0000, Liu, Yi L wrote:
> > > > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > > >  		default:
> > > > > >  			return -EINVAL;
> > > > > >  		}
> > > > > > +
> > > > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > > >
> > > > > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> > > >
> > > > Emm, it's up to the flags to indicate bind what. It was proposed to
> > > > cover the three cases below:
> > > > a) BIND/UNBIND_GPASID
> > > > b) BIND/UNBIND_GPASID_TABLE
> > > > c) BIND/UNBIND_PROCESS
> > > > <only a) is covered in this patch>
> > > > So it's called VFIO_IOMMU_BIND.
> > >
> > > but aren't they all about PASID related binding?
> >
> > yeah, I can rename it. :-)
> 
> I don't know if anyone intends to implement it, but SMMUv2 supports
> nesting translation without any PASID support. For that case the name
> VFIO_IOMMU_BIND_GUEST_PGTBL without "PASID" anywhere makes more sense.
> Ideally we'd also use a neutral name for the IOMMU API instead of
> bind_gpasid(), but that's easier to change later.

I agree VFIO_IOMMU_BIND is not very straightforward. In particular, it may
cause confusion when thinking about VFIO_SET_IOMMU. How about using
VFIO_NESTING_IOMMU_BIND_STAGE1 to cover a) and b), and having a separate
VFIO_BIND_PROCESS in the future for the SVA bind case?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-07  4:42         ` Tian, Kevin
@ 2020-04-07 15:14           ` Alex Williamson
  0 siblings, 0 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-07 15:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Tue, 7 Apr 2020 04:42:02 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Friday, April 3, 2020 11:14 PM
> > 
> > On Fri, 3 Apr 2020 05:58:55 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 3, 2020 1:50 AM
> > > >
> > > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > > >
> > > > > For a long time, devices have only one DMA address space from  
> > platform  
> > > > > IOMMU's point of view. This is true for both bare metal and directed-
> > > > > access in virtualization environment. Reason is the source ID of DMA in
> > > > > PCIe are BDF (bus/dev/fnc ID), which results in only device granularity
> > > > > DMA isolation. However, this is changing with the latest advancement in
> > > > > I/O technology area. More and more platform vendors are utilizing the  
> > > > PCIe  
> > > > > PASID TLP prefix in DMA requests, thus to give devices with multiple  
> > DMA  
> > > > > address spaces as identified by their individual PASIDs. For example,
> > > > > Shared Virtual Addressing (SVA, a.k.a Shared Virtual Memory) is able to
> > > > > let device access multiple process virtual address space by binding the
> > > > > virtual address space with a PASID. Wherein the PASID is allocated in
> > > > > software and programmed to device per device specific manner.  
> > Devices  
> > > > > which support PASID capability are called PASID-capable devices. If such
> > > > > devices are passed through to VMs, guest software are also able to bind
> > > > > guest process virtual address space on such devices. Therefore, the  
> > guest  
> > > > > software could reuse the bare metal software programming model,  
> > which  
> > > > > means guest software will also allocate PASID and program it to device
> > > > > directly. This is a dangerous situation since it has potential PASID
> > > > > conflicts and unauthorized address space access. It would be safer to
> > > > > let host intercept in the guest software's PASID allocation. Thus PASID
> > > > > are managed system-wide.  
> > > >
> > > > Providing an allocation interface only allows for collaborative usage
> > > > of PASIDs though.  Do we have any ability to enforce PASID usage or can
> > > > a user spoof other PASIDs on the same BDF?  
> > >
> > > > A user can access only PASIDs allocated to itself, i.e. the specific IOASID
> > > set tied to its mm_struct.  
> > 
> > A user is only _supposed_ to access PASIDs allocated to itself.  AIUI
> > the mm_struct is used for managing the pool of IOASIDs from which the
> > user may allocate that PASID.  We also state that programming the PASID
> > into the device is device specific.  Therefore, are we simply trusting
> > the user to use a PASID that's been allocated to them when they program
> > the device?  If a user can program an arbitrary PASID into the device,
> > then what prevents them from attempting to access data from another
> > user via the device?   I think I've asked this question before, so if
> > there's a previous explanation or spec section I need to review, please
> > point me to it.  Thanks,
> >   
> 
> There are two scenarios:
> 
> (1) for PF/VF, the iommu driver maintains an individual PASID table per
> BDF. Although the PASID namespace is global, the per-BDF PASID table
> contains valid entries only for those PASIDs which are allocated to the
> mm_struct. The user is free to program an arbitrary PASID into the assigned
> device, but using an invalid PASID simply hits an iommu fault.
> 
> (2) for mdev, multiple mdev instances share the same PASID table of
> the parent BDF. However, PASID programming is a privileged operation
> in multiplexing usage, thus must be mediated by mdev device driver. 
> The mediation logic will guarantee that only allocated PASIDs are 
> forwarded to the device. 

Thanks, I was confused about multiple tenants sharing a BDF when PASID
programming to the device is device specific, and therefore not
something we can virtualize.  However, the solution is device specific
virtualization via mdev.  Thus, any time we're sharing a BDF between
tenants, we must virtualize the PASID programming and therefore it must
be an mdev device currently.  If a tenant is the exclusive user of the
BDF, then no virtualization of the PASID programming is required.  I
think it's clear now (again).  Thanks,

Alex


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
  2020-04-03 17:50       ` Alex Williamson
  2020-04-07  4:52         ` Tian, Kevin
@ 2020-04-08  0:52         ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-08  0:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 4, 2020 1:50 AM
> Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> 
> On Fri, 3 Apr 2020 13:12:50 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 1:50 AM
> > > Subject: Re: [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free)
> > >
> > > On Sun, 22 Mar 2020 05:31:58 -0700
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > From: Liu Yi L <yi.l.liu@intel.com>
> > > >
[...]
> > > >  static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > >  				   unsigned int cmd, unsigned long arg)
> > > >  {
> > > > @@ -2276,6 +2333,53 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > > >
> > > >  		return copy_to_user((void __user *)arg, &unmap, minsz) ?
> > > >  			-EFAULT : 0;
> > > > +
> > > > +	} else if (cmd == VFIO_IOMMU_PASID_REQUEST) {
> > > > +		struct vfio_iommu_type1_pasid_request req;
> > > > +		unsigned long offset;
> > > > +
> > > > +		minsz = offsetofend(struct vfio_iommu_type1_pasid_request,
> > > > +				    flags);
> > > > +
> > > > +		if (copy_from_user(&req, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (req.argsz < minsz ||
> > > > +		    !vfio_iommu_type1_pasid_req_valid(req.flags))
> > > > +			return -EINVAL;
> > > > +
> > > > +		if (copy_from_user((void *)&req + minsz,
> > > > +				   (void __user *)arg + minsz,
> > > > +				   sizeof(req) - minsz))
> > > > +			return -EFAULT;
> > >
> > > Huh?  Why do we have argsz if we're going to assume this is here?
> >
> > do you mean replacing sizeof(req) with argsz? if yes, I can do that.
> 
> No, I mean the user tells us how much data they've provided via argsz.
> We compute minsz up to the end of flags and verify argsz includes flags.
> Then we proceed to ignore argsz to see if the user has provided the
> remainder of the structure.

I think I should avoid using sizeof(req), as it may change when a
new flag is added. I think it is better to add a data[] field to struct
vfio_iommu_type1_pasid_request and copy data[] per flag. I'll
make this change in the new version.
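
A minimal userspace sketch of the argsz handling being asked for is
below. It is not the actual patch: memcpy() stands in for
copy_from_user(), and the struct, flag and function names are
hypothetical. It shows the header being copied first and the per-flag
payload being copied only after argsz is checked to cover it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; the per-flag payload follows it in the user buffer. */
struct pasid_request_hdr {
	uint32_t argsz;
	uint32_t flags;
};

#define PASID_ALLOC	(1u << 0)
#define PASID_FREE	(1u << 1)

struct pasid_alloc_data {
	uint32_t min;
	uint32_t max;
	uint32_t result;
};

/* Payload size implied by flags; 0 unless exactly one known op is set. */
static size_t payload_size(uint32_t flags)
{
	switch (flags) {
	case PASID_ALLOC:
		return sizeof(struct pasid_alloc_data);
	case PASID_FREE:
		return sizeof(uint32_t);
	default:
		return 0;
	}
}

/*
 * Copy the header, validate argsz against header + payload, and only then
 * copy the payload. Nothing beyond what argsz covers is ever read.
 */
static int fetch_request(const void *user, uint32_t *flags,
			 void *payload, size_t payload_cap)
{
	struct pasid_request_hdr hdr;
	size_t minsz = sizeof(hdr);
	size_t need;

	assert(user && flags && payload);
	memcpy(&hdr, user, minsz);

	need = payload_size(hdr.flags);
	if (!need || need > payload_cap)
		return -EINVAL;
	if (hdr.argsz < minsz + need)
		return -EINVAL;

	memcpy(payload, (const uint8_t *)user + minsz, need);
	*flags = hdr.flags;
	return 0;
}
```

The same check would be needed before writing a result back, since the
result field also lives beyond minsz.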

> > > > +
> > > > +		switch (req.flags & VFIO_PASID_REQUEST_MASK) {
> > > > +		case VFIO_IOMMU_PASID_ALLOC:
> > > > +		{
> > > > +			int ret = 0, result;
> > > > +
> > > > +			result = vfio_iommu_type1_pasid_alloc(iommu,
> > > > +							req.alloc_pasid.min,
> > > > +							req.alloc_pasid.max);
> > > > +			if (result > 0) {
> > > > +				offset = offsetof(
> > > > +					struct vfio_iommu_type1_pasid_request,
> > > > +					alloc_pasid.result);
> > > > +				ret = copy_to_user(
> > > > +					      (void __user *) (arg + offset),
> > > > +					      &result, sizeof(result));
> > >
> > > Again assuming argsz supports this.
> >
> > same as above.
> >
> > >
> > > > +			} else {
> > > > +				pr_debug("%s: PASID alloc failed\n", __func__);
> > >
> > > rate limit?
> >
> > I don't quite get it. could you give more hints?
> 
> A user can spam the host logs simply by allocating their quota of
> PASIDs and then trying to allocate more, or by specifying min/max such
> that they cannot allocate the requested PASID.  If this logging is
> necessary for debugging, it should be ratelimited to avoid a DoS on the
> host.

got it. thanks for the coaching. will use pr_debug_ratelimited().
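
For reference, the idea behind rate limiting can be sketched in a few
lines of userspace C. This is only an illustration of the concept (at
most `burst` messages per `interval`), not the kernel's implementation
behind pr_debug_ratelimited(); the struct and function names are
hypothetical and time is passed in explicitly to keep it simple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: at most `burst` messages per `interval` ticks. */
struct ratelimit {
	uint64_t interval;	/* window length, in arbitrary ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t window_start;	/* tick at which the current window began */
	unsigned int printed;	/* messages emitted in the current window */
};

/* Return true if a message may be emitted at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	assert(rl);

	/* Start a fresh window once the previous one has expired. */
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}

	if (rl->printed >= rl->burst)
		return false;	/* suppressed: the log cannot be flooded */

	rl->printed++;
	return true;
}
```

A user repeatedly failing the alloc would then produce at most `burst`
log lines per window, however many requests are issued.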

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-07  9:43           ` Liu, Yi L
@ 2020-04-08  1:02             ` Liu, Yi L
  2020-04-08 10:27             ` Auger Eric
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-08  1:02 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric
  Cc: alex.williamson, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok,
	Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm, linux-kernel, Wu,
	Hao

> From: Liu, Yi L
> Sent: Tuesday, April 7, 2020 5:43 PM
>
> > We don't, the PASID spaces are per-VM on Arm, so this function should
> > consult the IOMMU driver before setting flags. As you said on patch 3,
> > nested doesn't necessarily imply PASID support. The SMMUv2 does not
> > support PASID but does support nesting stages 1 and 2 for the IOVA space.
> > SMMUv3 support of PASID depends on HW capabilities. So I think this
> > needs to be finer grained:
> >
> > Does the container support:
> > * VFIO_IOMMU_PASID_REQUEST?
> >   -> Yes for VT-d 3
> >   -> No for Arm SMMU
> > * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
> >   -> Yes for VT-d 3
> >   -> Sometimes for SMMUv2
> >   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
> >      PASID tables being in GPA space.)
> > * VFIO_IOMMU_BIND_PASID_TABLE?
> >   -> No for VT-d
> >   -> Sometimes for SMMUv3
> >
> > Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
> 
> good summary. do you expect to see any
please ignore this message. I planned to ask whether it is possible to report
VFIO_IOMMU_CACHE_INVALIDATE only (no bind support). But I stopped
typing when I came to believe it's unnecessary to report it if
there is no bind support.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-03 15:34       ` Alex Williamson
@ 2020-04-08  2:28         ` Liu, Yi L
  2020-04-16 10:40         ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-08  2:28 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 11:35 PM
> Subject: Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
> 
> On Fri, 3 Apr 2020 06:39:22 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 4:24 AM
> > >
> > > On Sun, 22 Mar 2020 05:32:04 -0700
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > From: Liu Yi L <yi.l.liu@linux.intel.com>
> > > >
[...]
> 
> > >
> > > > +
> > > > +		if (copy_from_user(&cache_inv, (void __user *)arg, minsz))
> > > > +			return -EFAULT;
> > > > +
> > > > +		if (cache_inv.argsz < minsz || cache_inv.flags)
> > > > +			return -EINVAL;
> > > > +
> > > > +		/* Get the version of struct iommu_cache_invalidate_info */
> > > > +		if (copy_from_user(&version,
> > > > +			(void __user *) (arg + minsz), sizeof(version)))
> > > > +			return -EFAULT;
> > > > +
> > > > +		info_size = iommu_uapi_get_data_size(
> > > > +					IOMMU_UAPI_CACHE_INVAL,
> > > version);
> > > > +
> > > > +		cache_info = kzalloc(info_size, GFP_KERNEL);
> > > > +		if (!cache_info)
> > > > +			return -ENOMEM;
> > > > +
> > > > +		if (copy_from_user(cache_info,
> > > > +			(void __user *) (arg + minsz), info_size)) {
> > > > +			kfree(cache_info);
> > > > +			return -EFAULT;
> > > > +		}
> > > > +
> > > > +		mutex_lock(&iommu->lock);
> > > > +		ret = vfio_iommu_for_each_dev(iommu, vfio_cache_inv_fn,
> > > > +					    cache_info);
> > >
> > > How does a user respond when their cache invalidate fails?  Isn't this
> > > also another case where our for_each_dev can fail at an arbitrary point
> > > leaving us with no idea whether each device even had the opportunity to
> > > perform the invalidation request.  I don't see how we have any chance
> > > to maintain coherency after this faults.
> >
> > Then can we make it simple to support singleton group only?
> 
> Are you suggesting a single group per container or a single device per
> group? Unless we have both, aren't we always going to have this issue.

Agreed, we need both to avoid the potential for_each_dev() loop issue.
I suppose this is also the most typical and desired config for vSVA
support. I think it makes sense for the reasons below:

a) one group per container
PASID and nested translation give user-space a chance to attach its
page table (e.g. guest process page table) to the host IOMMU; this is vSVA.
If multiple groups are added to a vSVA-capable container, then an SVA bind
on this container means binding with all groups (and their devices)
within the container. This doesn't make sense for three reasons: first,
the passthru devices are not necessarily manipulated by the same
guest application; second, the passthru devices are not necessarily added
to a single guest group; third, not all passthru devices (either from
different groups or the same group) are SVA-capable.
Given the above, enforcing one group per container makes sense to me.

b) one device per group
SVA support is limited to singleton groups so far in the bare-metal bind
per Jean's series. I think it'd be good to follow that in the passthru case.
https://patchwork.kernel.org/patch/10213877/
https://lkml.org/lkml/2019/4/10/663
As mentioned in a), a group may have both SVA-capable and non-SVA-capable
devices, and it would be a problem for VFIO to figure out a way to isolate
them.
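
To make a) and b) concrete, below is a minimal user-space sketch of the
"singleton" check. It is not VFIO code: struct demo_group, struct
demo_container and demo_container_is_singleton() are hypothetical names
used only for illustration, and the real check would have to live in the
type1 backend under iommu->lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model; these are not the kernel's structs. */
struct demo_group {
	size_t nr_devices;
};

struct demo_container {
	const struct demo_group *groups;
	size_t nr_groups;
};

/*
 * True only when the container holds exactly one group and that group
 * holds exactly one device. In that case a for_each_dev() style walk
 * visits a single device, so it cannot fail half way through.
 */
bool demo_container_is_singleton(const struct demo_container *c)
{
	return c && c->groups && c->nr_groups == 1 &&
	       c->groups[0].nr_devices == 1;
}
```

With such a check, bind and cache invalidate could be refused up front
for any container that is not singleton, instead of failing at an
arbitrary device in the loop.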

> OTOH, why should a cache invalidate fail?

There are sanity checks done by the vendor iommu driver against the invalidate
request from userspace, so it may fail if a sanity check fails. But I guess
it may be better to do something like abort instead of failing the request,
isn't it?

> 
> > > > +		mutex_unlock(&iommu->lock);
> > > > +		kfree(cache_info);
> > > > +		return ret;
> > > >  	}
> > > >
> > > >  	return -ENOTTY;
> > > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > > index 2235bc6..62ca791 100644
> > > > --- a/include/uapi/linux/vfio.h
> > > > +++ b/include/uapi/linux/vfio.h
> > > > @@ -899,6 +899,28 @@ struct vfio_iommu_type1_bind {
> > > >   */
> > > >  #define VFIO_IOMMU_BIND		_IO(VFIO_TYPE, VFIO_BASE + 23)
> > > >
> > > > +/**
> > > > + * VFIO_IOMMU_CACHE_INVALIDATE - _IOW(VFIO_TYPE, VFIO_BASE + 24,
> > > > + *			struct vfio_iommu_type1_cache_invalidate)
> > > > + *
> > > > + * Propagate guest IOMMU cache invalidation to the host. The cache
> > > > + * invalidation information is conveyed by @cache_info, the content
> > > > + * format would be structures defined in uapi/linux/iommu.h. User
> > > > + * should be aware of that the struct  iommu_cache_invalidate_info
> > > > + * has a @version field, vfio needs to parse this field before getting
> > > > + * data from userspace.
> > > > + *
> > > > + * Availability of this IOCTL is after VFIO_SET_IOMMU.
> > >
> > > Is this a necessary qualifier?  A user can try to call this ioctl at
> > > any point, it only makes sense in certain configurations, but it should
> > > always "do the right thing" relative to the container iommu config.
> > >
> > > Also, I don't see anything in these last few patches testing the
> > > operating IOMMU model, what happens when a user calls them when not
> > > using the nesting IOMMU?
> > >
> > > Is this ioctl and the previous BIND ioctl only valid when configured
> > > for the nesting IOMMU type?
> >
> > I think so. We should add the nesting check in those new ioctls.
> >
> > >
> > > > + *
> > > > + * returns: 0 on success, -errno on failure.
> > > > + */
> > > > +struct vfio_iommu_type1_cache_invalidate {
> > > > +	__u32   argsz;
> > > > +	__u32   flags;
> > > > +	struct	iommu_cache_invalidate_info cache_info;
> > > > +};
> > > > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> > >
> > > The future extension capabilities of this ioctl worry me, I wonder if
> > > we should do another data[] with flag defining that data as CACHE_INFO.
> >
> > Can you elaborate? Does it mean with this way we don't rely on iommu
> > driver to provide version_to_size conversion and instead we just pass
> > data[] to iommu driver for further audit?
> 
> No, my concern is that this ioctl has a single function, strictly tied
> to the iommu uapi.  If we replace cache_info with data[] then we can
> define a flag to specify that data[] is struct
> iommu_cache_invalidate_info, and if we need to, a different flag to
> identify data[] as something else.  For example if we get stuck
> expanding cache_info to meet new demands and develop a new uapi to
> solve that, how would we expand this ioctl to support it rather than
> also create a new ioctl?  There's also a trade-off in making the ioctl
> usage more difficult for the user.  I'd still expect the vfio layer to
> check the flag and interpret data[] as indicated by the flag rather
> than just passing a blob of opaque data to the iommu layer though.

Ok, I think data[] is acceptable. BTW, do you have any decision on the
uapi version open in Jacob's thread? I'd like to rework my patch based
on your decision.

https://lkml.org/lkml/2020/4/2/876
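
To check my understanding of the data[] proposal, here is a rough
user-space sketch. It is only an assumption of how the layout could look:
struct demo_cache_invalidate, DEMO_CACHE_INV_FLAG_INFO and
demo_cache_inv_validate() are made-up names, not the proposed uapi. The
point is that the vfio layer checks the flag and the size before it
interprets data[], rather than passing an opaque blob down.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: data[] carries a cache invalidate info payload. */
#define DEMO_CACHE_INV_FLAG_INFO	(1u << 0)

/* Hypothetical layout: fixed header followed by a flag-described payload. */
struct demo_cache_invalidate {
	uint32_t argsz;
	uint32_t flags;
	uint8_t  data[];
};

/*
 * Validate the header against the payload size expected for the flag.
 * info_size stands in for whatever size the iommu uapi reports for the
 * payload version in use.
 */
int demo_cache_inv_validate(const struct demo_cache_invalidate *inv,
			    size_t info_size)
{
	size_t minsz = offsetof(struct demo_cache_invalidate, data);

	if (!inv || inv->argsz < minsz)
		return -EINVAL;

	switch (inv->flags) {
	case DEMO_CACHE_INV_FLAG_INFO:
		/* The payload must fit in what the user claims to pass. */
		if (inv->argsz - minsz < info_size)
			return -EINVAL;
		return 0;
	default:
		/* Unknown or combined flags are rejected, not forwarded. */
		return -EINVAL;
	}
}
```

A future payload type would then only need a new flag and a new case,
without a new ioctl.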

thanks again for your help. :-)

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-07  9:43           ` Liu, Yi L
  2020-04-08  1:02             ` Liu, Yi L
@ 2020-04-08 10:27             ` Auger Eric
  2020-04-09  8:14               ` Jean-Philippe Brucker
  1 sibling, 1 reply; 110+ messages in thread
From: Auger Eric @ 2020-04-08 10:27 UTC (permalink / raw)
  To: Liu, Yi L, Jean-Philippe Brucker
  Cc: alex.williamson, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok,
	Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm, linux-kernel, Wu,
	Hao

Hi Yi,

On 4/7/20 11:43 AM, Liu, Yi L wrote:
> Hi Jean,
> 
>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> Sent: Friday, April 3, 2020 4:23 PM
>> To: Auger Eric <eric.auger@redhat.com>
>> userspace
>>
>> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
>>>>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>>>>>>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
>> @@ -2254,6 +2309,7
>>>>>> @@ static int vfio_iommu_info_add_nesting_cap(struct
>>>>> vfio_iommu *iommu,
>>>>>>  		/* nesting iommu type supports PASID requests (alloc/free) */
>>>>>>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
>>>>> What is the meaning for ARM?
>>>>
>>>> I think it's just a software capability exposed to userspace, on
>>>> userspace side, it has a choice to use it or not. :-) The reason
>>>> define it and report it in cap nesting is that I'd like to make the
>>>> pasid alloc/free be available just for IOMMU with type
>>>> VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not good
>>>> for ARM. We can find a proper way to report the availability.
>>>
>>> Well it is more a question for jean-Philippe. Do we have a system wide
>>> PASID allocation on ARM?
>>
>> We don't, the PASID spaces are per-VM on Arm, so this function should consult the
>> IOMMU driver before setting flags. As you said on patch 3, nested doesn't
>> necessarily imply PASID support. The SMMUv2 does not support PASID but does
>> support nesting stages 1 and 2 for the IOVA space.
>> SMMUv3 support of PASID depends on HW capabilities. So I think this needs to be
>> finer grained:
>>
>> Does the container support:
>> * VFIO_IOMMU_PASID_REQUEST?
>>   -> Yes for VT-d 3
>>   -> No for Arm SMMU
>> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
>>   -> Yes for VT-d 3
>>   -> Sometimes for SMMUv2
>>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
>>      PASID tables being in GPA space.)
>> * VFIO_IOMMU_BIND_PASID_TABLE?
>>   -> No for VT-d
>>   -> Sometimes for SMMUv3
>>
>> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
> 
> good summary. do you expect to see any 
> 
>>
>>>>>> +	nesting_cap->stage1_formats = formats;
>>>>> as spotted by Kevin, since a single format is supported, rename
>>>>
>>>> ok, I was believing it may be possible on ARM or so. :-) will rename
>>>> it.
>>
>> Yes I don't think an u32 is going to cut it for Arm :( We need to describe all sorts of
>> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size, HW
>> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean much. I
>> guess we could have a secondary vendor capability for these?
> 
> Actually, I'm wondering if we can define some formats to stands for a set of
> capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may indicates the 1st level
> page table related caps (aw, a/d, SRE, EA and etc.). And vIOMMU can parse
> the capabilities.

But eventually do we really need all those capability getters? I mean
can't we simply rely on the actual call to VFIO_IOMMU_BIND_GUEST_PGTBL()
to detect any mismatch? Definitely the error handling may be heavier
on userspace, but can't we manage? My fear is we end up with an overly
complex series. This capability getter may be interesting if we can
switch to a fallback implementation but here I guess we don't have any
fallback. With smmuv3 nested stage we don't have any fallback solution
either. For the versions, it is different because the userspace shall be
able to adapt (or not) to the max version supported by the kernel.

Thanks

Eric
> 
> Regards,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-08 10:27             ` Auger Eric
@ 2020-04-09  8:14               ` Jean-Philippe Brucker
  2020-04-09  9:01                 ` Auger Eric
  2020-04-09 12:47                 ` Liu, Yi L
  0 siblings, 2 replies; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-09  8:14 UTC (permalink / raw)
  To: Auger Eric
  Cc: Liu, Yi L, alex.williamson, Tian, Kevin, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Wed, Apr 08, 2020 at 12:27:58PM +0200, Auger Eric wrote:
> Hi Yi,
> 
> On 4/7/20 11:43 AM, Liu, Yi L wrote:
> > Hi Jean,
> > 
> >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >> Sent: Friday, April 3, 2020 4:23 PM
> >> To: Auger Eric <eric.auger@redhat.com>
> >> userspace
> >>
> >> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
> >>>>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> >>>>>>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> >> @@ -2254,6 +2309,7
> >>>>>> @@ static int vfio_iommu_info_add_nesting_cap(struct
> >>>>> vfio_iommu *iommu,
> >>>>>>  		/* nesting iommu type supports PASID requests (alloc/free) */
> >>>>>>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
> >>>>> What is the meaning for ARM?
> >>>>
> >>>> I think it's just a software capability exposed to userspace, on
> >>>> userspace side, it has a choice to use it or not. :-) The reason
> >>>> define it and report it in cap nesting is that I'd like to make the
> >>>> pasid alloc/free be available just for IOMMU with type
> >>>> VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not good
> >>>> for ARM. We can find a proper way to report the availability.
> >>>
> >>> Well it is more a question for jean-Philippe. Do we have a system wide
> >>> PASID allocation on ARM?
> >>
> >> We don't, the PASID spaces are per-VM on Arm, so this function should consult the
> >> IOMMU driver before setting flags. As you said on patch 3, nested doesn't
> >> necessarily imply PASID support. The SMMUv2 does not support PASID but does
> >> support nesting stages 1 and 2 for the IOVA space.
> >> SMMUv3 support of PASID depends on HW capabilities. So I think this needs to be
> >> finer grained:
> >>
> >> Does the container support:
> >> * VFIO_IOMMU_PASID_REQUEST?
> >>   -> Yes for VT-d 3
> >>   -> No for Arm SMMU
> >> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
> >>   -> Yes for VT-d 3
> >>   -> Sometimes for SMMUv2
> >>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
> >>      PASID tables being in GPA space.)
> >> * VFIO_IOMMU_BIND_PASID_TABLE?
> >>   -> No for VT-d
> >>   -> Sometimes for SMMUv3
> >>
> >> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
> > 
> > good summary. do you expect to see any 
> > 
> >>
> >>>>>> +	nesting_cap->stage1_formats = formats;
> >>>>> as spotted by Kevin, since a single format is supported, rename
> >>>>
> >>>> ok, I was believing it may be possible on ARM or so. :-) will rename
> >>>> it.
> >>
> >> Yes I don't think an u32 is going to cut it for Arm :( We need to describe all sorts of
> >> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size, HW
> >> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean much. I
> >> guess we could have a secondary vendor capability for these?
> > 
> > Actually, I'm wondering if we can define some formats to stands for a set of
> > capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may indicates the 1st level
> > page table related caps (aw, a/d, SRE, EA and etc.). And vIOMMU can parse
> > the capabilities.
> 
> But eventually do we really need all those capability getters? I mean
> can't we simply rely on the actual call to VFIO_IOMMU_BIND_GUEST_PGTBL()
> to detect any mismatch? Definitively the error handling may be heavier
> on userspace but can't we manage.

I think we need to present these capabilities at boot time, long before
the guest triggers a bind(). For example if the host SMMU doesn't support
16-bit ASID, we need to communicate that to the guest using vSMMU ID
registers or PROBE properties. Otherwise a bind() will succeed, but if the
guest uses 16-bit ASIDs in its CD, DMA will result in C_BAD_CD events
which we'll inject into the guest, for no apparent reason from their
perspective.

In addition some VMMs may have fallbacks if shared page tables are not
available. They could fall back to a MAP/UNMAP interface, or simply not
present a vIOMMU to the guest.
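
As a rough illustration of the boot-time decision (a sketch only, with
hypothetical names: struct demo_host_caps, demo_pick_mode() and
demo_guest_asid_bits() do not exist in any VMM or in the kernel), the VMM
would query the host capabilities once and derive both the vIOMMU mode
and the ID values it advertises to the guest:

```c
#include <stdbool.h>

/* Hypothetical choices a VMM has when building the guest. */
enum demo_viommu_mode {
	DEMO_VIOMMU_NESTED,	/* share guest stage-1 tables with the host */
	DEMO_VIOMMU_MAP_UNMAP,	/* fall back to a MAP/UNMAP interface */
	DEMO_VIOMMU_NONE,	/* do not present a vIOMMU at all */
};

/* Hypothetical summary of what the host reported at boot time. */
struct demo_host_caps {
	bool nesting;		/* host can bind guest page tables */
	unsigned int asid_bits;	/* ASID width the host supports, e.g. 8 or 16 */
};

/* Pick the vIOMMU flavour before the guest ever issues a bind(). */
enum demo_viommu_mode demo_pick_mode(const struct demo_host_caps *caps,
				     bool have_map_unmap)
{
	if (caps->nesting)
		return DEMO_VIOMMU_NESTED;
	return have_map_unmap ? DEMO_VIOMMU_MAP_UNMAP : DEMO_VIOMMU_NONE;
}

/* Never advertise a wider ASID to the guest than the host can handle. */
unsigned int demo_guest_asid_bits(const struct demo_host_caps *caps,
				  unsigned int wanted)
{
	return wanted < caps->asid_bits ? wanted : caps->asid_bits;
}
```

Clamping the advertised ASID width up front is what avoids the
C_BAD_CD case above; none of it is possible if the only feedback is
the return value of bind().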

Thanks,
Jean

> My fear is we end up with an overly
> complex series. This capability getter may be interesting if we can
> switch to a fallback implementation but here I guess we don't have any
> fallback. With smmuv3 nested stage we don't have any fallback solution
> either. For the versions, it is different because the userspace shall be
> able to adapt (or not) to the max version supported by the kernel.
> 
> Thanks
> 
> Eric
> > 
> > Regards,
> > Yi Liu
> > 
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-07 10:33             ` Liu, Yi L
@ 2020-04-09  8:28               ` Jean-Philippe Brucker
  2020-04-09  9:15                 ` Liu, Yi L
  0 siblings, 1 reply; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-09  8:28 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, alex.williamson, eric.auger, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Tue, Apr 07, 2020 at 10:33:25AM +0000, Liu, Yi L wrote:
> Hi Jean,
> 
> > From: Jean-Philippe Brucker < jean-philippe@linaro.org >
> > Sent: Friday, April 3, 2020 4:35 PM
> > Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> > 
> > On Thu, Apr 02, 2020 at 08:05:29AM +0000, Liu, Yi L wrote:
> > > > > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > > > >  		default:
> > > > > > >  			return -EINVAL;
> > > > > > >  		}
> > > > > > > +
> > > > > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > > > >
> > > > > > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> > > > >
> > > > > Emm, it's up to the flags to indicate bind what. It was proposed to
> > > > > cover the three cases below:
> > > > > a) BIND/UNBIND_GPASID
> > > > > b) BIND/UNBIND_GPASID_TABLE
> > > > > c) BIND/UNBIND_PROCESS
> > > > > <only a) is covered in this patch>
> > > > > So it's called VFIO_IOMMU_BIND.
> > > >
> > > > but aren't they all about PASID related binding?
> > >
> > > yeah, I can rename it. :-)
> > 
> > I don't know if anyone intends to implement it, but SMMUv2 supports
> > nesting translation without any PASID support. For that case the name
> > VFIO_IOMMU_BIND_GUEST_PGTBL without "PASID" anywhere makes more sense.
> > Ideally we'd also use a neutral name for the IOMMU API instead of
> > bind_gpasid(), but that's easier to change later.
> 
> I agree VFIO_IOMMU_BIND is somehow not straight-forward. Especially, it may
> cause confusion when thinking about VFIO_SET_IOMMU. How about using
> VFIO_NESTING_IOMMU_BIND_STAGE1 to cover a) and b)? And has another
> VFIO_BIND_PROCESS in future for the SVA bind case.

I think minimizing the number of ioctls is more important than finding the
ideal name. VFIO_IOMMU_BIND was fine to me, but if it's too vague then
rename it to VFIO_IOMMU_BIND_PASID and we'll just piggy-back on it for
non-PASID things (they should be rare enough).
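
Roughly what I have in mind, as a user-space sketch with made-up names
(the DEMO_BIND_* flags and demo_bind() are illustrative only, not the
proposed uapi): one bind entry point, with the flags selecting what is
bound, and unimplemented cases reported distinctly from malformed ones.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flags naming what a single bind ioctl is asked to bind. */
#define DEMO_BIND_GUEST_PGTBL	(1u << 0)	/* case a), may be non-PASID too */
#define DEMO_BIND_PASID_TABLE	(1u << 1)	/* case b) */
#define DEMO_BIND_PROCESS	(1u << 2)	/* case c) */

/*
 * Dispatch on the flags of one shared ioctl. Only case a) is handled
 * here, mirroring what the current patch covers.
 */
int demo_bind(uint32_t flags)
{
	switch (flags) {
	case DEMO_BIND_GUEST_PGTBL:
		return 0;
	case DEMO_BIND_PASID_TABLE:
	case DEMO_BIND_PROCESS:
		/* Known request, not implemented yet. */
		return -EOPNOTSUPP;
	default:
		/* No flag, unknown flag, or more than one flag. */
		return -EINVAL;
	}
}
```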

Thanks,
Jean

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-09  8:14               ` Jean-Philippe Brucker
@ 2020-04-09  9:01                 ` Auger Eric
  2020-04-09 12:47                 ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Auger Eric @ 2020-04-09  9:01 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Liu, Yi L, alex.williamson, Tian, Kevin, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Jean,

On 4/9/20 10:14 AM, Jean-Philippe Brucker wrote:
> On Wed, Apr 08, 2020 at 12:27:58PM +0200, Auger Eric wrote:
>> Hi Yi,
>>
>> On 4/7/20 11:43 AM, Liu, Yi L wrote:
>>> Hi Jean,
>>>
>>>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>> Sent: Friday, April 3, 2020 4:23 PM
>>>> To: Auger Eric <eric.auger@redhat.com>
>>>> userspace
>>>>
>>>> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
>>>>>>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>>>>>>>>  				   VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
>>>> @@ -2254,6 +2309,7
>>>>>>>> @@ static int vfio_iommu_info_add_nesting_cap(struct
>>>>>>> vfio_iommu *iommu,
>>>>>>>>  		/* nesting iommu type supports PASID requests (alloc/free) */
>>>>>>>>  		nesting_cap->nesting_capabilities |= VFIO_IOMMU_PASID_REQS;
>>>>>>> What is the meaning for ARM?
>>>>>>
>>>>>> I think it's just a software capability exposed to userspace, on
>>>>>> userspace side, it has a choice to use it or not. :-) The reason
>>>>>> define it and report it in cap nesting is that I'd like to make the
>>>>>> pasid alloc/free be available just for IOMMU with type
>>>>>> VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not good
>>>>>> for ARM. We can find a proper way to report the availability.
>>>>>
>>>>> Well it is more a question for jean-Philippe. Do we have a system wide
>>>>> PASID allocation on ARM?
>>>>
>>>> We don't, the PASID spaces are per-VM on Arm, so this function should consult the
>>>> IOMMU driver before setting flags. As you said on patch 3, nested doesn't
>>>> necessarily imply PASID support. The SMMUv2 does not support PASID but does
>>>> support nesting stages 1 and 2 for the IOVA space.
>>>> SMMUv3 support of PASID depends on HW capabilities. So I think this needs to be
>>>> finer grained:
>>>>
>>>> Does the container support:
>>>> * VFIO_IOMMU_PASID_REQUEST?
>>>>   -> Yes for VT-d 3
>>>>   -> No for Arm SMMU
>>>> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
>>>>   -> Yes for VT-d 3
>>>>   -> Sometimes for SMMUv2
>>>>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
>>>>      PASID tables being in GPA space.)
>>>> * VFIO_IOMMU_BIND_PASID_TABLE?
>>>>   -> No for VT-d
>>>>   -> Sometimes for SMMUv3
>>>>
>>>> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
>>>
>>> good summary. do you expect to see any 
>>>
>>>>
>>>>>>>> +	nesting_cap->stage1_formats = formats;
>>>>>>> as spotted by Kevin, since a single format is supported, rename
>>>>>>
>>>>>> ok, I was believing it may be possible on ARM or so. :-) will rename
>>>>>> it.
>>>>
>>>> Yes I don't think an u32 is going to cut it for Arm :( We need to describe all sorts of
>>>> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size, HW
>>>> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean much. I
>>>> guess we could have a secondary vendor capability for these?
>>>
>>> Actually, I'm wondering if we can define some formats to stands for a set of
>>> capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may indicates the 1st level
>>> page table related caps (aw, a/d, SRE, EA and etc.). And vIOMMU can parse
>>> the capabilities.
>>
>> But eventually do we really need all those capability getters? I mean
>> can't we simply rely on the actual call to VFIO_IOMMU_BIND_GUEST_PGTBL()
>> to detect any mismatch? Definitively the error handling may be heavier
>> on userspace but can't we manage.
> 
> I think we need to present these capabilities at boot time, long before
> the guest triggers a bind(). For example if the host SMMU doesn't support
> 16-bit ASID, we need to communicate that to the guest using vSMMU ID
> registers or PROBE properties. Otherwise a bind() will succeed, but if the
> guest uses 16-bit ASIDs in its CD, DMA will result in C_BAD_CD events
> which we'll inject into the guest, for no apparent reason from their
> perspective.
OK I understand this case, as in this situation we may be able to change
the way the iommu is exposed to the guest.
> 
> In addition some VMMs may have fallbacks if shared page tables are not
> available. They could fall back to a MAP/UNMAP interface, or simply not
> present a vIOMMU to the guest.
Fair enough, there is a need for such a capability checker in the mid
term. But this patch introduces the capability to check whether system-wide
PASID alloc is supported, and this may not be required at this
stage for the whole vSVM integration?

Thanks

Eric
> 
> Thanks,
> Jean
> 
>> My fear is we end up with an overly
>> complex series. This capability getter may be interesting if we can
>> switch to a fallback implementation but here I guess we don't have any
>> fallback. With smmuv3 nested stage we don't have any fallback solution
>> either. For the versions, it is different because the userspace shall be
>> able to adapt (or not) to the max version supported by the kernel.
>>
>> Thanks
>>
>> Eric
>>>
>>> Regards,
>>> Yi Liu
>>>
>>
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-09  8:28               ` Jean-Philippe Brucker
@ 2020-04-09  9:15                 ` Liu, Yi L
  2020-04-09  9:38                   ` Jean-Philippe Brucker
  0 siblings, 1 reply; 110+ messages in thread
From: Liu, Yi L @ 2020-04-09  9:15 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Tian, Kevin, alex.williamson, eric.auger, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Thursday, April 9, 2020 4:29 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> 
> On Tue, Apr 07, 2020 at 10:33:25AM +0000, Liu, Yi L wrote:
> > Hi Jean,
> >
> > > From: Jean-Philippe Brucker < jean-philippe@linaro.org >
> > > Sent: Friday, April 3, 2020 4:35 PM
> > > Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> > >
> > > On Thu, Apr 02, 2020 at 08:05:29AM +0000, Liu, Yi L wrote:
> > > > > > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > > > > >  		default:
> > > > > > > >  			return -EINVAL;
> > > > > > > >  		}
> > > > > > > > +
> > > > > > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > > > > >
> > > > > > > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> > > > > >
> > > > > > Emm, it's up to the flags to indicate bind what. It was proposed to
> > > > > > cover the three cases below:
> > > > > > a) BIND/UNBIND_GPASID
> > > > > > b) BIND/UNBIND_GPASID_TABLE
> > > > > > c) BIND/UNBIND_PROCESS
> > > > > > <only a) is covered in this patch>
> > > > > > So it's called VFIO_IOMMU_BIND.
> > > > >
> > > > > but aren't they all about PASID related binding?
> > > >
> > > > yeah, I can rename it. :-)
> > >
> > > I don't know if anyone intends to implement it, but SMMUv2 supports
> > > nesting translation without any PASID support. For that case the name
> > > VFIO_IOMMU_BIND_GUEST_PGTBL without "PASID" anywhere makes more
> sense.
> > > Ideally we'd also use a neutral name for the IOMMU API instead of
> > > bind_gpasid(), but that's easier to change later.
> >
> > I agree VFIO_IOMMU_BIND is somehow not straight-forward. Especially, it may
> > cause confusion when thinking about VFIO_SET_IOMMU. How about using
> > VFIO_NESTING_IOMMU_BIND_STAGE1 to cover a) and b)? And has another
> > VFIO_BIND_PROCESS in future for the SVA bind case.
> 
> I think minimizing the number of ioctls is more important than finding the
> ideal name. VFIO_IOMMU_BIND was fine to me, but if it's too vague then
> rename it to VFIO_IOMMU_BIND_PASID and we'll just piggy-back on it for
> non-PASID things (they should be rare enough).
Maybe we can start with VFIO_IOMMU_BIND_PASID. Actually, there is
also a discussion on reusing the same ioctl and vfio structure for
pasid_alloc/free, bind/unbind_gpasid, and cache_inv. What is your
opinion?

https://lkml.org/lkml/2020/4/3/833

Regards,
Yi Liu




^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-09  9:15                 ` Liu, Yi L
@ 2020-04-09  9:38                   ` Jean-Philippe Brucker
  0 siblings, 0 replies; 110+ messages in thread
From: Jean-Philippe Brucker @ 2020-04-09  9:38 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, alex.williamson, eric.auger, jacob.jun.pan, joro,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Thu, Apr 09, 2020 at 09:15:29AM +0000, Liu, Yi L wrote:
> > From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Sent: Thursday, April 9, 2020 4:29 PM
> > To: Liu, Yi L <yi.l.liu@intel.com>
> > 
> > On Tue, Apr 07, 2020 at 10:33:25AM +0000, Liu, Yi L wrote:
> > > Hi Jean,
> > >
> > > > From: Jean-Philippe Brucker < jean-philippe@linaro.org >
> > > > Sent: Friday, April 3, 2020 4:35 PM
> > > > Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> > > >
> > > > On Thu, Apr 02, 2020 at 08:05:29AM +0000, Liu, Yi L wrote:
> > > > > > > > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > > > > > >  		default:
> > > > > > > > >  			return -EINVAL;
> > > > > > > > >  		}
> > > > > > > > > +
> > > > > > > > > +	} else if (cmd == VFIO_IOMMU_BIND) {
> > > > > > > >
> > > > > > > > BIND what? VFIO_IOMMU_BIND_PASID sounds clearer to me.
> > > > > > >
> > > > > > > Emm, it's up to the flags to indicate bind what. It was proposed to
> > > > > > > cover the three cases below:
> > > > > > > a) BIND/UNBIND_GPASID
> > > > > > > b) BIND/UNBIND_GPASID_TABLE
> > > > > > > c) BIND/UNBIND_PROCESS
> > > > > > > <only a) is covered in this patch>
> > > > > > > So it's called VFIO_IOMMU_BIND.
> > > > > >
> > > > > > but aren't they all about PASID related binding?
> > > > >
> > > > > yeah, I can rename it. :-)
> > > >
> > > > I don't know if anyone intends to implement it, but SMMUv2 supports
> > > > nesting translation without any PASID support. For that case the name
> > > > VFIO_IOMMU_BIND_GUEST_PGTBL without "PASID" anywhere makes more
> > sense.
> > > > Ideally we'd also use a neutral name for the IOMMU API instead of
> > > > bind_gpasid(), but that's easier to change later.
> > >
> > > I agree VFIO_IOMMU_BIND is somehow not straight-forward. Especially, it may
> > > cause confusion when thinking about VFIO_SET_IOMMU. How about using
> > > VFIO_NESTING_IOMMU_BIND_STAGE1 to cover a) and b)? And has another
> > > VFIO_BIND_PROCESS in future for the SVA bind case.
> > 
> > I think minimizing the number of ioctls is more important than finding the
> > ideal name. VFIO_IOMMU_BIND was fine to me, but if it's too vague then
> > rename it to VFIO_IOMMU_BIND_PASID and we'll just piggy-back on it for
> > non-PASID things (they should be rare enough).
> maybe we can start with VFIO_IOMMU_BIND_PASID. Actually, there is
> also a discussion on reusing the same ioctl and vfio structure for
> pasid_alloc/free, bind/unbind_gpasid. and cache_inv. how about your
> opinion?

Merging bind with unbind and alloc with free makes sense. I'd leave at
least invalidate a separate ioctl, because alloc/bind/unbind/free are
setup functions while invalidate is a runtime thing and performance
sensitive. But I can't see a good reason not to merge them all together,
so either way is fine by me.
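
For the merged alloc/free case, a small sketch of the flag handling I
would expect (hypothetical names only: the DEMO_PASID_* flags and
demo_pasid_req_op() are not part of the series). The request must carry
exactly one operation bit and nothing else:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical operation flags for one merged PASID request ioctl. */
#define DEMO_PASID_ALLOC	(1u << 0)
#define DEMO_PASID_FREE		(1u << 1)
#define DEMO_PASID_OPS		(DEMO_PASID_ALLOC | DEMO_PASID_FREE)

/*
 * Return the single requested operation flag, or -EINVAL when the
 * flags are empty, carry unknown bits, or ask for both operations.
 */
int demo_pasid_req_op(uint32_t flags)
{
	uint32_t op = flags & DEMO_PASID_OPS;

	if (flags & ~DEMO_PASID_OPS)
		return -EINVAL;
	/* op & (op - 1) is non-zero when more than one bit is set. */
	if (op == 0 || (op & (op - 1)))
		return -EINVAL;
	return (int)op;
}
```

The same pattern would extend to bind/unbind; invalidate could keep its
own entry point so the runtime path does not go through this decoding.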

Thanks,
Jean

^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-09  8:14               ` Jean-Philippe Brucker
  2020-04-09  9:01                 ` Auger Eric
@ 2020-04-09 12:47                 ` Liu, Yi L
  2020-04-10  3:28                   ` Auger Eric
  2020-04-10 12:30                   ` Liu, Yi L
  1 sibling, 2 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-09 12:47 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric, jacob.jun.pan
  Cc: alex.williamson, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok,
	Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm, linux-kernel, Wu,
	Hao

Hi Jean,

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Thursday, April 9, 2020 4:15 PM
> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> On Wed, Apr 08, 2020 at 12:27:58PM +0200, Auger Eric wrote:
> > Hi Yi,
> >
> > On 4/7/20 11:43 AM, Liu, Yi L wrote:
> > > Hi Jean,
> > >
> > >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > >> Sent: Friday, April 3, 2020 4:23 PM
> > >> To: Auger Eric <eric.auger@redhat.com>
> > >> userspace
> > >>
> > >> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
> > >>>>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> > >>>>>>
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> > >> @@ -2254,6 +2309,7
> > >>>>>> @@ static int vfio_iommu_info_add_nesting_cap(struct
> > >>>>> vfio_iommu *iommu,
> > >>>>>>  		/* nesting iommu type supports PASID requests (alloc/free)
> */
> > >>>>>>  		nesting_cap->nesting_capabilities |=
> VFIO_IOMMU_PASID_REQS;
> > >>>>> What is the meaning for ARM?
> > >>>>
> > >>>> I think it's just a software capability exposed to userspace, on
> > >>>> userspace side, it has a choice to use it or not. :-) The reason
> > >>>> define it and report it in cap nesting is that I'd like to make the
> > >>>> pasid alloc/free be available just for IOMMU with type
> > >>>> VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not good
> > >>>> for ARM. We can find a proper way to report the availability.
> > >>>
> > >>> Well it is more a question for jean-Philippe. Do we have a system wide
> > >>> PASID allocation on ARM?
> > >>
> > >> We don't, the PASID spaces are per-VM on Arm, so this function should consult
> the
> > >> IOMMU driver before setting flags. As you said on patch 3, nested doesn't
> > >> necessarily imply PASID support. The SMMUv2 does not support PASID but does
> > >> support nesting stages 1 and 2 for the IOVA space.
> > >> SMMUv3 support of PASID depends on HW capabilities. So I think this needs to
> be
> > >> finer grained:
> > >>
> > >> Does the container support:
> > >> * VFIO_IOMMU_PASID_REQUEST?
> > >>   -> Yes for VT-d 3
> > >>   -> No for Arm SMMU
> > >> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
> > >>   -> Yes for VT-d 3
> > >>   -> Sometimes for SMMUv2
> > >>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
> > >>      PASID tables being in GPA space.)
> > >> * VFIO_IOMMU_BIND_PASID_TABLE?
> > >>   -> No for VT-d
> > >>   -> Sometimes for SMMUv3
> > >>
> > >> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
> > >
> > > good summary. do you expect to see any
> > >
> > >>
> > >>>>>> +	nesting_cap->stage1_formats = formats;
> > >>>>> as spotted by Kevin, since a single format is supported, rename
> > >>>>
> > >>>> ok, I was believing it may be possible on ARM or so. :-) will rename
> > >>>> it.
> > >>
> > >> Yes I don't think an u32 is going to cut it for Arm :( We need to 
> > >> describe all sorts
> of
> > >> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size, HW
> > >> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean much. I
> > >> guess we could have a secondary vendor capability for these?
> > >
> > > Actually, I'm wondering if we can define some formats to stands for a set of
> > > capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may indicates the 1st level
> > > page table related caps (aw, a/d, SRE, EA and etc.). And vIOMMU can parse
> > > the capabilities.
> >
> > But eventually do we really need all those capability getters? I mean
> > can't we simply rely on the actual call to VFIO_IOMMU_BIND_GUEST_PGTBL()
> > to detect any mismatch? Definitively the error handling may be heavier
> > on userspace but can't we manage.
> 
> I think we need to present these capabilities at boot time, long before
> the guest triggers a bind(). For example if the host SMMU doesn't support
> 16-bit ASID, we need to communicate that to the guest using vSMMU ID
> registers or PROBE properties. Otherwise a bind() will succeed, but if the
> guest uses 16-bit ASIDs in its CD, DMA will result in C_BAD_CD events
> which we'll inject into the guest, for no apparent reason from their
> perspective.
> 
> In addition some VMMs may have fallbacks if shared page tables are not
> available. They could fall back to a MAP/UNMAP interface, or simply not
> present a vIOMMU to the guest.
> 

Based on the comments, I think we need to report iommu caps in detail.
So I guess the iommu uapi needs to provide something like the vfio cap
chain. Please feel free to let me know your thoughts. :-)

In vfio, we can define a cap as below:

struct vfio_iommu_type1_info_cap_nesting {
	struct  vfio_info_cap_header header;
	__u64	iommu_model;
#define VFIO_IOMMU_PASID_REQS		(1 << 0)
#define VFIO_IOMMU_BIND_GPASID		(1 << 1)
#define VFIO_IOMMU_CACHE_INV		(1 << 2)
	__u32	nesting_capabilities;
	__u32	pasid_bits;
#define VFIO_IOMMU_VENDOR_SUB_CAP	(1 << 3)
	__u32	flags;
	__u32	data_size;
	__u8	data[];  /*iommu info caps defined by iommu uapi */
};

VFIO needs new iommu APIs to ask the iommu driver whether PASID/bind_gpasid/
cache_inv/bind_gpasid_table are available, and for the number of pasid
bits. After that, VFIO will ask the iommu driver for the iommu_cap_info
and fill in the @data[] field.

iommu uapi:
struct iommu_info_cap_header {
	__u16	id;		/* Identifies capability */
	__u16	version;		/* Version specific to the capability ID */
	__u32	next;		/* Offset of next capability */
};

#define IOMMU_INFO_CAP_INTEL_VTD 1
struct iommu_info_cap_intel_vtd {
	struct	iommu_info_cap_header header;
	__u32   vaddr_width;   /* VA addr_width */
	__u32   ipaddr_width;  /* IPA addr_width, input of SL page table */
	/* same definition as @flags in struct iommu_gpasid_bind_data_vtd */
	__u64	flags;
};

#define IOMMU_INFO_CAP_ARM_SMMUv3 2
struct iommu_info_cap_arm_smmuv3 {
	struct	iommu_info_cap_header header;
	...
};
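
As a rough illustration of the chain layout above (not part of the
proposal), here is a minimal userspace sketch of walking such a chain. It
assumes that @next is a byte offset from the start of the buffer and that 0
terminates the chain, as in the existing vfio cap chain. The helper
iommu_find_cap() and the local struct cap_header mirror are hypothetical
names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the proposed iommu_info_cap_header (illustrative only). */
struct cap_header {
	uint16_t id;       /* identifies the capability */
	uint16_t version;  /* version specific to the capability ID */
	uint32_t next;     /* offset of next capability, 0 ends the chain */
};

/*
 * Hypothetical helper: return the offset of the capability with the given
 * id, or 0 if it is not present or the chain is malformed. Offsets are
 * assumed to be relative to the start of buf.
 */
static size_t iommu_find_cap(const uint8_t *buf, size_t len,
			     size_t first, uint16_t id)
{
	size_t off = first;
	size_t hops = 0;

	while (off != 0) {
		struct cap_header hdr;

		/* Reject offsets that would read past the buffer. */
		if (off > len || len - off < sizeof(hdr))
			return 0;
		/* Bound the walk so a looping chain cannot spin forever. */
		if (hops++ > len / sizeof(hdr))
			return 0;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

A caller would then cast the bytes at the returned offset to the
vendor-specific structure selected by the id.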

Regards,
Yi Liu



* Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-09 12:47                 ` Liu, Yi L
@ 2020-04-10  3:28                   ` Auger Eric
  2020-04-10  3:48                     ` Liu, Yi L
  2020-04-10 12:30                   ` Liu, Yi L
  1 sibling, 1 reply; 110+ messages in thread
From: Auger Eric @ 2020-04-10  3:28 UTC (permalink / raw)
  To: Liu, Yi L, Jean-Philippe Brucker, jacob.jun.pan
  Cc: alex.williamson, Tian, Kevin, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Yi,

On 4/9/20 2:47 PM, Liu, Yi L wrote:
> Hi Jean,
> 
>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> Sent: Thursday, April 9, 2020 4:15 PM
>> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
>> userspace
>>
>> On Wed, Apr 08, 2020 at 12:27:58PM +0200, Auger Eric wrote:
>>> Hi Yi,
>>>
>>> On 4/7/20 11:43 AM, Liu, Yi L wrote:
>>>> Hi Jean,
>>>>
>>>>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>>>>> Sent: Friday, April 3, 2020 4:23 PM
>>>>> To: Auger Eric <eric.auger@redhat.com>
>>>>> userspace
>>>>>
>>>>> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
>>>>>>>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
>>>>>>>>>
>> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
>>>>> @@ -2254,6 +2309,7
>>>>>>>>> @@ static int vfio_iommu_info_add_nesting_cap(struct
>>>>>>>> vfio_iommu *iommu,
>>>>>>>>>  		/* nesting iommu type supports PASID requests (alloc/free)
>> */
>>>>>>>>>  		nesting_cap->nesting_capabilities |=
>> VFIO_IOMMU_PASID_REQS;
>>>>>>>> What is the meaning for ARM?
>>>>>>>
>>>>>>> I think it's just a software capability exposed to userspace, on
>>>>>>> userspace side, it has a choice to use it or not. :-) The reason
>>>>>>> define it and report it in cap nesting is that I'd like to make the
>>>>>>> pasid alloc/free be available just for IOMMU with type
>>>>>>> VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not good
>>>>>>> for ARM. We can find a proper way to report the availability.
>>>>>>
>>>>>> Well it is more a question for jean-Philippe. Do we have a system wide
>>>>>> PASID allocation on ARM?
>>>>>
>>>>> We don't, the PASID spaces are per-VM on Arm, so this function should consult
>> the
>>>>> IOMMU driver before setting flags. As you said on patch 3, nested doesn't
>>>>> necessarily imply PASID support. The SMMUv2 does not support PASID but does
>>>>> support nesting stages 1 and 2 for the IOVA space.
>>>>> SMMUv3 support of PASID depends on HW capabilities. So I think this needs to
>> be
>>>>> finer grained:
>>>>>
>>>>> Does the container support:
>>>>> * VFIO_IOMMU_PASID_REQUEST?
>>>>>   -> Yes for VT-d 3
>>>>>   -> No for Arm SMMU
>>>>> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
>>>>>   -> Yes for VT-d 3
>>>>>   -> Sometimes for SMMUv2
>>>>>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler due to
>>>>>      PASID tables being in GPA space.)
>>>>> * VFIO_IOMMU_BIND_PASID_TABLE?
>>>>>   -> No for VT-d
>>>>>   -> Sometimes for SMMUv3
>>>>>
>>>>> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
>>>>
>>>> good summary. do you expect to see any
>>>>
>>>>>
>>>>>>>>> +	nesting_cap->stage1_formats = formats;
>>>>>>>> as spotted by Kevin, since a single format is supported, rename
>>>>>>>
>>>>>>> ok, I was believing it may be possible on ARM or so. :-) will rename
>>>>>>> it.
>>>>>
>>>>> Yes I don't think an u32 is going to cut it for Arm :( We need to 
>>>>> describe all sorts
>> of
>>>>> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size, HW
>>>>> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean much. I
>>>>> guess we could have a secondary vendor capability for these?
>>>>
>>>> Actually, I'm wondering if we can define some formats to stands for a set of
>>>> capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may indicates the 1st level
>>>> page table related caps (aw, a/d, SRE, EA and etc.). And vIOMMU can parse
>>>> the capabilities.
>>>
>>> But eventually do we really need all those capability getters? I mean
>>> can't we simply rely on the actual call to VFIO_IOMMU_BIND_GUEST_PGTBL()
>>> to detect any mismatch? Definitively the error handling may be heavier
>>> on userspace but can't we manage.
>>
>> I think we need to present these capabilities at boot time, long before
>> the guest triggers a bind(). For example if the host SMMU doesn't support
>> 16-bit ASID, we need to communicate that to the guest using vSMMU ID
>> registers or PROBE properties. Otherwise a bind() will succeed, but if the
>> guest uses 16-bit ASIDs in its CD, DMA will result in C_BAD_CD events
>> which we'll inject into the guest, for no apparent reason from their
>> perspective.
>>
>> In addition some VMMs may have fallbacks if shared page tables are not
>> available. They could fall back to a MAP/UNMAP interface, or simply not
>> present a vIOMMU to the guest.
>>
> 
> Based on the comments, I think it would be a need to report iommu caps
> in detail. So I guess iommu uapi needs to provide something alike vfio
> cap chain in iommu uapi. Please feel free let me know your thoughts. :-)

Yes to me it sounds sensible.
> 
> In vfio, we can define a cap as below:
> 
> struct vfio_iommu_type1_info_cap_nesting {
> 	struct  vfio_info_cap_header header;
> 	__u64	iommu_model;
> #define VFIO_IOMMU_PASID_REQS		(1 << 0)
I still think the name shall be changed
> #define VFIO_IOMMU_BIND_GPASID		(1 << 1)
> #define VFIO_IOMMU_CACHE_INV		(1 << 2)
this operation seems mandated as soon as we have a nested paging based
implementation?
> 	__u32	nesting_capabilities;
> 	__u32	pasid_bits;
> #define VFIO_IOMMU_VENDOR_SUB_CAP	(1 << 3)
> 	__u32	flags;
> 	__u32	data_size;
> 	__u8	data[];  /*iommu info caps defined by iommu uapi */
> };
> 
> VFIO needs new iommu APIs to ask iommu driver whether PASID/bind_gpasid/
> cache_inv/bind_gpasid_table is available or not and also the pasid
> bits. After that VFIO will ask iommu driver about the iommu_cap_info
> and fill in the @data[] field.
> 
> iommu uapi:
> struct iommu_info_cap_header {
> 	__u16	id;		/* Identifies capability */
> 	__u16	version;		/* Version specific to the capability ID */
> 	__u32	next;		/* Offset of next capability */
> };
> 
> #define IOMMU_INFO_CAP_INTEL_VTD 1
> struct iommu_info_cap_intel_vtd {
> 	struct	iommu_info_cap_header header;
> 	__u32   vaddr_width;   /* VA addr_width*/
> 	__u32   ipaddr_width; /* IPA addr_width, input of SL page table */
> 	/* same definition with @flags instruct iommu_gpasid_bind_data_vtd */
> 	__u64	flags;
> };
> 
> #define IOMMU_INFO_CAP_ARM_SMMUv3 2
> struct iommu_info_cap_arm_smmuv3 {
> 	struct	iommu_info_cap_header header;
> 	...
> };

Thanks

Eric
> 
> Regards,
> Yi Liu
> 



* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-10  3:28                   ` Auger Eric
@ 2020-04-10  3:48                     ` Liu, Yi L
  0 siblings, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-10  3:48 UTC (permalink / raw)
  To: Auger Eric, Jean-Philippe Brucker, jacob.jun.pan
  Cc: alex.williamson, Tian, Kevin, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Eric,

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Friday, April 10, 2020 11:28 AM
> To: Liu, Yi L <yi.l.liu@intel.com>; Jean-Philippe Brucker <jean-
> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
> Hi Yi,
> 
> On 4/9/20 2:47 PM, Liu, Yi L wrote:
> > Hi Jean,
> >
> >> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >> Sent: Thursday, April 9, 2020 4:15 PM
> >> Subject: Re: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1
> >> format to userspace
> >>
> >> On Wed, Apr 08, 2020 at 12:27:58PM +0200, Auger Eric wrote:
> >>> Hi Yi,
> >>>
> >>> On 4/7/20 11:43 AM, Liu, Yi L wrote:
> >>>> Hi Jean,
> >>>>
> >>>>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> >>>>> Sent: Friday, April 3, 2020 4:23 PM
> >>>>> To: Auger Eric <eric.auger@redhat.com> userspace
> >>>>>
> >>>>> On Wed, Apr 01, 2020 at 03:01:12PM +0200, Auger Eric wrote:
> >>>>>>>>>  	header = vfio_info_cap_add(caps, sizeof(*nesting_cap),
> >>>>>>>>>
> >> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING, 1);
> >>>>> @@ -2254,6 +2309,7
> >>>>>>>>> @@ static int vfio_iommu_info_add_nesting_cap(struct
> >>>>>>>> vfio_iommu *iommu,
> >>>>>>>>>  		/* nesting iommu type supports PASID requests
> (alloc/free)
> >> */
> >>>>>>>>>  		nesting_cap->nesting_capabilities |=
> >> VFIO_IOMMU_PASID_REQS;
> >>>>>>>> What is the meaning for ARM?
> >>>>>>>
> >>>>>>> I think it's just a software capability exposed to userspace, on
> >>>>>>> userspace side, it has a choice to use it or not. :-) The reason
> >>>>>>> define it and report it in cap nesting is that I'd like to make
> >>>>>>> the pasid alloc/free be available just for IOMMU with type
> >>>>>>> VFIO_IOMMU_TYPE1_NESTING. Please feel free tell me if it is not
> >>>>>>> good for ARM. We can find a proper way to report the availability.
> >>>>>>
> >>>>>> Well it is more a question for jean-Philippe. Do we have a system
> >>>>>> wide PASID allocation on ARM?
> >>>>>
> >>>>> We don't, the PASID spaces are per-VM on Arm, so this function
> >>>>> should consult
> >> the
> >>>>> IOMMU driver before setting flags. As you said on patch 3, nested
> >>>>> doesn't necessarily imply PASID support. The SMMUv2 does not
> >>>>> support PASID but does support nesting stages 1 and 2 for the IOVA space.
> >>>>> SMMUv3 support of PASID depends on HW capabilities. So I think
> >>>>> this needs to
> >> be
> >>>>> finer grained:
> >>>>>
> >>>>> Does the container support:
> >>>>> * VFIO_IOMMU_PASID_REQUEST?
> >>>>>   -> Yes for VT-d 3
> >>>>>   -> No for Arm SMMU
> >>>>> * VFIO_IOMMU_{,UN}BIND_GUEST_PGTBL?
> >>>>>   -> Yes for VT-d 3
> >>>>>   -> Sometimes for SMMUv2
> >>>>>   -> No for SMMUv3 (if we go with BIND_PASID_TABLE, which is simpler
> due to
> >>>>>      PASID tables being in GPA space.)
> >>>>> * VFIO_IOMMU_BIND_PASID_TABLE?
> >>>>>   -> No for VT-d
> >>>>>   -> Sometimes for SMMUv3
> >>>>>
> >>>>> Any bind support implies VFIO_IOMMU_CACHE_INVALIDATE support.
> >>>>
> >>>> good summary. do you expect to see any
> >>>>
> >>>>>
> >>>>>>>>> +	nesting_cap->stage1_formats = formats;
> >>>>>>>> as spotted by Kevin, since a single format is supported, rename
> >>>>>>>
> >>>>>>> ok, I was believing it may be possible on ARM or so. :-) will
> >>>>>>> rename it.
> >>>>>
> >>>>> Yes I don't think an u32 is going to cut it for Arm :( We need to
> >>>>> describe all sorts
> >> of
> >>>>> capabilities for page and PASID tables (granules, GPA size,
> >>>>> ASID/PASID size, HW access/dirty, etc etc.) Just saying "Arm
> >>>>> stage-1 format" wouldn't mean much. I guess we could have a secondary
> vendor capability for these?
> >>>>
> >>>> Actually, I'm wondering if we can define some formats to stands for
> >>>> a set of capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may
> >>>> indicates the 1st level page table related caps (aw, a/d, SRE, EA
> >>>> and etc.). And vIOMMU can parse the capabilities.
> >>>
> >>> But eventually do we really need all those capability getters? I
> >>> mean can't we simply rely on the actual call to
> >>> VFIO_IOMMU_BIND_GUEST_PGTBL() to detect any mismatch? Definitively
> >>> the error handling may be heavier on userspace but can't we manage.
> >>
> >> I think we need to present these capabilities at boot time, long
> >> before the guest triggers a bind(). For example if the host SMMU
> >> doesn't support 16-bit ASID, we need to communicate that to the guest
> >> using vSMMU ID registers or PROBE properties. Otherwise a bind() will
> >> succeed, but if the guest uses 16-bit ASIDs in its CD, DMA will
> >> result in C_BAD_CD events which we'll inject into the guest, for no
> >> apparent reason from their perspective.
> >>
> >> In addition some VMMs may have fallbacks if shared page tables are
> >> not available. They could fall back to a MAP/UNMAP interface, or
> >> simply not present a vIOMMU to the guest.
> >>
> >
> > Based on the comments, I think it would be a need to report iommu caps
> > in detail. So I guess iommu uapi needs to provide something alike vfio
> > cap chain in iommu uapi. Please feel free let me know your thoughts.
> > :-)
> 
> Yes to me it sounds sensible.
> >
> > In vfio, we can define a cap as below:
> >
> > struct vfio_iommu_type1_info_cap_nesting {
> > 	struct  vfio_info_cap_header header;
> > 	__u64	iommu_model;
> > #define VFIO_IOMMU_PASID_REQS		(1 << 0)
> I still think the name shall be changed

Yes, I'll rename it per your suggestion. :-)

> > #define VFIO_IOMMU_BIND_GPASID		(1 << 1)
> > #define VFIO_IOMMU_CACHE_INV		(1 << 2)
> this operation seems mandated as soon as we have a nested paging based
> implementation?

Oh, yes, it should be. I'll remove it and add a comment in the code.

Regards,
Yi Liu

> > 	__u32	nesting_capabilities;
> > 	__u32	pasid_bits;
> > #define VFIO_IOMMU_VENDOR_SUB_CAP	(1 << 3)
> > 	__u32	flags;
> > 	__u32	data_size;
> > 	__u8	data[];  /*iommu info caps defined by iommu uapi */
> > };
> >
> > VFIO needs new iommu APIs to ask iommu driver whether
> > PASID/bind_gpasid/ cache_inv/bind_gpasid_table is available or not and
> > also the pasid bits. After that VFIO will ask iommu driver about the
> > iommu_cap_info and fill in the @data[] field.
> >
> > iommu uapi:
> > struct iommu_info_cap_header {
> > 	__u16	id;		/* Identifies capability */
> > 	__u16	version;		/* Version specific to the capability ID */
> > 	__u32	next;		/* Offset of next capability */
> > };
> >
> > #define IOMMU_INFO_CAP_INTEL_VTD 1
> > struct iommu_info_cap_intel_vtd {
> > 	struct	iommu_info_cap_header header;
> > 	__u32   vaddr_width;   /* VA addr_width*/
> > 	__u32   ipaddr_width; /* IPA addr_width, input of SL page table */
> > 	/* same definition with @flags instruct iommu_gpasid_bind_data_vtd */
> > 	__u64	flags;
> > };
> >
> > #define IOMMU_INFO_CAP_ARM_SMMUv3 2
> > struct iommu_info_cap_arm_smmuv3 {
> > 	struct	iommu_info_cap_header header;
> > 	...
> > };
> 
> Thanks
> 
> Eric
> >
> > Regards,
> > Yi Liu
> >



* RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace
  2020-04-09 12:47                 ` Liu, Yi L
  2020-04-10  3:28                   ` Auger Eric
@ 2020-04-10 12:30                   ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-10 12:30 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric, jacob.jun.pan
  Cc: alex.williamson, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok,
	Tian, Jun J, Sun, Yi Y, peterx, iommu, kvm, linux-kernel, Wu,
	Hao

Hi Jean, Eric,

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, April 9, 2020 8:47 PM
> Subject: RE: [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to
> userspace
> 
[...]
> > > >>
> > > >> Yes I don't think an u32 is going to cut it for Arm :( We need to
> > > >> describe all sorts
> > of
> > > >> capabilities for page and PASID tables (granules, GPA size, ASID/PASID size,
> HW
> > > >> access/dirty, etc etc.) Just saying "Arm stage-1 format" wouldn't mean
> much. I
> > > >> guess we could have a secondary vendor capability for these?
> > > >
> > > > Actually, I'm wondering if we can define some formats to stands for a set of
> > > > capabilities. e.g. VTD_STAGE1_FORMAT_V1 which may indicates the 1st
> level
> > > > page table related caps (aw, a/d, SRE, EA and etc.). And vIOMMU can parse
> > > > the capabilities.
> > >
> > > But eventually do we really need all those capability getters? I mean
> > > can't we simply rely on the actual call to VFIO_IOMMU_BIND_GUEST_PGTBL()
> > > to detect any mismatch? Definitively the error handling may be heavier
> > > on userspace but can't we manage.
> >
> > I think we need to present these capabilities at boot time, long before
> > the guest triggers a bind(). For example if the host SMMU doesn't support
> > 16-bit ASID, we need to communicate that to the guest using vSMMU ID
> > registers or PROBE properties. Otherwise a bind() will succeed, but if the
> > guest uses 16-bit ASIDs in its CD, DMA will result in C_BAD_CD events
> > which we'll inject into the guest, for no apparent reason from their
> > perspective.
> >
> > In addition some VMMs may have fallbacks if shared page tables are not
> > available. They could fall back to a MAP/UNMAP interface, or simply not
> > present a vIOMMU to the guest.
> >
> 
> Based on the comments, I think it would be a need to report iommu caps
> in detail. So I guess iommu uapi needs to provide something alike vfio
> cap chain in iommu uapi. Please feel free let me know your thoughts. :-)

On further thought, I guess it may be better to start simpler. A cap chain
suits the case where there are multiple caps, e.g. some vendor iommu driver
may want to report iommu capabilities via multiple caps. On the VT-d side,
the host IOMMU capability could be reported in a single cap structure.
I'm not sure about the ARM side. Will there be multiple iommu_info_caps for ARM?

> In vfio, we can define a cap as below:
>
> struct vfio_iommu_type1_info_cap_nesting {
> 	struct  vfio_info_cap_header header;
> 	__u64	iommu_model;
> #define VFIO_IOMMU_PASID_REQS		(1 << 0)
> #define VFIO_IOMMU_BIND_GPASID		(1 << 1)
> #define VFIO_IOMMU_CACHE_INV		(1 << 2)
> 	__u32	nesting_capabilities;
> 	__u32	pasid_bits;
> #define VFIO_IOMMU_VENDOR_SUB_CAP	(1 << 3)
> 	__u32	flags;
> 	__u32	data_size;
> 	__u8	data[];  /*iommu info caps defined by iommu uapi */
> };
> 

If an iommu vendor driver only needs one cap structure to report hw
capability, then I think we needn't implement a cap chain in the iommu
uapi. The layout of the @data[] field could be determined by the
@iommu_model and @flags fields. This would be easier. Thoughts?
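
To illustrate the chain-less variant, below is a small sketch of how
userspace might derive the expected @data[] layout from @iommu_model and
@flags alone. All EX_* names, the model values and the ex_* structures are
assumptions made for the example. The thread does not define values for
@iommu_model, and the SMMUv3 payload is still elided.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model IDs; the thread does not define these values. */
#define EX_IOMMU_MODEL_INTEL_VTD   1
#define EX_IOMMU_MODEL_ARM_SMMUV3  2

#define EX_VENDOR_SUB_CAP (1u << 3)  /* mirrors VFIO_IOMMU_VENDOR_SUB_CAP */

/* Payload mirroring the proposed VT-d fields, without a chain header. */
struct ex_vtd_info {
	uint32_t vaddr_width;
	uint32_t ipaddr_width;
	uint64_t flags;
};

/* Fixed part of the proposed nesting cap (vfio header omitted for brevity). */
struct ex_nesting_cap {
	uint64_t iommu_model;
	uint32_t nesting_capabilities;
	uint32_t pasid_bits;
	uint32_t flags;
	uint32_t data_size;
};

/*
 * Return the payload size userspace should expect in data[], 0 when no
 * vendor payload is advertised, or -1 when the model is unknown or
 * data_size does not match the expected layout.
 */
static long ex_expected_data_size(const struct ex_nesting_cap *cap)
{
	size_t want;

	if (!(cap->flags & EX_VENDOR_SUB_CAP))
		return 0;

	switch (cap->iommu_model) {
	case EX_IOMMU_MODEL_INTEL_VTD:
		want = sizeof(struct ex_vtd_info);
		break;
	default:
		/* The SMMUv3 layout is still elided in the proposal. */
		return -1;
	}
	return cap->data_size == want ? (long)want : -1;
}
```

The trade-off is that every new payload layout needs a new model or flag
value, whereas a chain lets a driver add caps independently.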

> VFIO needs new iommu APIs to ask iommu driver whether PASID/bind_gpasid/
> cache_inv/bind_gpasid_table is available or not and also the pasid
> bits. After that VFIO will ask iommu driver about the iommu_cap_info
> and fill in the @data[] field.
>
> iommu uapi:
> struct iommu_info_cap_header {
> 	__u16	id;		/* Identifies capability */
> 	__u16	version;		/* Version specific to the capability ID */
> 	__u32	next;		/* Offset of next capability */
> };
> 
> #define IOMMU_INFO_CAP_INTEL_VTD 1
> struct iommu_info_cap_intel_vtd {
> 	struct	iommu_info_cap_header header;
> 	__u32   vaddr_width;   /* VA addr_width*/
> 	__u32   ipaddr_width; /* IPA addr_width, input of SL page table */
> 	/* same definition with @flags instruct iommu_gpasid_bind_data_vtd */
> 	__u64	flags;
> };
> 
> #define IOMMU_INFO_CAP_ARM_SMMUv3 2
> struct iommu_info_cap_arm_smmuv3 {
> 	struct	iommu_info_cap_header header;
> 	...
> };

Regards,
Yi Liu



* RE: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
  2020-04-02 19:57   ` Alex Williamson
  2020-04-03 13:30     ` Liu, Yi L
@ 2020-04-11  5:52     ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-11  5:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: eric.auger, Tian, Kevin, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 3:57 AM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 6/8] vfio/type1: Bind guest page tables to host
> 
> On Sun, 22 Mar 2020 05:32:03 -0700
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > From: Liu Yi L <yi.l.liu@intel.com>
> >
> > VFIO_TYPE1_NESTING_IOMMU is an IOMMU type which is backed by
[...]
> > +/**
> > + * Unbind specific gpasid, caller of this function requires hold
> > + * vfio_iommu->lock
> > + */
> > +static long vfio_iommu_type1_do_guest_unbind(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	return vfio_iommu_for_each_dev(iommu,
> > +				vfio_unbind_gpasid_fn, gbind_data);
> > +}
> > +
> > +static long vfio_iommu_type1_bind_gpasid(struct vfio_iommu *iommu,
> > +				struct iommu_gpasid_bind_data *gbind_data)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&iommu->lock);
> > +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)) {
> > +		ret = -EINVAL;
> > +		goto out_unlock;
> > +	}
> > +
> > +	ret = vfio_iommu_for_each_dev(iommu,
> > +			vfio_bind_gpasid_fn, gbind_data);
> > +	/*
> > +	 * If bind failed, it may not be a total failure. Some devices
> > +	 * within the iommu group may have bind successfully. Although
> > +	 * we don't enable pasid capability for non-singletion iommu
> > +	 * groups, a unbind operation would be helpful to ensure no
> > +	 * partial binding for an iommu group.
> 
> Where was the non-singleton group restriction done, I missed that.
> 
> > +	 */
> > +	if (ret)
> > +		/*
> > +		 * Undo all binds that already succeeded, no need to
> > +		 * check the return value here since some device within
> > +		 * the group has no successful bind when coming to this
> > +		 * place switch.
> > +		 */
> > +		vfio_iommu_type1_do_guest_unbind(iommu, gbind_data);
> 
> However, the for_each_dev function stops when the callback function
> returns error, are we just assuming we stop at the same device as we
> faulted on the first time and that we traverse the same set of devices
> the second time?  It seems strange to me that unbind should be able to
> fail.

I think the code needs enhancement. Although one group per container
and one device per group is the most typical and desired case, the
code here loops over domains, groups and devices. It should be able to
unwind prior binds when the loop fails for a device. So I plan to make
the change below in the bind path.

list_for_each_entry(domain, &iommu->domain_list, next) {
	list_for_each_entry(group, &domain->group_list, next) {
		/*
		 * If bind fails on a certain device, the prior successful
		 * binds should be undone. iommu_group_for_each_dev() should
		 * be modified to take two callbacks, one for the forward loop
		 * and one for the reverse loop when a failure happens.
		 * "return_upon_failure" indicates whether to return upon
		 * failure during the forward loop. If yes,
		 * iommu_group_for_each_dev() should unwind the prior binds in
		 * this iommu group before returning.
		 */
		ret = iommu_group_for_each_dev(group->iommu_group, bind_gpasid_fn,
					unbind_gpasid_fn, data, return_upon_failure);
		if (ret)
			break;
	}
	if (ret) {
		/* unwind bindings of prior groups in this domain */
		list_for_each_entry_continue_reverse(group,
							&domain->group_list, next) {
			iommu_group_for_each_dev(group->iommu_group, unbind_gpasid_fn,
						NULL, data, ignore_intermediate_failure);
		}
		break;
	}
}

if (ret) {
	/* unwind bindings of all groups in prior domains */
	list_for_each_entry_continue_reverse(domain, &iommu->domain_list, next) {
		list_for_each_entry(group, &domain->group_list, next) {
			iommu_group_for_each_dev(group->iommu_group, unbind_gpasid_fn,
						NULL, data, ignore_intermediate_failure);
		}
	}
}

return ret;
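
To make the intended behaviour of the two-callback iterator concrete, here
is a self-contained toy model in plain C. It is only a sketch under
assumptions: devices are plain integers in an array, and
for_each_dev_unwind(), ex_bind() and ex_unbind() are made-up names. It is
not the kernel's iommu_group_for_each_dev().

```c
#include <stddef.h>

typedef int (*dev_fn)(int dev, void *data);

/*
 * Toy stand-in for the two-callback iterator proposed above: apply fn to
 * each device in order; if fn fails on device i, call undo on devices
 * i-1 .. 0 and return the error. Failures of undo are ignored.
 */
static int for_each_dev_unwind(const int *devs, size_t n,
			       dev_fn fn, dev_fn undo, void *data)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = fn(devs[i], data);
		if (ret)
			break;
	}
	if (ret && undo) {
		/* Only devices before the failing one were bound. */
		while (i-- > 0)
			(void)undo(devs[i], data);
	}
	return ret;
}

/* Example callbacks: data points at a bitmask of bound devices (ids < 32). */
static int ex_bind(int dev, void *data)
{
	unsigned int *bound = data;

	if (dev < 0)
		return -1;  /* a negative id simulates a bind failure */
	*bound |= 1u << dev;
	return 0;
}

static int ex_unbind(int dev, void *data)
{
	unsigned int *bound = data;

	*bound &= ~(1u << dev);
	return 0;
}
```

Because the unwind only touches devices that were bound in this pass, it
does not depend on a second traversal stopping at the same device, which
was the concern raised about the original code.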

Regards,
Yi Liu


* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-03 15:34       ` Alex Williamson
  2020-04-08  2:28         ` Liu, Yi L
@ 2020-04-16 10:40         ` Liu, Yi L
  2020-04-16 12:09           ` Tian, Kevin
  2020-04-16 14:40           ` Alex Williamson
  1 sibling, 2 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-16 10:40 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Alex,
I still have a question about direction for you. I'd like to get your
agreement before moving forward.

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 3, 2020 11:35 PM
[...]
> > > > + *
> > > > + * returns: 0 on success, -errno on failure.
> > > > + */
> > > > +struct vfio_iommu_type1_cache_invalidate {
> > > > +	__u32   argsz;
> > > > +	__u32   flags;
> > > > +	struct	iommu_cache_invalidate_info cache_info;
> > > > +};
> > > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE,
> VFIO_BASE
> > > + 24)
> > >
> > > The future extension capabilities of this ioctl worry me, I wonder if
> > > we should do another data[] with flag defining that data as CACHE_INFO.
> >
> > Can you elaborate? Does it mean with this way we don't rely on iommu
> > driver to provide version_to_size conversion and instead we just pass
> > data[] to iommu driver for further audit?
> 
> No, my concern is that this ioctl has a single function, strictly tied
> to the iommu uapi.  If we replace cache_info with data[] then we can
> define a flag to specify that data[] is struct
> iommu_cache_invalidate_info, and if we need to, a different flag to
> identify data[] as something else.  For example if we get stuck
> expanding cache_info to meet new demands and develop a new uapi to
> solve that, how would we expand this ioctl to support it rather than
> also create a new ioctl?  There's also a trade-off in making the ioctl
> usage more difficult for the user.  I'd still expect the vfio layer to
> check the flag and interpret data[] as indicated by the flag rather
> than just passing a blob of opaque data to the iommu layer though.
> Thanks,

Based on your comments, I looked at defining a single ioctl and a unified
vfio structure (with a @data[] field) for pasid_alloc/free, bind/
unbind_gpasid and cache_inv. After some offline experiments, I think this
works well for bind/unbind_gpasid and cache_inv, as both of them use the
iommu uapi definitions, while the pasid alloc/free operation doesn't.
It would be weird to put all of them together, so pasid alloc/free
may have a separate ioctl. It would look as below. Does this direction
look good to you?

ioctl #22: VFIO_IOMMU_PASID_REQUEST
/**
  * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
  *         specify a pasid to be freed when flags == FREE_PASID
  * @range: specify the allocation range when flags == ALLOC_PASID
  */
struct vfio_iommu_pasid_request {
	__u32	argsz;
#define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
#define VFIO_IOMMU_FREE_PASID	(1 << 1)
	__u32	flags;
	__u32	pasid;
	struct {
		__u32	min;
		__u32	max;
	} range;
};
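
For illustration only, a request of this shape could be sanity-checked
roughly as below before any allocation is attempted. The ex_* names are
hypothetical mirrors of the structure above, and the rules (exactly one
of ALLOC/FREE, no unknown flag bits, min <= max for ALLOC) are my
assumptions about what a handler would want to enforce.

```c
#include <errno.h>
#include <stdint.h>

#define EX_ALLOC_PASID (1u << 0)  /* mirrors VFIO_IOMMU_ALLOC_PASID */
#define EX_FREE_PASID  (1u << 1)  /* mirrors VFIO_IOMMU_FREE_PASID */

struct ex_pasid_request {
	uint32_t argsz;
	uint32_t flags;
	uint32_t pasid;
	struct {
		uint32_t min;
		uint32_t max;
	} range;
};

/*
 * Hypothetical sanity check: argsz must cover the structure, exactly one
 * of ALLOC/FREE must be set, and an ALLOC request needs min <= max.
 * Returns 0 if the request is well formed, -EINVAL otherwise.
 */
static int ex_check_pasid_request(const struct ex_pasid_request *req)
{
	uint32_t op = req->flags & (EX_ALLOC_PASID | EX_FREE_PASID);

	if (req->argsz < sizeof(*req))
		return -EINVAL;
	if (req->flags != op)
		return -EINVAL;  /* unknown flag bits set */
	if (op != EX_ALLOC_PASID && op != EX_FREE_PASID)
		return -EINVAL;  /* neither operation, or both */
	if (op == EX_ALLOC_PASID && req->range.min > req->range.max)
		return -EINVAL;
	return 0;
}
```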

ioctl #23: VFIO_IOMMU_NESTING_OP
struct vfio_iommu_type1_nesting_op {
	__u32	argsz;
	__u32	flags;
	__u32	op;
	__u8	data[];
};

/* Nesting Ops */
#define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
#define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
#define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
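
A rough sketch of how such a multiplexed op could be validated and
dispatched is shown below. It follows the earlier suggestion that the vfio
layer should check the selector and interpret data[] itself instead of
passing an opaque blob down. The ex_* names are hypothetical, no flags are
assumed to be defined yet, and the return value merely stands in for
calling the matching handler.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NESTING_OP_BIND_PGTBL    0
#define EX_NESTING_OP_UNBIND_PGTBL  1
#define EX_NESTING_OP_CACHE_INVLD   2

/* Fixed header of the proposed nesting op; data[] follows it in memory. */
struct ex_nesting_op_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint32_t op;
};

/*
 * Hypothetical dispatcher: validate the header, then report which handler
 * would run. payload_min is the smallest payload the handlers accept; in
 * a real implementation it would depend on op and on the version of the
 * iommu uapi structure carried in data[].
 */
static int ex_nesting_dispatch(const struct ex_nesting_op_hdr *hdr,
			       size_t payload_min)
{
	if (hdr->argsz < sizeof(*hdr) ||
	    hdr->argsz - sizeof(*hdr) < payload_min)
		return -EINVAL;
	if (hdr->flags)
		return -EINVAL;  /* no flags are defined in this sketch */

	switch (hdr->op) {
	case EX_NESTING_OP_BIND_PGTBL:
	case EX_NESTING_OP_UNBIND_PGTBL:
	case EX_NESTING_OP_CACHE_INVLD:
		return (int)hdr->op;  /* stand-in for calling the handler */
	default:
		return -EOPNOTSUPP;
	}
}
```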
 
Thanks,
Yi Liu


* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 10:40         ` Liu, Yi L
@ 2020-04-16 12:09           ` Tian, Kevin
  2020-04-16 12:42             ` Auger Eric
  2020-04-16 14:40           ` Alex Williamson
  1 sibling, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-16 12:09 UTC (permalink / raw)
  To: Liu, Yi L, Alex Williamson
  Cc: eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun,
	Yi Y, jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, April 16, 2020 6:40 PM
> 
> Hi Alex,
> Still have a direction question with you. Better get agreement with you
> before heading forward.
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 11:35 PM
> [...]
> > > > > + *
> > > > > + * returns: 0 on success, -errno on failure.
> > > > > + */
> > > > > +struct vfio_iommu_type1_cache_invalidate {
> > > > > +	__u32   argsz;
> > > > > +	__u32   flags;
> > > > > +	struct	iommu_cache_invalidate_info cache_info;
> > > > > +};
> > > > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE,
> > VFIO_BASE
> > > > + 24)
> > > >
> > > > The future extension capabilities of this ioctl worry me, I wonder if
> > > > we should do another data[] with flag defining that data as
> CACHE_INFO.
> > >
> > > Can you elaborate? Does it mean with this way we don't rely on iommu
> > > driver to provide version_to_size conversion and instead we just pass
> > > data[] to iommu driver for further audit?
> >
> > No, my concern is that this ioctl has a single function, strictly tied
> > to the iommu uapi.  If we replace cache_info with data[] then we can
> > define a flag to specify that data[] is struct
> > iommu_cache_invalidate_info, and if we need to, a different flag to
> > identify data[] as something else.  For example if we get stuck
> > expanding cache_info to meet new demands and develop a new uapi to
> > solve that, how would we expand this ioctl to support it rather than
> > also create a new ioctl?  There's also a trade-off in making the ioctl
> > usage more difficult for the user.  I'd still expect the vfio layer to
> > check the flag and interpret data[] as indicated by the flag rather
> > than just passing a blob of opaque data to the iommu layer though.
> > Thanks,
> 
> Based on your comments about defining a single ioctl and a unified
> vfio structure (with a @data[] field) for pasid_alloc/free, bind/
> unbind_gpasid, cache_inv. After some offline trying, I think it would
> be good for bind/unbind_gpasid and cache_inv as both of them use the
> iommu uapi definition. While the pasid alloc/free operation doesn't.
> It would be weird to put all of them together. So pasid alloc/free
> may have a separate ioctl. It would look as below. Does this direction
> look good per your opinion?
> 
> ioctl #22: VFIO_IOMMU_PASID_REQUEST
> /**
>   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
>   *         specify a pasid to be freed when flags == FREE_PASID
>   * @range: specify the allocation range when flags == ALLOC_PASID
>   */
> struct vfio_iommu_pasid_request {
> 	__u32	argsz;
> #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
> #define VFIO_IOMMU_FREE_PASID	(1 << 1)
> 	__u32	flags;
> 	__u32	pasid;
> 	struct {
> 		__u32	min;
> 		__u32	max;
> 	} range;
> };
> 
> ioctl #23: VFIO_IOMMU_NESTING_OP
> struct vfio_iommu_type1_nesting_op {
> 	__u32	argsz;
> 	__u32	flags;
> 	__u32	op;
> 	__u8	data[];
> };
> 
> /* Nesting Ops */
> #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
> #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
> #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
> 

Then why can't we just put the PASID into the header, since the
majority of nested usage is associated with a PASID?

ioctl #23: VFIO_IOMMU_NESTING_OP
struct vfio_iommu_type1_nesting_op {
	__u32	argsz;
	__u32	flags;
	__u32	op;
	__u32   pasid;
	__u8	data[];
};

In case of SMMUv2 which supports nested w/o PASID, this field can
be ignored for that specific case.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 12:09           ` Tian, Kevin
@ 2020-04-16 12:42             ` Auger Eric
  2020-04-16 13:28               ` Tian, Kevin
  0 siblings, 1 reply; 110+ messages in thread
From: Auger Eric @ 2020-04-16 12:42 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, Alex Williamson
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Kevin,
On 4/16/20 2:09 PM, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Thursday, April 16, 2020 6:40 PM
>>
>> Hi Alex,
>> Still have a direction question with you. Better get agreement with you
>> before heading forward.
>>
>>> From: Alex Williamson <alex.williamson@redhat.com>
>>> Sent: Friday, April 3, 2020 11:35 PM
>> [...]
>>>>>> + *
>>>>>> + * returns: 0 on success, -errno on failure.
>>>>>> + */
>>>>>> +struct vfio_iommu_type1_cache_invalidate {
>>>>>> +	__u32   argsz;
>>>>>> +	__u32   flags;
>>>>>> +	struct	iommu_cache_invalidate_info cache_info;
>>>>>> +};
>>>>>> +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
>>>>>
>>>>> The future extension capabilities of this ioctl worry me, I wonder if
>>>>> we should do another data[] with flag defining that data as CACHE_INFO.
>>>>
>>>> Can you elaborate? Does it mean with this way we don't rely on iommu
>>>> driver to provide version_to_size conversion and instead we just pass
>>>> data[] to iommu driver for further audit?
>>>
>>> No, my concern is that this ioctl has a single function, strictly tied
>>> to the iommu uapi.  If we replace cache_info with data[] then we can
>>> define a flag to specify that data[] is struct
>>> iommu_cache_invalidate_info, and if we need to, a different flag to
>>> identify data[] as something else.  For example if we get stuck
>>> expanding cache_info to meet new demands and develop a new uapi to
>>> solve that, how would we expand this ioctl to support it rather than
>>> also create a new ioctl?  There's also a trade-off in making the ioctl
>>> usage more difficult for the user.  I'd still expect the vfio layer to
>>> check the flag and interpret data[] as indicated by the flag rather
>>> than just passing a blob of opaque data to the iommu layer though.
>>> Thanks,
>>
>> Based on your comments about defining a single ioctl and a unified
>> vfio structure (with a @data[] field) for pasid_alloc/free, bind/
>> unbind_gpasid, cache_inv. After some offline trying, I think it would
>> be good for bind/unbind_gpasid and cache_inv as both of them use the
>> iommu uapi definition. While the pasid alloc/free operation doesn't.
>> It would be weird to put all of them together. So pasid alloc/free
>> may have a separate ioctl. It would look as below. Does this direction
>> look good per your opinion?
>>
>> ioctl #22: VFIO_IOMMU_PASID_REQUEST
>> /**
>>   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
>>   *         specify a pasid to be freed when flags == FREE_PASID
>>   * @range: specify the allocation range when flags == ALLOC_PASID
>>   */
>> struct vfio_iommu_pasid_request {
>> 	__u32	argsz;
>> #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
>> #define VFIO_IOMMU_FREE_PASID	(1 << 1)
>> 	__u32	flags;
>> 	__u32	pasid;
>> 	struct {
>> 		__u32	min;
>> 		__u32	max;
>> 	} range;
>> };
>>
>> ioctl #23: VFIO_IOMMU_NESTING_OP
>> struct vfio_iommu_type1_nesting_op {
>> 	__u32	argsz;
>> 	__u32	flags;
>> 	__u32	op;
>> 	__u8	data[];
>> };
>>
>> /* Nesting Ops */
>> #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
>> #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
>> #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
>>
> 
> Then why cannot we just put PASID into the header since the
> majority of nested usage is associated with a pasid? 
> 
> ioctl #23: VFIO_IOMMU_NESTING_OP
> struct vfio_iommu_type1_nesting_op {
> 	__u32	argsz;
> 	__u32	flags;
> 	__u32	op;
> 	__u32   pasid;
> 	__u8	data[];
> };
> 
> In case of SMMUv2 which supports nested w/o PASID, this field can
> be ignored for that specific case.
On my side I would prefer keeping the pasid in the data[]. This is not
always used.

For instance, in iommu_cache_invalidate_info/iommu_inv_pasid_info we
devised flags to tell whether the PASID is used.

Thanks

Eric
> 
> Thanks
> Kevin
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 12:42             ` Auger Eric
@ 2020-04-16 13:28               ` Tian, Kevin
  2020-04-16 15:12                 ` Auger Eric
  0 siblings, 1 reply; 110+ messages in thread
From: Tian, Kevin @ 2020-04-16 13:28 UTC (permalink / raw)
  To: Auger Eric, Liu, Yi L, Alex Williamson
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

> From: Auger Eric <eric.auger@redhat.com>
> Sent: Thursday, April 16, 2020 8:43 PM
> 
> Hi Kevin,
> On 4/16/20 2:09 PM, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Thursday, April 16, 2020 6:40 PM
> >>
> >> Hi Alex,
> >> Still have a direction question with you. Better get agreement with you
> >> before heading forward.
> >>
> >>> From: Alex Williamson <alex.williamson@redhat.com>
> >>> Sent: Friday, April 3, 2020 11:35 PM
> >> [...]
> >>>>>> + *
> >>>>>> + * returns: 0 on success, -errno on failure.
> >>>>>> + */
> >>>>>> +struct vfio_iommu_type1_cache_invalidate {
> >>>>>> +	__u32   argsz;
> >>>>>> +	__u32   flags;
> >>>>>> +	struct	iommu_cache_invalidate_info cache_info;
> >>>>>> +};
> >>>>>> +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> >>>>>
> >>>>> The future extension capabilities of this ioctl worry me, I wonder if
> >>>>> we should do another data[] with flag defining that data as CACHE_INFO.
> >>>>
> >>>> Can you elaborate? Does it mean with this way we don't rely on iommu
> >>>> driver to provide version_to_size conversion and instead we just pass
> >>>> data[] to iommu driver for further audit?
> >>>
> >>> No, my concern is that this ioctl has a single function, strictly tied
> >>> to the iommu uapi.  If we replace cache_info with data[] then we can
> >>> define a flag to specify that data[] is struct
> >>> iommu_cache_invalidate_info, and if we need to, a different flag to
> >>> identify data[] as something else.  For example if we get stuck
> >>> expanding cache_info to meet new demands and develop a new uapi to
> >>> solve that, how would we expand this ioctl to support it rather than
> >>> also create a new ioctl?  There's also a trade-off in making the ioctl
> >>> usage more difficult for the user.  I'd still expect the vfio layer to
> >>> check the flag and interpret data[] as indicated by the flag rather
> >>> than just passing a blob of opaque data to the iommu layer though.
> >>> Thanks,
> >>
> >> Based on your comments about defining a single ioctl and a unified
> >> vfio structure (with a @data[] field) for pasid_alloc/free, bind/
> >> unbind_gpasid, cache_inv. After some offline trying, I think it would
> >> be good for bind/unbind_gpasid and cache_inv as both of them use the
> >> iommu uapi definition. While the pasid alloc/free operation doesn't.
> >> It would be weird to put all of them together. So pasid alloc/free
> >> may have a separate ioctl. It would look as below. Does this direction
> >> look good per your opinion?
> >>
> >> ioctl #22: VFIO_IOMMU_PASID_REQUEST
> >> /**
> >>   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
> >>   *         specify a pasid to be freed when flags == FREE_PASID
> >>   * @range: specify the allocation range when flags == ALLOC_PASID
> >>   */
> >> struct vfio_iommu_pasid_request {
> >> 	__u32	argsz;
> >> #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
> >> #define VFIO_IOMMU_FREE_PASID	(1 << 1)
> >> 	__u32	flags;
> >> 	__u32	pasid;
> >> 	struct {
> >> 		__u32	min;
> >> 		__u32	max;
> >> 	} range;
> >> };
> >>
> >> ioctl #23: VFIO_IOMMU_NESTING_OP
> >> struct vfio_iommu_type1_nesting_op {
> >> 	__u32	argsz;
> >> 	__u32	flags;
> >> 	__u32	op;
> >> 	__u8	data[];
> >> };
> >>
> >> /* Nesting Ops */
> >> #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
> >> #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
> >> #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
> >>
> >
> > Then why cannot we just put PASID into the header since the
> > majority of nested usage is associated with a pasid?
> >
> > ioctl #23: VFIO_IOMMU_NESTING_OP
> > struct vfio_iommu_type1_nesting_op {
> > 	__u32	argsz;
> > 	__u32	flags;
> > 	__u32	op;
> > 	__u32   pasid;
> > 	__u8	data[];
> > };
> >
> > In case of SMMUv2 which supports nested w/o PASID, this field can
> > be ignored for that specific case.
> On my side I would prefer keeping the pasid in the data[]. This is not
> always used.
> 
> For instance, in iommu_cache_invalidate_info/iommu_inv_pasid_info we
> devised flags to tell whether the PASID is used.
> 

But don't we include a PASID in both invalidate structures already?

struct iommu_inv_addr_info {
#define IOMMU_INV_ADDR_FLAGS_PASID      (1 << 0)
#define IOMMU_INV_ADDR_FLAGS_ARCHID     (1 << 1)
#define IOMMU_INV_ADDR_FLAGS_LEAF       (1 << 2)
        __u32   flags;
        __u32   archid;
        __u64   pasid;
        __u64   addr;
        __u64   granule_size;
        __u64   nb_granules;
};

struct iommu_inv_pasid_info {
#define IOMMU_INV_PASID_FLAGS_PASID     (1 << 0)
#define IOMMU_INV_PASID_FLAGS_ARCHID    (1 << 1)
        __u32   flags;
        __u32   archid;
        __u64   pasid;
};

Then consolidating the pasid field into the generic header doesn't
hurt. The specific handler would still rely on flags to tell whether
it is used, right?
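
To make the proposal concrete, here is a minimal sketch of a header-level
PASID that a handler only consumes when a flag says it is valid. The struct
name, the flag NESTING_HDR_FLAG_PASID_VALID and the helper are hypothetical
(they are not part of the posted series), and uint32_t stands in for __u32.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag: set when hdr->pasid carries a meaningful value. */
#define NESTING_HDR_FLAG_PASID_VALID	(1u << 0)

/* Hypothetical header with the PASID hoisted out of data[]. */
struct nesting_hdr {
	uint32_t argsz;
	uint32_t flags;
	uint32_t op;
	uint32_t pasid;	/* ignored unless PASID_VALID is set */
};

/*
 * Returns 1 and stores the PASID in *pasid when the header carries one,
 * returns 0 and leaves *pasid untouched otherwise.
 */
static int nesting_hdr_get_pasid(const struct nesting_hdr *hdr, uint32_t *pasid)
{
	assert(hdr && pasid);

	if (!(hdr->flags & NESTING_HDR_FLAG_PASID_VALID))
		return 0;

	*pasid = hdr->pasid;
	return 1;
}
```

With four 32-bit fields the header is 16 bytes, so anything placed right
after it would start at an 8-byte-aligned offset.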

Thanks
Kevin

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 10:40         ` Liu, Yi L
  2020-04-16 12:09           ` Tian, Kevin
@ 2020-04-16 14:40           ` Alex Williamson
  2020-04-16 14:48             ` Alex Williamson
  2020-04-17  6:03             ` Liu, Yi L
  1 sibling, 2 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-16 14:40 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

On Thu, 16 Apr 2020 10:40:03 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> Still have a direction question with you. Better get agreement with you
> before heading forward.
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 3, 2020 11:35 PM  
> [...]
> > > > > + *
> > > > > + * returns: 0 on success, -errno on failure.
> > > > > + */
> > > > > +struct vfio_iommu_type1_cache_invalidate {
> > > > > +	__u32   argsz;
> > > > > +	__u32   flags;
> > > > > +	struct	iommu_cache_invalidate_info cache_info;
> > > > > +};
> > > > > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> > > >
> > > > The future extension capabilities of this ioctl worry me, I wonder if
> > > > we should do another data[] with flag defining that data as CACHE_INFO.  
> > >
> > > Can you elaborate? Does it mean with this way we don't rely on iommu
> > > driver to provide version_to_size conversion and instead we just pass
> > > data[] to iommu driver for further audit?  
> > 
> > No, my concern is that this ioctl has a single function, strictly tied
> > to the iommu uapi.  If we replace cache_info with data[] then we can
> > define a flag to specify that data[] is struct
> > iommu_cache_invalidate_info, and if we need to, a different flag to
> > identify data[] as something else.  For example if we get stuck
> > expanding cache_info to meet new demands and develop a new uapi to
> > solve that, how would we expand this ioctl to support it rather than
> > also create a new ioctl?  There's also a trade-off in making the ioctl
> > usage more difficult for the user.  I'd still expect the vfio layer to
> > check the flag and interpret data[] as indicated by the flag rather
> > than just passing a blob of opaque data to the iommu layer though.
> > Thanks,  
> 
> Based on your comments about defining a single ioctl and a unified
> vfio structure (with a @data[] field) for pasid_alloc/free, bind/
> unbind_gpasid, cache_inv. After some offline trying, I think it would
> be good for bind/unbind_gpasid and cache_inv as both of them use the
> iommu uapi definition. While the pasid alloc/free operation doesn't.
> It would be weird to put all of them together. So pasid alloc/free
> may have a separate ioctl. It would look as below. Does this direction
> look good per your opinion?
> 
> ioctl #22: VFIO_IOMMU_PASID_REQUEST
> /**
>   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
>   *         specify a pasid to be freed when flags == FREE_PASID
>   * @range: specify the allocation range when flags == ALLOC_PASID
>   */
> struct vfio_iommu_pasid_request {
> 	__u32	argsz;
> #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
> #define VFIO_IOMMU_FREE_PASID	(1 << 1)
> 	__u32	flags;
> 	__u32	pasid;
> 	struct {
> 		__u32	min;
> 		__u32	max;
> 	} range;
> };

Can't the ioctl return the pasid valid on alloc (like GET_DEVICE_FD)?
Would it be useful to support freeing a range of pasids?  If so then we
could simply use range for both, ie. allocate a pasid from this range
and return it, or free all pasids in this range?  vfio already needs to
track pasids to free them on release, so presumably this is something
we could support easily.
 
> ioctl #23: VFIO_IOMMU_NESTING_OP
> struct vfio_iommu_type1_nesting_op {
> 	__u32	argsz;
> 	__u32	flags;
> 	__u32	op;
> 	__u8	data[];
> };

data only has 4-byte alignment, I think we really want it at an 8-byte
alignment.  This is why I embedded the "op" into the flag for
DEVICE_FEATURE.  Thanks,

Alex
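
The alignment point can be checked with offsetof(). The sketch below is only
an illustration: the struct names are made up, uint32_t/uint8_t stand in for
__u32/__u8, and the second layout assumes the op is folded into flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as proposed: three 32-bit header fields precede data[]. */
struct nesting_op_v1 {
	uint32_t argsz;
	uint32_t flags;
	uint32_t op;
	uint8_t  data[];
};

/* Hypothetical alternative: op folded into flags, header is 8 bytes. */
struct nesting_op_v2 {
	uint32_t argsz;
	uint32_t flags;	/* would also carry the op in some of its bits */
	uint8_t  data[];
};

/* Compile-time check that the second layout puts data[] on an 8-byte boundary. */
static_assert(offsetof(struct nesting_op_v2, data) % 8 == 0,
	      "data[] must start at an 8-byte-aligned offset");

/* Offsets of data[] in each layout, for inspection at run time. */
static size_t v1_data_offset(void)
{
	return offsetof(struct nesting_op_v1, data);
}

static size_t v2_data_offset(void)
{
	return offsetof(struct nesting_op_v2, data);
}
```

In the first layout data[] starts at offset 12, so a payload containing
__u64 members would be misaligned whenever the header itself is 8-byte
aligned.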

> 
> /* Nesting Ops */
> #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
> #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
> #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
>  
> Thanks,
> Yi Liu
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 14:40           ` Alex Williamson
@ 2020-04-16 14:48             ` Alex Williamson
  2020-04-17  6:03             ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Alex Williamson @ 2020-04-16 14:48 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: jean-philippe, Tian, Kevin, Raj, Ashok, kvm, Tian, Jun J, iommu,
	linux-kernel, Sun, Yi Y, Wu, Hao

On Thu, 16 Apr 2020 08:40:31 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 16 Apr 2020 10:40:03 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> > Still have a direction question with you. Better get agreement with you
> > before heading forward.
> >   
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 11:35 PM    
> > [...]  
> > > > > > + *
> > > > > > + * returns: 0 on success, -errno on failure.
> > > > > > + */
> > > > > > +struct vfio_iommu_type1_cache_invalidate {
> > > > > > +	__u32   argsz;
> > > > > > +	__u32   flags;
> > > > > > +	struct	iommu_cache_invalidate_info cache_info;
> > > > > > +};
> > > > > > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> > > > >
> > > > > The future extension capabilities of this ioctl worry me, I wonder if
> > > > > we should do another data[] with flag defining that data as CACHE_INFO.    
> > > >
> > > > Can you elaborate? Does it mean with this way we don't rely on iommu
> > > > driver to provide version_to_size conversion and instead we just pass
> > > > data[] to iommu driver for further audit?    
> > > 
> > > No, my concern is that this ioctl has a single function, strictly tied
> > > to the iommu uapi.  If we replace cache_info with data[] then we can
> > > define a flag to specify that data[] is struct
> > > iommu_cache_invalidate_info, and if we need to, a different flag to
> > > identify data[] as something else.  For example if we get stuck
> > > expanding cache_info to meet new demands and develop a new uapi to
> > > solve that, how would we expand this ioctl to support it rather than
> > > also create a new ioctl?  There's also a trade-off in making the ioctl
> > > usage more difficult for the user.  I'd still expect the vfio layer to
> > > check the flag and interpret data[] as indicated by the flag rather
> > > than just passing a blob of opaque data to the iommu layer though.
> > > Thanks,    
> > 
> > Based on your comments about defining a single ioctl and a unified
> > vfio structure (with a @data[] field) for pasid_alloc/free, bind/
> > unbind_gpasid, cache_inv. After some offline trying, I think it would
> > be good for bind/unbind_gpasid and cache_inv as both of them use the
> > iommu uapi definition. While the pasid alloc/free operation doesn't.
> > It would be weird to put all of them together. So pasid alloc/free
> > may have a separate ioctl. It would look as below. Does this direction
> > look good per your opinion?
> > 
> > ioctl #22: VFIO_IOMMU_PASID_REQUEST
> > /**
> >   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
> >   *         specify a pasid to be freed when flags == FREE_PASID
> >   * @range: specify the allocation range when flags == ALLOC_PASID
> >   */
> > struct vfio_iommu_pasid_request {
> > 	__u32	argsz;
> > #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
> > #define VFIO_IOMMU_FREE_PASID	(1 << 1)
> > 	__u32	flags;
> > 	__u32	pasid;
> > 	struct {
> > 		__u32	min;
> > 		__u32	max;
> > 	} range;
> > };  
> 
> Can't the ioctl return the pasid valid on alloc (like GET_DEVICE_FD)?

s/valid/value/

> Would it be useful to support freeing a range of pasids?  If so then we
> could simply use range for both, ie. allocate a pasid from this range
> and return it, or free all pasids in this range?  vfio already needs to
> track pasids to free them on release, so presumably this is something
> we could support easily.
>  
> > ioctl #23: VFIO_IOMMU_NESTING_OP
> > struct vfio_iommu_type1_nesting_op {
> > 	__u32	argsz;
> > 	__u32	flags;
> > 	__u32	op;
> > 	__u8	data[];
> > };  
> 
> data only has 4-byte alignment, I think we really want it at an 8-byte
> alignment.  This is why I embedded the "op" into the flag for
> DEVICE_FEATURE.  Thanks,
> 
> Alex
> 
> > 
> > /* Nesting Ops */
> > #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
> > #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
> > #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
> >  
> > Thanks,
> > Yi Liu
> >   
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 13:28               ` Tian, Kevin
@ 2020-04-16 15:12                 ` Auger Eric
  0 siblings, 0 replies; 110+ messages in thread
From: Auger Eric @ 2020-04-16 15:12 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, Alex Williamson
  Cc: jacob.jun.pan, joro, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, iommu, kvm, linux-kernel, Wu, Hao

Hi Kevin,

On 4/16/20 3:28 PM, Tian, Kevin wrote:
>> From: Auger Eric <eric.auger@redhat.com>
>> Sent: Thursday, April 16, 2020 8:43 PM
>>
>> Hi Kevin,
>> On 4/16/20 2:09 PM, Tian, Kevin wrote:
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Thursday, April 16, 2020 6:40 PM
>>>>
>>>> Hi Alex,
>>>> Still have a direction question with you. Better get agreement with you
>>>> before heading forward.
>>>>
>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>> Sent: Friday, April 3, 2020 11:35 PM
>>>> [...]
>>>>>>>> + *
>>>>>>>> + * returns: 0 on success, -errno on failure.
>>>>>>>> + */
>>>>>>>> +struct vfio_iommu_type1_cache_invalidate {
>>>>>>>> +	__u32   argsz;
>>>>>>>> +	__u32   flags;
>>>>>>>> +	struct	iommu_cache_invalidate_info cache_info;
>>>>>>>> +};
>>>>>>>> +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
>>>>>>>
>>>>>>> The future extension capabilities of this ioctl worry me, I wonder if
>>>>>>> we should do another data[] with flag defining that data as CACHE_INFO.
>>>>>>
>>>>>> Can you elaborate? Does it mean with this way we don't rely on iommu
>>>>>> driver to provide version_to_size conversion and instead we just pass
>>>>>> data[] to iommu driver for further audit?
>>>>>
>>>>> No, my concern is that this ioctl has a single function, strictly tied
>>>>> to the iommu uapi.  If we replace cache_info with data[] then we can
>>>>> define a flag to specify that data[] is struct
>>>>> iommu_cache_invalidate_info, and if we need to, a different flag to
>>>>> identify data[] as something else.  For example if we get stuck
>>>>> expanding cache_info to meet new demands and develop a new uapi to
>>>>> solve that, how would we expand this ioctl to support it rather than
>>>>> also create a new ioctl?  There's also a trade-off in making the ioctl
>>>>> usage more difficult for the user.  I'd still expect the vfio layer to
>>>>> check the flag and interpret data[] as indicated by the flag rather
>>>>> than just passing a blob of opaque data to the iommu layer though.
>>>>> Thanks,
>>>>
>>>> Based on your comments about defining a single ioctl and a unified
>>>> vfio structure (with a @data[] field) for pasid_alloc/free, bind/
>>>> unbind_gpasid, cache_inv. After some offline trying, I think it would
>>>> be good for bind/unbind_gpasid and cache_inv as both of them use the
>>>> iommu uapi definition. While the pasid alloc/free operation doesn't.
>>>> It would be weird to put all of them together. So pasid alloc/free
>>>> may have a separate ioctl. It would look as below. Does this direction
>>>> look good per your opinion?
>>>>
>>>> ioctl #22: VFIO_IOMMU_PASID_REQUEST
>>>> /**
>>>>   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
>>>>   *         specify a pasid to be freed when flags == FREE_PASID
>>>>   * @range: specify the allocation range when flags == ALLOC_PASID
>>>>   */
>>>> struct vfio_iommu_pasid_request {
>>>> 	__u32	argsz;
>>>> #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
>>>> #define VFIO_IOMMU_FREE_PASID	(1 << 1)
>>>> 	__u32	flags;
>>>> 	__u32	pasid;
>>>> 	struct {
>>>> 		__u32	min;
>>>> 		__u32	max;
>>>> 	} range;
>>>> };
>>>>
>>>> ioctl #23: VFIO_IOMMU_NESTING_OP
>>>> struct vfio_iommu_type1_nesting_op {
>>>> 	__u32	argsz;
>>>> 	__u32	flags;
>>>> 	__u32	op;
>>>> 	__u8	data[];
>>>> };
>>>>
>>>> /* Nesting Ops */
>>>> #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
>>>> #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
>>>> #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
>>>>
>>>
>>> Then why cannot we just put PASID into the header since the
>>> majority of nested usage is associated with a pasid?
>>>
>>> ioctl #23: VFIO_IOMMU_NESTING_OP
>>> struct vfio_iommu_type1_nesting_op {
>>> 	__u32	argsz;
>>> 	__u32	flags;
>>> 	__u32	op;
>>> 	__u32   pasid;
>>> 	__u8	data[];
>>> };
>>>
>>> In case of SMMUv2 which supports nested w/o PASID, this field can
>>> be ignored for that specific case.
>> On my side I would prefer keeping the pasid in the data[]. This is not
>> always used.
>>
>> For instance, in iommu_cache_invalidate_info/iommu_inv_pasid_info we
>> devised flags to tell whether the PASID is used.
>>
> 
> But don't we include a PASID in both invalidate structures already?
The pasid presence is indicated by the IOMMU_INV_ADDR_FLAGS_PASID flag.

For instance, for nested-stage SMMUv3 I currently perform only an ARCHID
(ASID) based invalidation.
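
As an illustration of such a request, the sketch below fills an address
invalidation with only the ARCHID flag set, so the pasid field is present
but unused. The flag values and field layout mirror the structure quoted
below; the struct name inv_addr_info, the helper and the use of
uint32_t/uint64_t in place of __u32/__u64 are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IOMMU_INV_ADDR_FLAGS_PASID	(1 << 0)
#define IOMMU_INV_ADDR_FLAGS_ARCHID	(1 << 1)
#define IOMMU_INV_ADDR_FLAGS_LEAF	(1 << 2)

/* Local stand-in for struct iommu_inv_addr_info. */
struct inv_addr_info {
	uint32_t flags;
	uint32_t archid;
	uint64_t pasid;
	uint64_t addr;
	uint64_t granule_size;
	uint64_t nb_granules;
};

/* Hypothetical helper: build an ASID-tagged range invalidation, no PASID. */
static void fill_asid_invalidation(struct inv_addr_info *info, uint32_t asid,
				   uint64_t addr, uint64_t granule_size,
				   uint64_t nb_granules)
{
	assert(info);

	memset(info, 0, sizeof(*info));
	info->flags = IOMMU_INV_ADDR_FLAGS_ARCHID;	/* PASID flag left clear */
	info->archid = asid;
	info->addr = addr;
	info->granule_size = granule_size;
	info->nb_granules = nb_granules;
}
```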

Eric
> 
> struct iommu_inv_addr_info {
> #define IOMMU_INV_ADDR_FLAGS_PASID      (1 << 0)
> #define IOMMU_INV_ADDR_FLAGS_ARCHID     (1 << 1)
> #define IOMMU_INV_ADDR_FLAGS_LEAF       (1 << 2)
>         __u32   flags;
>         __u32   archid;
>         __u64   pasid;
>         __u64   addr;
>         __u64   granule_size;
>         __u64   nb_granules;
> };
> 
> struct iommu_inv_pasid_info {
> #define IOMMU_INV_PASID_FLAGS_PASID     (1 << 0)
> #define IOMMU_INV_PASID_FLAGS_ARCHID    (1 << 1)
>         __u32   flags;
>         __u32   archid;
>         __u64   pasid;
> };
> 
> then consolidating the pasid field into generic header doesn't
> hurt. the specific handler still rely on flags to tell whether it
> is used?
> 
> Thanks
> Kevin
> 


^ permalink raw reply	[flat|nested] 110+ messages in thread

* RE: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
  2020-04-16 14:40           ` Alex Williamson
  2020-04-16 14:48             ` Alex Williamson
@ 2020-04-17  6:03             ` Liu, Yi L
  1 sibling, 0 replies; 110+ messages in thread
From: Liu, Yi L @ 2020-04-17  6:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, eric.auger, jacob.jun.pan, joro, Raj, Ashok, Tian,
	Jun J, Sun, Yi Y, jean-philippe, peterx, iommu, kvm,
	linux-kernel, Wu, Hao

Hi Alex,
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 16, 2020 10:41 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> Subject: Re: [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE
> 
> On Thu, 16 Apr 2020 10:40:03 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> > Still have a direction question with you. Better get agreement with you
> > before heading forward.
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 3, 2020 11:35 PM
> > [...]
> > > > > > + *
> > > > > > + * returns: 0 on success, -errno on failure.
> > > > > > + */
> > > > > > +struct vfio_iommu_type1_cache_invalidate {
> > > > > > +	__u32   argsz;
> > > > > > +	__u32   flags;
> > > > > > +	struct	iommu_cache_invalidate_info cache_info;
> > > > > > +};
> > > > > > > +#define VFIO_IOMMU_CACHE_INVALIDATE      _IO(VFIO_TYPE, VFIO_BASE + 24)
> > > > >
> > > > > The future extension capabilities of this ioctl worry me, I wonder if
> > > > > we should do another data[] with flag defining that data as CACHE_INFO.
> > > >
> > > > Can you elaborate? Does it mean with this way we don't rely on iommu
> > > > driver to provide version_to_size conversion and instead we just pass
> > > > data[] to iommu driver for further audit?
> > >
> > > No, my concern is that this ioctl has a single function, strictly tied
> > > to the iommu uapi.  If we replace cache_info with data[] then we can
> > > define a flag to specify that data[] is struct
> > > iommu_cache_invalidate_info, and if we need to, a different flag to
> > > identify data[] as something else.  For example if we get stuck
> > > expanding cache_info to meet new demands and develop a new uapi to
> > > solve that, how would we expand this ioctl to support it rather than
> > > also create a new ioctl?  There's also a trade-off in making the ioctl
> > > usage more difficult for the user.  I'd still expect the vfio layer to
> > > check the flag and interpret data[] as indicated by the flag rather
> > > than just passing a blob of opaque data to the iommu layer though.
> > > Thanks,
> >
> > Based on your comments about defining a single ioctl and a unified
> > vfio structure (with a @data[] field) for pasid_alloc/free, bind/
> > unbind_gpasid, cache_inv. After some offline trying, I think it would
> > be good for bind/unbind_gpasid and cache_inv as both of them use the
> > iommu uapi definition. While the pasid alloc/free operation doesn't.
> > It would be weird to put all of them together. So pasid alloc/free
> > may have a separate ioctl. It would look as below. Does this direction
> > look good per your opinion?
> >
> > ioctl #22: VFIO_IOMMU_PASID_REQUEST
> > /**
> >   * @pasid: used to return the pasid alloc result when flags == ALLOC_PASID
> >   *         specify a pasid to be freed when flags == FREE_PASID
> >   * @range: specify the allocation range when flags == ALLOC_PASID
> >   */
> > struct vfio_iommu_pasid_request {
> > 	__u32	argsz;
> > #define VFIO_IOMMU_ALLOC_PASID	(1 << 0)
> > #define VFIO_IOMMU_FREE_PASID	(1 << 1)
> > 	__u32	flags;
> > 	__u32	pasid;
> > 	struct {
> > 		__u32	min;
> > 		__u32	max;
> > 	} range;
> > };
> 
> Can't the ioctl return the pasid valid on alloc (like GET_DEVICE_FD)?

Yep, I think you mentioned this before. At that time, I believed it would
be better to return the result via a __u32 buffer so as to make full use
of the 32 bits. But it looks like it doesn't make much difference. I'll
follow your suggestion.

> Would it be useful to support freeing a range of pasids?  If so then we
> could simply use range for both, ie. allocate a pasid from this range
> and return it, or free all pasids in this range?  vfio already needs to
> track pasids to free them on release, so presumably this is something
> we could support easily.

Yes, I think that would be nice. Then I can remove the @pasid field.
Will do.
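
A rough sketch of what the reworked request could look like, with @pasid
dropped and @range used by both operations. This is a toy model only: the
names, the fixed-size tracking array and the -1 error return are
placeholders (a real ioctl would return -errno), and uint32_t stands in for
__u32.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_REQ_ALLOC	(1u << 0)
#define PASID_REQ_FREE	(1u << 1)

/* Hypothetical reworked request: no @pasid field, range serves both ops. */
struct pasid_request {
	uint32_t argsz;
	uint32_t flags;
	struct {
		uint32_t min;
		uint32_t max;
	} range;
};

#define TOY_MAX_PASID	64
static uint8_t toy_used[TOY_MAX_PASID];	/* toy stand-in for PASID tracking */

/*
 * ALLOC: returns the first free PASID in [min, max], or -1 if none.
 * FREE:  releases every PASID in [min, max] and returns 0.
 * Any other flags value, or an invalid range, returns -1.
 */
static int toy_pasid_request(const struct pasid_request *req)
{
	uint32_t p;

	assert(req);

	if (req->range.min > req->range.max || req->range.max >= TOY_MAX_PASID)
		return -1;

	switch (req->flags) {
	case PASID_REQ_ALLOC:
		for (p = req->range.min; p <= req->range.max; p++) {
			if (!toy_used[p]) {
				toy_used[p] = 1;
				return (int)p;	/* result comes back as the return value */
			}
		}
		return -1;
	case PASID_REQ_FREE:
		for (p = req->range.min; p <= req->range.max; p++)
			toy_used[p] = 0;
		return 0;
	default:
		return -1;
	}
}
```

Since a PASID is at most 20 bits wide, it fits in a non-negative int return
value without clashing with negative error codes.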

> > ioctl #23: VFIO_IOMMU_NESTING_OP
> > struct vfio_iommu_type1_nesting_op {
> > 	__u32	argsz;
> > 	__u32	flags;
> > 	__u32	op;
> > 	__u8	data[];
> > };
> 
> data only has 4-byte alignment, I think we really want it at an 8-byte
> alignment.  This is why I embedded the "op" into the flag for
> DEVICE_FEATURE.  Thanks,

Got it. I may also merge the op into flags (maybe using the lower 16 bits
for op).
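
For example, the split could look like the sketch below. The mask, shift and
helper names are assumptions for illustration, not a settled uapi; only the
three op values come from the proposal above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: low 16 bits select the op, high 16 bits stay flags. */
#define NESTING_OP_MASK		0xffffu
#define NESTING_FLAG_SHIFT	16

#define NESTING_OP_BIND_PGTBL	0
#define NESTING_OP_UNBIND_PGTBL	1
#define NESTING_OP_CACHE_INVLD	2

/* Combine an op and up to 16 flag bits into one 32-bit flags word. */
static uint32_t nesting_pack(uint32_t op, uint32_t flag_bits)
{
	assert(op <= NESTING_OP_MASK);
	assert(flag_bits <= 0xffffu);

	return (flag_bits << NESTING_FLAG_SHIFT) | op;
}

/* Extract the op from a packed flags word. */
static uint32_t nesting_op(uint32_t flags)
{
	return flags & NESTING_OP_MASK;
}

/* Extract the remaining flag bits from a packed flags word. */
static uint32_t nesting_flag_bits(uint32_t flags)
{
	return flags >> NESTING_FLAG_SHIFT;
}
```

This keeps the header at two 32-bit fields, so data[] would start at
offset 8.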

Thanks,
Yi Liu
> Alex
> 
> >
> > /* Nesting Ops */
> > #define VFIO_IOMMU_NESTING_OP_BIND_PGTBL        0
> > #define VFIO_IOMMU_NESTING_OP_UNBIND_PGTBL      1
> > #define VFIO_IOMMU_NESTING_OP_CACHE_INVLD       2
> >
> > Thanks,
> > Yi Liu
> >



end of thread, other threads:[~2020-04-17  6:03 UTC | newest]

Thread overview: 110+ messages
2020-03-22 12:31 [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
2020-03-22 12:31 ` [PATCH v1 1/8] vfio: Add VFIO_IOMMU_PASID_REQUEST(alloc/free) Liu, Yi L
2020-03-22 16:21   ` kbuild test robot
2020-03-30  8:32   ` Tian, Kevin
2020-03-30 14:36     ` Liu, Yi L
2020-03-31  5:40       ` Tian, Kevin
2020-03-31 13:22         ` Liu, Yi L
2020-04-01  5:43           ` Tian, Kevin
2020-04-01  5:48             ` Liu, Yi L
2020-03-31  7:53   ` Christoph Hellwig
2020-03-31  8:17     ` Liu, Yi L
2020-03-31  8:32     ` Liu, Yi L
2020-03-31  8:36       ` Liu, Yi L
2020-03-31  9:15         ` Christoph Hellwig
2020-04-02 13:52   ` Jean-Philippe Brucker
2020-04-03 11:56     ` Liu, Yi L
2020-04-03 12:39       ` Jean-Philippe Brucker
2020-04-03 12:44         ` Liu, Yi L
2020-04-02 17:50   ` Alex Williamson
2020-04-03  5:58     ` Tian, Kevin
2020-04-03 15:14       ` Alex Williamson
2020-04-07  4:42         ` Tian, Kevin
2020-04-07 15:14           ` Alex Williamson
2020-04-03 13:12     ` Liu, Yi L
2020-04-03 17:50       ` Alex Williamson
2020-04-07  4:52         ` Tian, Kevin
2020-04-08  0:52         ` Liu, Yi L
2020-03-22 12:31 ` [PATCH v1 2/8] vfio/type1: Add vfio_iommu_type1 parameter for quota tuning Liu, Yi L
2020-03-22 17:20   ` kbuild test robot
2020-03-30  8:40   ` Tian, Kevin
2020-03-30  8:52     ` Liu, Yi L
2020-03-30  9:19       ` Tian, Kevin
2020-03-30  9:26         ` Liu, Yi L
2020-03-30 11:44           ` Tian, Kevin
2020-04-02 17:58             ` Alex Williamson
2020-04-03  8:15               ` Liu, Yi L
2020-03-22 12:32 ` [PATCH v1 3/8] vfio/type1: Report PASID alloc/free support to userspace Liu, Yi L
2020-03-30  9:43   ` Tian, Kevin
2020-04-01  7:46     ` Liu, Yi L
2020-04-01  9:41   ` Auger Eric
2020-04-01 13:13     ` Liu, Yi L
2020-04-02 18:01   ` Alex Williamson
2020-04-03  8:17     ` Liu, Yi L
2020-04-03 17:28       ` Alex Williamson
2020-04-04 11:36         ` Liu, Yi L
2020-03-22 12:32 ` [PATCH v1 4/8] vfio: Check nesting iommu uAPI version Liu, Yi L
2020-03-22 18:30   ` kbuild test robot
2020-03-22 12:32 ` [PATCH v1 5/8] vfio/type1: Report 1st-level/stage-1 format to userspace Liu, Yi L
2020-03-22 16:44   ` kbuild test robot
2020-03-30 11:48   ` Tian, Kevin
2020-04-01  7:38     ` Liu, Yi L
2020-04-01  7:56       ` Tian, Kevin
2020-04-01  8:06         ` Liu, Yi L
2020-04-01  8:08           ` Tian, Kevin
2020-04-01  8:09             ` Liu, Yi L
2020-04-01  8:51   ` Auger Eric
2020-04-01 12:51     ` Liu, Yi L
2020-04-01 13:01       ` Auger Eric
2020-04-03  8:23         ` Jean-Philippe Brucker
2020-04-07  9:43           ` Liu, Yi L
2020-04-08  1:02             ` Liu, Yi L
2020-04-08 10:27             ` Auger Eric
2020-04-09  8:14               ` Jean-Philippe Brucker
2020-04-09  9:01                 ` Auger Eric
2020-04-09 12:47                 ` Liu, Yi L
2020-04-10  3:28                   ` Auger Eric
2020-04-10  3:48                     ` Liu, Yi L
2020-04-10 12:30                   ` Liu, Yi L
2020-04-02 19:20   ` Alex Williamson
2020-04-03 11:59     ` Liu, Yi L
2020-03-22 12:32 ` [PATCH v1 6/8] vfio/type1: Bind guest page tables to host Liu, Yi L
2020-03-22 18:10   ` kbuild test robot
2020-03-30 12:46   ` Tian, Kevin
2020-04-01  9:13     ` Liu, Yi L
2020-04-02  2:12       ` Tian, Kevin
2020-04-02  8:05         ` Liu, Yi L
2020-04-03  8:34           ` Jean-Philippe Brucker
2020-04-07 10:33             ` Liu, Yi L
2020-04-09  8:28               ` Jean-Philippe Brucker
2020-04-09  9:15                 ` Liu, Yi L
2020-04-09  9:38                   ` Jean-Philippe Brucker
2020-04-02 19:57   ` Alex Williamson
2020-04-03 13:30     ` Liu, Yi L
2020-04-03 18:11       ` Alex Williamson
2020-04-04 10:28         ` Liu, Yi L
2020-04-11  5:52     ` Liu, Yi L
2020-03-22 12:32 ` [PATCH v1 7/8] vfio/type1: Add VFIO_IOMMU_CACHE_INVALIDATE Liu, Yi L
2020-03-30 12:58   ` Tian, Kevin
2020-04-01  7:49     ` Liu, Yi L
2020-03-31  7:56   ` Christoph Hellwig
2020-03-31 10:48     ` Liu, Yi L
2020-04-02 20:24   ` Alex Williamson
2020-04-03  6:39     ` Tian, Kevin
2020-04-03 15:31       ` Jacob Pan
2020-04-03 15:34       ` Alex Williamson
2020-04-08  2:28         ` Liu, Yi L
2020-04-16 10:40         ` Liu, Yi L
2020-04-16 12:09           ` Tian, Kevin
2020-04-16 12:42             ` Auger Eric
2020-04-16 13:28               ` Tian, Kevin
2020-04-16 15:12                 ` Auger Eric
2020-04-16 14:40           ` Alex Williamson
2020-04-16 14:48             ` Alex Williamson
2020-04-17  6:03             ` Liu, Yi L
2020-03-22 12:32 ` [PATCH v1 8/8] vfio/type1: Add vSVA support for IOMMU-backed mdevs Liu, Yi L
2020-03-30 13:18   ` Tian, Kevin
2020-04-01  7:51     ` Liu, Yi L
2020-04-02 20:33   ` Alex Williamson
2020-04-03 13:39     ` Liu, Yi L
2020-03-26 12:56 ` [PATCH v1 0/8] vfio: expose virtual Shared Virtual Addressing to VMs Liu, Yi L
