iommu.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/4] vfio-pci support pasid attach/detach
@ 2024-04-12  8:21 Yi Liu
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
                   ` (4 more replies)
  0 siblings, 5 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

This adds the pasid attach/detach uAPIs for userspace to attach/detach
a PASID of a device to/from a given ioas/hwpt. Only vfio-pci driver is
enabled in this series. After this series, PASID-capable devices bound
with vfio-pci can report PASID capability to userspace and VM to enable
PASID usages like Shared Virtual Addressing (SVA).

This series first adds the helpers for pasid attach in vfio core and then
adds the device cdev ioctls for pasid attach/detach, finally exposing the
device PASID capability to the user. It depends on the iommufd pasid
attach/detach series [1].

A userspace VMM is supposed to get the details of the device's PASID capability
and assemble a virtual PASID capability in a proper offset in the virtual PCI
configuration space. While it is still an open on how to get the available
offsets. Devices may have hidden bits that are not in the PCI cap chain. For
now, there are two options to get the available offsets.[2]

- Report the available offsets via ioctl. This requires device-specific logic
  to provide available offsets. e.g., vfio-pci variant driver. Or may the device
  provide the available offset by DVSEC.
- Store the available offsets in a static table in userspace VMM. VMM gets the
  empty offsets from this table.

Since there was no clear answer to it, I have not made much change on this in
this version. Wish we could have more discussions on this.

The completed code can be found at [3], tested with a hacky Qemu branch [4].

[1] https://lore.kernel.org/linux-iommu/20240412081516.31168-1-yi.l.liu@intel.com/
[2] https://lore.kernel.org/kvm/b3e07591-8ebc-4924-85fe-29a46fc73d78@intel.com/
[3] https://github.com/yiliu1765/iommufd/tree/iommufd_pasid
[4] https://github.com/yiliu1765/qemu/tree/wip/zhenzhong/iommufd_nesting_rfcv2-test-pasid

Change log:

v2:
 - Use IDA to track if PASID is attached or not in VFIO. (Jason)
 - Fix the issue of calling pasid_at[de]tach_ioas callback unconditionally (Alex)
 - Fix the wrong data copy in vfio_df_ioctl_pasid_detach_pt() (Zhenzhong)
 - Minor tweaks in comments (Kevin)

v1: https://lore.kernel.org/kvm/20231127063909.129153-1-yi.l.liu@intel.com/
 - Report PASID capability via VFIO_DEVICE_FEATURE (Alex)

rfc: https://lore.kernel.org/linux-iommu/20230926093121.18676-1-yi.l.liu@intel.com/

Regards,
        Yi Liu

Yi Liu (4):
  ida: Add ida_get_lowest()
  vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl

 drivers/vfio/device_cdev.c       | 51 +++++++++++++++++++++++
 drivers/vfio/iommufd.c           | 60 +++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c      |  2 +
 drivers/vfio/pci/vfio_pci_core.c | 50 +++++++++++++++++++++++
 drivers/vfio/vfio.h              |  4 ++
 drivers/vfio/vfio_main.c         |  8 ++++
 include/linux/idr.h              |  1 +
 include/linux/vfio.h             | 11 +++++
 include/uapi/linux/vfio.h        | 69 ++++++++++++++++++++++++++++++++
 lib/idr.c                        | 67 +++++++++++++++++++++++++++++++
 10 files changed, 323 insertions(+)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16 16:03   ` Alex Williamson
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

There is no helpers for user to check if a given ID is allocated or not,
neither a helper to loop all the allocated IDs in an IDA and do something
for cleanup. With the two needs, a helper to get the lowest allocated ID
of a range can help to achieve it.

Caller can check if a given ID is allocated or not by:
	int id = 200, rc;

	rc = ida_get_lowest(&ida, id, id);
	if (rc == id)
		//id 200 is used
	else
		//id 200 is not used

Caller can iterate all allocated IDs by:
	int id = 0;

	while (!ida_is_empty(&pasid_ida)) {
		id = ida_get_lowest(pasid_ida, id, INT_MAX);
		if (id < 0)
			break;
		//anything to do with the allocated ID
		ida_free(pasid_ida, pasid);
	}

Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/idr.h |  1 +
 lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index da5f5fa4a3a6..1dae71d4a75d 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -257,6 +257,7 @@ struct ida {
 int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
 void ida_free(struct ida *, unsigned int id);
 void ida_destroy(struct ida *ida);
+int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
 
 /**
  * ida_alloc() - Allocate an unused ID.
diff --git a/lib/idr.c b/lib/idr.c
index da36054c3ca0..03e461242fe2 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
 }
 EXPORT_SYMBOL(ida_alloc_range);
 
+/**
+ * ida_get_lowest - Get the lowest used ID.
+ * @ida: IDA handle.
+ * @min: Lowest ID to get.
+ * @max: Highest ID to get.
+ *
+ * Get the lowest used ID between @min and @max, inclusive.  The returned
+ * ID will not exceed %INT_MAX, even if @max is larger.
+ *
+ * Context: Any context. Takes and releases the xa_lock.
+ * Return: The lowest used ID, or errno if no used ID is found.
+ */
+int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
+{
+	unsigned long index = min / IDA_BITMAP_BITS;
+	unsigned int offset = min % IDA_BITMAP_BITS;
+	unsigned long *addr, size, bit;
+	unsigned long flags;
+	void *entry;
+	int ret;
+
+	if (min >= INT_MAX)
+		return -EINVAL;
+	if (max >= INT_MAX)
+		max = INT_MAX;
+
+	xa_lock_irqsave(&ida->xa, flags);
+
+	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
+	if (!entry) {
+		ret = -ENOTTY;
+		goto err_unlock;
+	}
+
+	if (index > min / IDA_BITMAP_BITS)
+		offset = 0;
+	if (index * IDA_BITMAP_BITS + offset > max) {
+		ret = -ENOTTY;
+		goto err_unlock;
+	}
+
+	if (xa_is_value(entry)) {
+		unsigned long tmp = xa_to_value(entry);
+
+		addr = &tmp;
+		size = BITS_PER_XA_VALUE;
+	} else {
+		addr = ((struct ida_bitmap *)entry)->bitmap;
+		size = IDA_BITMAP_BITS;
+	}
+
+	bit = find_next_bit(addr, size, offset);
+
+	xa_unlock_irqrestore(&ida->xa, flags);
+
+	if (bit == size ||
+	    index * IDA_BITMAP_BITS + bit > max)
+		return -ENOTTY;
+
+	return index * IDA_BITMAP_BITS + bit;
+
+err_unlock:
+	xa_unlock_irqrestore(&ida->xa, flags);
+	return ret;
+}
+EXPORT_SYMBOL(ida_get_lowest);
+
 /**
  * ida_free() - Release an allocated ID.
  * @ida: IDA handle.
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16  9:01   ` Tian, Kevin
  2024-04-23 12:43   ` Jason Gunthorpe
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

This adds pasid_at|de]tach_ioas ops for attaching hwpt to pasid of a
device and the helpers for it. For now, only vfio-pci supports pasid
attach/detach.

Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/iommufd.c      | 60 +++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci.c |  2 ++
 include/linux/vfio.h        | 11 +++++++
 3 files changed, 73 insertions(+)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 82eba6966fa5..fc533416c75d 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -119,14 +119,26 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 	if (IS_ERR(idev))
 		return PTR_ERR(idev);
 	vdev->iommufd_device = idev;
+	ida_init(&vdev->pasids);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_bind);
 
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
 {
+	int pasid = 0;
+
 	lockdep_assert_held(&vdev->dev_set->lock);
 
+	while (!ida_is_empty(&vdev->pasids)) {
+		pasid = ida_get_lowest(&vdev->pasids, pasid, INT_MAX);
+		if (pasid < 0)
+			break;
+
+		iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+		ida_free(&vdev->pasids, pasid);
+	}
+
 	if (vdev->iommufd_attached) {
 		iommufd_device_detach(vdev->iommufd_device);
 		vdev->iommufd_attached = false;
@@ -168,6 +180,54 @@ void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_detach_ioas);
 
+int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
+					    u32 pasid, u32 *pt_id)
+{
+	int rc;
+
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_device))
+		return -EINVAL;
+
+	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
+	if (rc == pasid)
+		return iommufd_device_pasid_replace(vdev->iommufd_device,
+						    pasid, pt_id);
+
+	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid, pt_id);
+	if (rc)
+		return rc;
+
+	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
+	if (rc < 0) {
+		iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+		return rc;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_attach_ioas);
+
+void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev,
+					     u32 pasid)
+{
+	int rc;
+
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_device))
+		return;
+
+	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
+	if (rc < 0)
+		return;
+
+	iommufd_device_pasid_detach(vdev->iommufd_device, pasid);
+	ida_free(&vdev->pasids, pasid);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_pasid_detach_ioas);
+
 /*
  * The emulated standard ops mean that vfio_device is going to use the
  * "mdev path" and will call vfio_pin_pages()/vfio_dma_rw(). Drivers using this
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index cb5b7f865d58..e0198851ffd2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.unbind_iommufd	= vfio_iommufd_physical_unbind,
 	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
 	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
+	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
+	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 8b1a29820409..8fd1db173e84 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -66,6 +66,7 @@ struct vfio_device {
 	void (*put_kvm)(struct kvm *kvm);
 #if IS_ENABLED(CONFIG_IOMMUFD)
 	struct iommufd_device *iommufd_device;
+	struct ida pasids;
 	u8 iommufd_attached:1;
 #endif
 	u8 cdev_opened:1;
@@ -90,6 +91,8 @@ struct vfio_device {
  *		 bound iommufd. Undo in unbind_iommufd if @detach_ioas is not
  *		 called.
  * @detach_ioas: Opposite of attach_ioas
+ * @pasid_attach_ioas: The pasid variation of attach_ioas
+ * @pasid_detach_ioas: Opposite of pasid_attach_ioas
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
@@ -114,6 +117,8 @@ struct vfio_device_ops {
 	void	(*unbind_iommufd)(struct vfio_device *vdev);
 	int	(*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
 	void	(*detach_ioas)(struct vfio_device *vdev);
+	int	(*pasid_attach_ioas)(struct vfio_device *vdev, u32 pasid, u32 *pt_id);
+	void	(*pasid_detach_ioas)(struct vfio_device *vdev, u32 pasid);
 	int	(*open_device)(struct vfio_device *vdev);
 	void	(*close_device)(struct vfio_device *vdev);
 	ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
@@ -138,6 +143,8 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
 int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev);
+int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev, u32 pasid, u32 *pt_id);
+void vfio_iommufd_physical_pasid_detach_ioas(struct vfio_device *vdev, u32 pasid);
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
@@ -165,6 +172,10 @@ vfio_iommufd_get_dev_id(struct vfio_device *vdev, struct iommufd_ctx *ictx)
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
 #define vfio_iommufd_physical_detach_ioas \
 	((void (*)(struct vfio_device *vdev)) NULL)
+#define vfio_iommufd_physical_pasid_attach_ioas \
+	((int (*)(struct vfio_device *vdev, u32 pasid, u32 *pt_id)) NULL)
+#define vfio_iommufd_physical_pasid_detach_ioas \
+	((void (*)(struct vfio_device *vdev, u32 pasid)) NULL)
 #define vfio_iommufd_emulated_bind                                      \
 	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
 		  u32 *out_device_id)) NULL)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16  9:13   ` Tian, Kevin
  2024-04-23 12:45   ` Jason Gunthorpe
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
  2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
  4 siblings, 2 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

This adds ioctls for the userspace to attach/detach a given pasid of a
vfio device to/from an IOAS/HWPT.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h        |  4 +++
 drivers/vfio/vfio_main.c   |  8 ++++++
 include/uapi/linux/vfio.h  | 55 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 118 insertions(+)

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index e75da0a70d1f..5326f1608ace 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -210,6 +210,57 @@ int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
 	return 0;
 }
 
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_attach_iommufd_pt __user *arg)
+{
+	struct vfio_device_pasid_attach_iommufd_pt attach;
+	struct vfio_device *device = df->device;
+	unsigned long minsz;
+	int ret;
+
+	minsz = offsetofend(struct vfio_device_pasid_attach_iommufd_pt, pt_id);
+
+	if (copy_from_user(&attach, arg, minsz))
+		return -EFAULT;
+
+	if (attach.argsz < minsz || attach.flags)
+		return -EINVAL;
+
+	if (!device->ops->pasid_attach_ioas)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&device->dev_set->lock);
+	ret = device->ops->pasid_attach_ioas(device, attach.pasid, &attach.pt_id);
+	mutex_unlock(&device->dev_set->lock);
+
+	return ret;
+}
+
+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_detach_iommufd_pt __user *arg)
+{
+	struct vfio_device_pasid_detach_iommufd_pt detach;
+	struct vfio_device *device = df->device;
+	unsigned long minsz;
+
+	minsz = offsetofend(struct vfio_device_pasid_detach_iommufd_pt, pasid);
+
+	if (copy_from_user(&detach, arg, minsz))
+		return -EFAULT;
+
+	if (detach.argsz < minsz || detach.flags)
+		return -EINVAL;
+
+	if (!device->ops->pasid_detach_ioas)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&device->dev_set->lock);
+	device->ops->pasid_detach_ioas(device, detach.pasid);
+	mutex_unlock(&device->dev_set->lock);
+
+	return 0;
+}
+
 static char *vfio_device_devnode(const struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 50128da18bca..20d3cb283ba0 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -353,6 +353,10 @@ int vfio_df_ioctl_attach_pt(struct vfio_device_file *df,
 			    struct vfio_device_attach_iommufd_pt __user *arg);
 int vfio_df_ioctl_detach_pt(struct vfio_device_file *df,
 			    struct vfio_device_detach_iommufd_pt __user *arg);
+int vfio_df_ioctl_pasid_attach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_attach_iommufd_pt __user *arg);
+int vfio_df_ioctl_pasid_detach_pt(struct vfio_device_file *df,
+				  struct vfio_device_pasid_detach_iommufd_pt __user *arg);
 
 #if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
 void vfio_init_device_cdev(struct vfio_device *device);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index e97d796a54fb..a5cece9fff5e 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1242,6 +1242,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 		case VFIO_DEVICE_DETACH_IOMMUFD_PT:
 			ret = vfio_df_ioctl_detach_pt(df, uptr);
 			goto out;
+
+		case VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT:
+			ret = vfio_df_ioctl_pasid_attach_pt(df, uptr);
+			goto out;
+
+		case VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT:
+			ret = vfio_df_ioctl_pasid_detach_pt(df, uptr);
+			goto out;
 		}
 	}
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..9591dc24b75c 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -977,6 +977,61 @@ struct vfio_device_detach_iommufd_pt {
 
 #define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)
 
+/*
+ * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *					      struct vfio_device_pasid_attach_iommufd_pt)
+ * @argsz:	User filled size of this data.
+ * @flags:	Must be 0.
+ * @pasid:	The pasid to be attached.
+ * @pt_id:	Input the target id which can represent an ioas or a hwpt
+ *		allocated via iommufd subsystem.
+ *		Output the input ioas id or the attached hwpt id which could
+ *		be the specified hwpt itself or a hwpt automatically created
+ *		for the specified ioas by kernel during the attachment.
+ *
+ * Associate a pasid (of a cdev device) with an address space within the
+ * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT or device fd
+ * close. This is only allowed on cdev fds.
+ *
+ * If a pasid is currently attached to a valid hw_pagetable (hwpt), without
+ * doing a VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT, a second
+ * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT ioctl passing in another hwpt id is
+ * allowed. This action, also known as a hwpt replacement, will replace the
+ * pasid's currently attached hwpt with a new hwpt corresponding to the given
+ * @pt_id.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_pasid_attach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+	__u32	pasid;
+	__u32	pt_id;
+};
+
+#define VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT	_IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/*
+ * VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 22,
+ *					      struct vfio_device_pasid_detach_iommufd_pt)
+ * @argsz:	User filled size of this data.
+ * @flags:	Must be 0.
+ * @pasid:	The pasid to be detached.
+ *
+ * Remove the association of a pasid (of a cdev device) and its current
+ * associated address space.  After it, the pasid of the device should be
+ * in a blocking DMA state.  This is only allowed on cdev fds.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_pasid_detach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+	__u32	pasid;
+};
+
+#define VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT	_IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /*
  * Provide support for setting a PCI VF Token, which is used as a shared
  * secret between PF and VF drivers.  This feature may only be set on a
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
                   ` (2 preceding siblings ...)
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
@ 2024-04-12  8:21 ` Yi Liu
  2024-04-16  9:40   ` Tian, Kevin
                     ` (2 more replies)
  2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
  4 siblings, 3 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-12  8:21 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	yi.l.liu, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
other PCI capabilities, PASID capability is going to be reported to user by
VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by it.
This is a bit different from the other capabilities which are reported to
userspace when the user reads the device's PCI configuration space. There
are two reasons for this.

 - First, userspace like Qemu by default exposes all the available PCI
   capabilities in vfio-pci config space to the guest as read-only, so
   adding PASID capability in the vfio-pci config space will make it
   exposed to the guest automatically while an old Qemu doesn't really
   support it.

 - Second, the PASID capability does not exist on VFs (instead shares the
   cap of the PF). Creating a virtual PASID capability in vfio-pci config
   space needs to find a hole to place it, but doing so may require device
   specific knowledge to avoid potential conflict with device specific
   registers like hidden bits in VF's config space. It's simpler to move
   this burden to the VMM instead of maintaining a quirk system in the kernel.

Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 14 +++++++++
 2 files changed, 64 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d94d61b92c1a..ca64e461d435 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1495,6 +1495,54 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 	return 0;
 }
 
+static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
+				       struct vfio_device_feature_pasid __user *arg,
+				       size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct vfio_device_feature_pasid pasid = { 0 };
+	struct pci_dev *pdev = vdev->pdev;
+	u32 capabilities = 0;
+	u16 ctrl = 0;
+	int ret;
+
+	/*
+	 * Due to no PASID capability per VF, to be consistent, we do not
+	 * support SET of the PASID capability for both PF and VF.
+	 */
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(pasid));
+	if (ret != 1)
+		return ret;
+
+	/* VF shares the PASID capability of its PF */
+	if (pdev->is_virtfn)
+		pdev = pci_physfn(pdev);
+
+	if (!pdev->pasid_enabled)
+		goto out;
+
+#ifdef CONFIG_PCI_PASID
+	pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
+			      &capabilities);
+	pci_read_config_word(pdev, pdev->pasid_cap + PCI_PASID_CTRL,
+			     &ctrl);
+#endif
+
+	pasid.width = (capabilities >> 8) & 0x1f;
+
+	if (ctrl & PCI_PASID_CTRL_EXEC)
+		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_EXEC;
+	if (ctrl & PCI_PASID_CTRL_PRIV)
+		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_PRIV;
+
+out:
+	if (copy_to_user(arg, &pasid, sizeof(pasid)))
+		return -EFAULT;
+	return 0;
+}
+
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
@@ -1508,6 +1556,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_PASID:
+		return vfio_pci_core_feature_pasid(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9591dc24b75c..e50e55c67ab4 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1513,6 +1513,20 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET, return the PASID capability for the device.
+ * Zero width means no support for PASID.
+ */
+struct vfio_device_feature_pasid {
+	__u16 capabilities;
+#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
+#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
+	__u8 width;
+	__u8 __reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_PASID 11
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
                   ` (3 preceding siblings ...)
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
@ 2024-04-16  8:38 ` Tian, Kevin
  2024-04-16 17:50   ` Jason Gunthorpe
  4 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-16  8:38 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> A userspace VMM is supposed to get the details of the device's PASID
> capability
> and assemble a virtual PASID capability in a proper offset in the virtual PCI
> configuration space. While it is still an open on how to get the available
> offsets. Devices may have hidden bits that are not in the PCI cap chain. For
> now, there are two options to get the available offsets.[2]
> 
> - Report the available offsets via ioctl. This requires device-specific logic
>   to provide available offsets. e.g., vfio-pci variant driver. Or may the device
>   provide the available offset by DVSEC.
> - Store the available offsets in a static table in userspace VMM. VMM gets the
>   empty offsets from this table.
> 

I'm not a fan of requesting a variant driver for every PASID-capable
VF just for the purpose of reporting a free range in the PCI config space.

It's easier to do that quirk in userspace.

But I like Alex's original comment that at least for PF there is no reason
to hide the offset. there could be a flag+field to communicate it. or 
if there will be a new variant VF driver for other purposes e.g. migration
it can certainly fill the field too.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
@ 2024-04-16  9:01   ` Tian, Kevin
  2024-04-16  9:24     ` Yi Liu
  2024-04-23 12:43   ` Jason Gunthorpe
  1 sibling, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:01 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun, Matthew Wilcox

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
>  void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
>  {
> +	int pasid = 0;
> +
>  	lockdep_assert_held(&vdev->dev_set->lock);
> 
> +	while (!ida_is_empty(&vdev->pasids)) {
> +		pasid = ida_get_lowest(&vdev->pasids, pasid, INT_MAX);
> +		if (pasid < 0)
> +			break;

WARN_ON as this shouldn't happen when ida is not empty.

> 
> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
> +					    u32 pasid, u32 *pt_id)

the name is too long. What about removing 'physical' as there is no
plan (unlikely) to support pasid on mdev?

> +{
> +	int rc;
> +
> +	lockdep_assert_held(&vdev->dev_set->lock);
> +
> +	if (WARN_ON(!vdev->iommufd_device))
> +		return -EINVAL;
> +
> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> +	if (rc == pasid)
> +		return iommufd_device_pasid_replace(vdev-
> >iommufd_device,
> +						    pasid, pt_id);
> +
> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
> pt_id);
> +	if (rc)
> +		return rc;
> +
> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
> +	if (rc < 0) {
> +		iommufd_device_pasid_detach(vdev->iommufd_device,
> pasid);
> +		return rc;
> +	}

I'd do simple operation (ida_alloc_range()) first before doing attach.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
@ 2024-04-16  9:13   ` Tian, Kevin
  2024-04-16  9:36     ` Yi Liu
  2024-04-23 12:45   ` Jason Gunthorpe
  1 sibling, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:13 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> +/*
> + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE,
> VFIO_BASE + 21,
> + *					      struct
> vfio_device_pasid_attach_iommufd_pt)
> + * @argsz:	User filled size of this data.
> + * @flags:	Must be 0.
> + * @pasid:	The pasid to be attached.
> + * @pt_id:	Input the target id which can represent an ioas or a hwpt
> + *		allocated via iommufd subsystem.
> + *		Output the input ioas id or the attached hwpt id which could
> + *		be the specified hwpt itself or a hwpt automatically created
> + *		for the specified ioas by kernel during the attachment.
> + *
> + * Associate a pasid (of a cdev device) with an address space within the

remove '(of a cdev device)' as end of the paragraph has "This is only
allowed on cdev fds". Also a pasid certainly belongs to device hence
just using pasid alone is clear.

> + * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT
> or device fd
> + * close. This is only allowed on cdev fds.
> + *
> + * If a pasid is currently attached to a valid hw_pagetable (hwpt), 

remove 'hw_pagetable'. the abbreviation "hwpt" has been used
throughout this file (e.g. even when explaining @pt_id in earlier place).


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-16  9:01   ` Tian, Kevin
@ 2024-04-16  9:24     ` Yi Liu
  2024-04-16  9:47       ` Tian, Kevin
  0 siblings, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-16  9:24 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun, Matthew Wilcox

On 2024/4/16 17:01, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, April 12, 2024 4:21 PM
>>
>>   void vfio_iommufd_physical_unbind(struct vfio_device *vdev)
>>   {
>> +	int pasid = 0;
>> +
>>   	lockdep_assert_held(&vdev->dev_set->lock);
>>
>> +	while (!ida_is_empty(&vdev->pasids)) {
>> +		pasid = ida_get_lowest(&vdev->pasids, pasid, INT_MAX);
>> +		if (pasid < 0)
>> +			break;
> 
> WARN_ON as this shouldn't happen when ida is not empty.

ok.

>>
>> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
>> +					    u32 pasid, u32 *pt_id)
> 
> the name is too long. What about removing 'physical' as there is no
> plan (unlikely) to support pasid on mdev?

I'm ok to do it.

>> +{
>> +	int rc;
>> +
>> +	lockdep_assert_held(&vdev->dev_set->lock);
>> +
>> +	if (WARN_ON(!vdev->iommufd_device))
>> +		return -EINVAL;
>> +
>> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
>> +	if (rc == pasid)
>> +		return iommufd_device_pasid_replace(vdev-
>>> iommufd_device,
>> +						    pasid, pt_id);
>> +
>> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
>> pt_id);
>> +	if (rc)
>> +		return rc;
>> +
>> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
>> +	if (rc < 0) {
>> +		iommufd_device_pasid_detach(vdev->iommufd_device,
>> pasid);
>> +		return rc;
>> +	}
> 
> I'd do simple operation (ida_alloc_range()) first before doing attach.
> 

But that means we rely on the ida_alloc_range() to return -ENOSPC to
indicate the pasid is allocated, hence this attach is actually a
replacement. This is easy to be broken if ida_alloc_range() returns
-ENOSPC for other reasons in future.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-16  9:13   ` Tian, Kevin
@ 2024-04-16  9:36     ` Yi Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-16  9:36 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/16 17:13, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Friday, April 12, 2024 4:21 PM
>>
>> +/*
>> + * VFIO_DEVICE_PASID_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE,
>> VFIO_BASE + 21,
>> + *					      struct
>> vfio_device_pasid_attach_iommufd_pt)
>> + * @argsz:	User filled size of this data.
>> + * @flags:	Must be 0.
>> + * @pasid:	The pasid to be attached.
>> + * @pt_id:	Input the target id which can represent an ioas or a hwpt
>> + *		allocated via iommufd subsystem.
>> + *		Output the input ioas id or the attached hwpt id which could
>> + *		be the specified hwpt itself or a hwpt automatically created
>> + *		for the specified ioas by kernel during the attachment.
>> + *
>> + * Associate a pasid (of a cdev device) with an address space within the
> 
> remove '(of a cdev device)' as end of the paragraph has "This is only
> allowed on cdev fds". Also a pasid certainly belongs to device hence
> just using pasid alone is clear.

ok

>> + * bound iommufd. Undo by VFIO_DEVICE_PASID_DETACH_IOMMUFD_PT
>> or device fd
>> + * close. This is only allowed on cdev fds.
>> + *
>> + * If a pasid is currently attached to a valid hw_pagetable (hwpt),
> 
> remove 'hw_pagetable'. the abbreviation "hwpt" has been used
> throughout this file (e.g. even when explaining @pt_id in earlier place).
> 

ok

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
@ 2024-04-16  9:40   ` Tian, Kevin
  2024-04-16 17:57   ` Alex Williamson
  2024-04-23 12:39   ` Jason Gunthorpe
  2 siblings, 0 replies; 72+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:40 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, April 12, 2024 4:21 PM
> 
> +static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
> +				       struct vfio_device_feature_pasid __user
> *arg,
> +				       size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct vfio_device_feature_pasid pasid = { 0 };
> +	struct pci_dev *pdev = vdev->pdev;
> +	u32 capabilities = 0;
> +	u16 ctrl = 0;
> +	int ret;
> +
> +	/*
> +	 * Due to no PASID capability per VF, to be consistent, we do not
> +	 * support SET of the PASID capability for both PF and VF.
> +	 */

/* Disallow SET of the PASID capability given it is shared by all VF's 
 * and configured implicitly by the IOMMU driver.
 */

> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(pasid));
> +	if (ret != 1)
> +		return ret;
> +
> +	/* VF shares the PASID capability of its PF */
> +	if (pdev->is_virtfn)
> +		pdev = pci_physfn(pdev);
> +
> +	if (!pdev->pasid_enabled)
> +		goto out;
> +
> +#ifdef CONFIG_PCI_PASID
> +	pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
> +			      &capabilities);
> +	pci_read_config_word(pdev, pdev->pasid_cap + PCI_PASID_CTRL,
> +			     &ctrl);
> +#endif
> +
> +	pasid.width = (capabilities >> 8) & 0x1f;

it's cleaner to have helpers instead of directly checking CONFIG_PCI_XXX here.

there is an existing helper for the width: pci_max_pasids()

pci_pasid_features() can report supported features but not the actual
enabled set.

for enabled features it's already stored in pdev->pasid_features. so what's
required here is probably a new pci_pasid_enabled_features() to return
that field.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-16  9:24     ` Yi Liu
@ 2024-04-16  9:47       ` Tian, Kevin
  2024-04-18  7:04         ` Yi Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-16  9:47 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun, Matthew Wilcox

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Tuesday, April 16, 2024 5:25 PM
> 
> On 2024/4/16 17:01, Tian, Kevin wrote:
> >> From: Liu, Yi L <yi.l.liu@intel.com>
> >> Sent: Friday, April 12, 2024 4:21 PM
> >>
> >> +
> >> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> >> +	if (rc == pasid)
> >> +		return iommufd_device_pasid_replace(vdev-
> >>> iommufd_device,
> >> +						    pasid, pt_id);
> >> +
> >> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
> >> pt_id);
> >> +	if (rc)
> >> +		return rc;
> >> +
> >> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
> >> +	if (rc < 0) {
> >> +		iommufd_device_pasid_detach(vdev->iommufd_device,
> >> pasid);
> >> +		return rc;
> >> +	}
> >
> > I'd do simple operation (ida_alloc_range()) first before doing attach.
> >
> 
> But that means we rely on the ida_alloc_range() to return -ENOSPC to
> indicate the pasid is allocated, hence this attach is actually a
> replacement. This is easy to be broken if ida_alloc_range() returns
> -ENOSPC for other reasons in future.
> 

ida_alloc_range() could fail for other reasons e.g. -ENOMEM.

in case I didn't make it clear I just meant to swap the order
between iommufd_device_pasid_attach() and ida_alloc_range().

replacement is still checked against ida_get_lowest().

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
@ 2024-04-16 16:03   ` Alex Williamson
  2024-04-18  7:02     ` Yi Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-16 16:03 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Fri, 12 Apr 2024 01:21:18 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> There is no helpers for user to check if a given ID is allocated or not,
> neither a helper to loop all the allocated IDs in an IDA and do something
> for cleanup. With the two needs, a helper to get the lowest allocated ID
> of a range can help to achieve it.
> 
> Caller can check if a given ID is allocated or not by:
> 	int id = 200, rc;
> 
> 	rc = ida_get_lowest(&ida, id, id);
> 	if (rc == id)
> 		//id 200 is used
> 	else
> 		//id 200 is not used
> 
> Caller can iterate all allocated IDs by:
> 	int id = 0;
> 
> 	while (!ida_is_empty(&pasid_ida)) {
> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
> 		if (id < 0)
> 			break;
> 		//anything to do with the allocated ID
> 		ida_free(pasid_ida, pasid);
> 	}
> 
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/linux/idr.h |  1 +
>  lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 68 insertions(+)
> 
> diff --git a/include/linux/idr.h b/include/linux/idr.h
> index da5f5fa4a3a6..1dae71d4a75d 100644
> --- a/include/linux/idr.h
> +++ b/include/linux/idr.h
> @@ -257,6 +257,7 @@ struct ida {
>  int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
>  void ida_free(struct ida *, unsigned int id);
>  void ida_destroy(struct ida *ida);
> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
>  
>  /**
>   * ida_alloc() - Allocate an unused ID.
> diff --git a/lib/idr.c b/lib/idr.c
> index da36054c3ca0..03e461242fe2 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
>  }
>  EXPORT_SYMBOL(ida_alloc_range);
>  
> +/**
> + * ida_get_lowest - Get the lowest used ID.
> + * @ida: IDA handle.
> + * @min: Lowest ID to get.
> + * @max: Highest ID to get.
> + *
> + * Get the lowest used ID between @min and @max, inclusive.  The returned
> + * ID will not exceed %INT_MAX, even if @max is larger.
> + *
> + * Context: Any context. Takes and releases the xa_lock.
> + * Return: The lowest used ID, or errno if no used ID is found.
> + */
> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
> +{
> +	unsigned long index = min / IDA_BITMAP_BITS;
> +	unsigned int offset = min % IDA_BITMAP_BITS;
> +	unsigned long *addr, size, bit;
> +	unsigned long flags;
> +	void *entry;
> +	int ret;
> +
> +	if (min >= INT_MAX)
> +		return -EINVAL;
> +	if (max >= INT_MAX)
> +		max = INT_MAX;
> +

Could these be made consistent with the test in ida_alloc_range(), ie:

	if ((int)min < 0)
		return -EINVAL;
	if ((int)max < 0)
		max = INT_MAX;


> +	xa_lock_irqsave(&ida->xa, flags);
> +
> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
> +	if (!entry) {
> +		ret = -ENOTTY;

-ENOENT?  Same for all below too.

> +		goto err_unlock;
> +	}
> +
> +	if (index > min / IDA_BITMAP_BITS)
> +		offset = 0;
> +	if (index * IDA_BITMAP_BITS + offset > max) {
> +		ret = -ENOTTY;
> +		goto err_unlock;
> +	}
> +
> +	if (xa_is_value(entry)) {
> +		unsigned long tmp = xa_to_value(entry);
> +
> +		addr = &tmp;
> +		size = BITS_PER_XA_VALUE;
> +	} else {
> +		addr = ((struct ida_bitmap *)entry)->bitmap;
> +		size = IDA_BITMAP_BITS;
> +	}
> +
> +	bit = find_next_bit(addr, size, offset);
> +
> +	xa_unlock_irqrestore(&ida->xa, flags);
> +
> +	if (bit == size ||
> +	    index * IDA_BITMAP_BITS + bit > max)
> +		return -ENOTTY;
> +
> +	return index * IDA_BITMAP_BITS + bit;
> +
> +err_unlock:
> +	xa_unlock_irqrestore(&ida->xa, flags);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ida_get_lowest);

The API is a bit awkward to me, I wonder if it might be helped with
some renaming and wrappers...

int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);

bool ida_exists(struct ida *ida, unsigned int id)
{
	return ida_find_first_range(ida, id, id) == id;
}

int ida_find_first(struct ida *ida)
{
	return ida_find_first_range(ida, 0, ~0);
}

_min and _max variations of the latter would align with existing
ida_alloc variants, but maybe no need to add them preemptively.

Possibly an ida_for_each() could be useful in the use case of
disassociating each id, but not required for the brute force iterative
method.  Thanks,

Alex

> +
>  /**
>   * ida_free() - Release an allocated ID.
>   * @ida: IDA handle.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
@ 2024-04-16 17:50   ` Jason Gunthorpe
  2024-04-17  7:16     ` Tian, Kevin
  0 siblings, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-16 17:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, April 12, 2024 4:21 PM
> > 
> > A userspace VMM is supposed to get the details of the device's PASID
> > capability
> > and assemble a virtual PASID capability in a proper offset in the virtual PCI
> > configuration space. While it is still an open on how to get the available
> > offsets. Devices may have hidden bits that are not in the PCI cap chain. For
> > now, there are two options to get the available offsets.[2]
> > 
> > - Report the available offsets via ioctl. This requires device-specific logic
> >   to provide available offsets. e.g., vfio-pci variant driver. Or may the device
> >   provide the available offset by DVSEC.
> > - Store the available offsets in a static table in userspace VMM. VMM gets the
> >   empty offsets from this table.
> > 
> 
> I'm not a fan of requesting a variant driver for every PASID-capable
> VF just for the purpose of reporting a free range in the PCI config space.
> 
> It's easier to do that quirk in userspace.
> 
> But I like Alex's original comment that at least for PF there is no reason
> to hide the offset. there could be a flag+field to communicate it. or 
> if there will be a new variant VF driver for other purposes e.g. migration
> it can certainly fill the field too.

Yes, since this has been such a sticking point can we get a clean
series that just enables it for PF and then come with a solution for
VF?

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
  2024-04-16  9:40   ` Tian, Kevin
@ 2024-04-16 17:57   ` Alex Williamson
  2024-04-17  7:09     ` Tian, Kevin
  2024-04-23 12:39   ` Jason Gunthorpe
  2 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-16 17:57 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan

On Fri, 12 Apr 2024 01:21:21 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> other PCI capabilities, PASID capability is going to be reported to user by
> VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by it.
> This is a bit different from the other capabilities which are reported to
> userspace when the user reads the device's PCI configuration space. There
> are two reasons for this.
> 
>  - First, userspace like Qemu by default exposes all the available PCI
>    capabilities in vfio-pci config space to the guest as read-only, so
>    adding PASID capability in the vfio-pci config space will make it
>    exposed to the guest automatically while an old Qemu doesn't really
>    support it.
> 
>  - Second, the PASID capability does not exist on VFs (instead shares the
>    cap of the PF). Creating a virtual PASID capability in vfio-pci config
>    space needs to find a hole to place it, but doing so may require device
>    specific knowledge to avoid potential conflict with device specific
>    registers like hidden bits in VF's config space. It's simpler to move
>    this burden to the VMM instead of maintaining a quirk system in the kernel.
> 
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 50 ++++++++++++++++++++++++++++++++
>  include/uapi/linux/vfio.h        | 14 +++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d94d61b92c1a..ca64e461d435 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1495,6 +1495,54 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  	return 0;
>  }
>  
> +static int vfio_pci_core_feature_pasid(struct vfio_device *device, u32 flags,
> +				       struct vfio_device_feature_pasid __user *arg,
> +				       size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct vfio_device_feature_pasid pasid = { 0 };
> +	struct pci_dev *pdev = vdev->pdev;
> +	u32 capabilities = 0;
> +	u16 ctrl = 0;
> +	int ret;
> +
> +	/*
> +	 * Due to no PASID capability per VF, to be consistent, we do not
> +	 * support SET of the PASID capability for both PF and VF.
> +	 */
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(pasid));
> +	if (ret != 1)
> +		return ret;
> +
> +	/* VF shares the PASID capability of its PF */
> +	if (pdev->is_virtfn)
> +		pdev = pci_physfn(pdev);
> +
> +	if (!pdev->pasid_enabled)
> +		goto out;
> +
> +#ifdef CONFIG_PCI_PASID
> +	pci_read_config_dword(pdev, pdev->pasid_cap + PCI_PASID_CAP,
> +			      &capabilities);
> +	pci_read_config_word(pdev, pdev->pasid_cap + PCI_PASID_CTRL,
> +			     &ctrl);
> +#endif
> +
> +	pasid.width = (capabilities >> 8) & 0x1f;
> +
> +	if (ctrl & PCI_PASID_CTRL_EXEC)
> +		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_EXEC;
> +	if (ctrl & PCI_PASID_CTRL_PRIV)
> +		pasid.capabilities |= VFIO_DEVICE_PASID_CAP_PRIV;

I agree with Kevin here, let's make use of and add helpers to avoid
#ifdef blocks of code.

> +
> +out:
> +	if (copy_to_user(arg, &pasid, sizeof(pasid)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
>  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  				void __user *arg, size_t argsz)
>  {
> @@ -1508,6 +1556,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_PASID:
> +		return vfio_pci_core_feature_pasid(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9591dc24b75c..e50e55c67ab4 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1513,6 +1513,20 @@ struct vfio_device_feature_bus_master {
>  };
>  #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
>  
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET, return the PASID capability for the device.
> + * Zero width means no support for PASID.

Why would we do that rather than reporting the feature as unsupported?
Just return -ENOTTY if PASID is not supported or enabled.

> + */
> +struct vfio_device_feature_pasid {
> +	__u16 capabilities;
> +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> +	__u8 width;
> +	__u8 __reserved;
> +};

Building on Kevin's comment on the cover letter, if we could describe
an offset for emulating a PASID capability, this seems like the place
we'd do it.  I think we're not doing that because we'd like an in-band
mechanism for a device to report unused config space, such as a DVSEC
capability, so that it can be implemented on a physical device.  As
noted in the commit log here, we'd also prefer not to bloat the kernel
with more device quirks.

In an ideal world we might be able to jump start support of that DVSEC
option by emulating the DVSEC capability on top of the PASID capability
for PFs, but unfortunately the PASID capability is 8 bytes while the
DVSEC capability is at least 12 bytes, so we can't implement that
generically either.

I don't know there's any good solution here or whether there's actually
any value to the PASID capability on a PF, but do we need to consider
leaving a field+flag here to describe the offset for that scenario?
Would we then allow variant drivers to take advantage of it?  Does this
then turn into the quirk that we're trying to avoid in the kernel
rather than userspace and is that a problem?  Thanks,

Alex

> +
> +#define VFIO_DEVICE_FEATURE_PASID 11
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-16 17:57   ` Alex Williamson
@ 2024-04-17  7:09     ` Tian, Kevin
  2024-04-17 20:25       ` Alex Williamson
  0 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-17  7:09 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: jgg, joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 17, 2024 1:57 AM
> 
> On Fri, 12 Apr 2024 01:21:21 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > + */
> > +struct vfio_device_feature_pasid {
> > +	__u16 capabilities;
> > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > +	__u8 width;
> > +	__u8 __reserved;
> > +};
> 
> Building on Kevin's comment on the cover letter, if we could describe
> an offset for emulating a PASID capability, this seems like the place
> we'd do it.  I think we're not doing that because we'd like an in-band
> mechanism for a device to report unused config space, such as a DVSEC
> capability, so that it can be implemented on a physical device.  As
> noted in the commit log here, we'd also prefer not to bloat the kernel
> with more device quirks.
> 
> In an ideal world we might be able to jump start support of that DVSEC
> option by emulating the DVSEC capability on top of the PASID capability
> for PFs, but unfortunately the PASID capability is 8 bytes while the
> DVSEC capability is at least 12 bytes, so we can't implement that
> generically either.

Yeah, that's a problem.

> 
> I don't know there's any good solution here or whether there's actually
> any value to the PASID capability on a PF, but do we need to consider
> leaving a field+flag here to describe the offset for that scenario?

Yes, I prefer to this way.

> Would we then allow variant drivers to take advantage of it?  Does this
> then turn into the quirk that we're trying to avoid in the kernel
> rather than userspace and is that a problem?  Thanks,
> 

We don't want to proactively pursue quirks in the kernel.

But if a variant driver exists for other reasons, I don't see why it 
should be prohibited from deciding an offset to ease the
userspace. 😊

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-16 17:50   ` Jason Gunthorpe
@ 2024-04-17  7:16     ` Tian, Kevin
  2024-04-17 12:20       ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-17  7:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, alex.williamson, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 17, 2024 1:50 AM
> 
> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, April 12, 2024 4:21 PM
> > >
> > > A userspace VMM is supposed to get the details of the device's PASID
> > > capability
> > > and assemble a virtual PASID capability in a proper offset in the virtual
> PCI
> > > configuration space. While it is still an open on how to get the available
> > > offsets. Devices may have hidden bits that are not in the PCI cap chain.
> For
> > > now, there are two options to get the available offsets.[2]
> > >
> > > - Report the available offsets via ioctl. This requires device-specific logic
> > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> device
> > >   provide the available offset by DVSEC.
> > > - Store the available offsets in a static table in userspace VMM. VMM gets
> the
> > >   empty offsets from this table.
> > >
> >
> > I'm not a fan of requesting a variant driver for every PASID-capable
> > VF just for the purpose of reporting a free range in the PCI config space.
> >
> > It's easier to do that quirk in userspace.
> >
> > But I like Alex's original comment that at least for PF there is no reason
> > to hide the offset. there could be a flag+field to communicate it. or
> > if there will be a new variant VF driver for other purposes e.g. migration
> > it can certainly fill the field too.
> 
> Yes, since this has been such a sticking point can we get a clean
> series that just enables it for PF and then come with a solution for
> VF?
> 

sure but we at least need to reach consensus on a minimal required
uapi covering both PF/VF to move forward so the user doesn't need
to touch different contracts for PF vs. VF.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17  7:16     ` Tian, Kevin
@ 2024-04-17 12:20       ` Jason Gunthorpe
  2024-04-17 23:02         ` Alex Williamson
  0 siblings, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-17 12:20 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 17, 2024 1:50 AM
> > 
> > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Friday, April 12, 2024 4:21 PM
> > > >
> > > > A userspace VMM is supposed to get the details of the device's PASID
> > > > capability
> > > > and assemble a virtual PASID capability in a proper offset in the virtual
> > PCI
> > > > configuration space. While it is still an open on how to get the available
> > > > offsets. Devices may have hidden bits that are not in the PCI cap chain.
> > For
> > > > now, there are two options to get the available offsets.[2]
> > > >
> > > > - Report the available offsets via ioctl. This requires device-specific logic
> > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> > device
> > > >   provide the available offset by DVSEC.
> > > > - Store the available offsets in a static table in userspace VMM. VMM gets
> > the
> > > >   empty offsets from this table.
> > > >
> > >
> > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > VF just for the purpose of reporting a free range in the PCI config space.
> > >
> > > It's easier to do that quirk in userspace.
> > >
> > > But I like Alex's original comment that at least for PF there is no reason
> > > to hide the offset. there could be a flag+field to communicate it. or
> > > if there will be a new variant VF driver for other purposes e.g. migration
> > > it can certainly fill the field too.
> > 
> > Yes, since this has been such a sticking point can we get a clean
> > series that just enables it for PF and then come with a solution for
> > VF?
> > 
> 
> sure but we at least need to reach consensus on a minimal required
> uapi covering both PF/VF to move forward so the user doesn't need
> to touch different contracts for PF vs. VF.

Do we? The situation where the VMM needs to wholly make a up a PASID
capability seems completely new and seperate from just using an
existing PASID capability as in the PF case.

If it needs to make another system call or otherwise to do that then
that seems fine to do incrementally??

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-17  7:09     ` Tian, Kevin
@ 2024-04-17 20:25       ` Alex Williamson
  2024-04-18  0:21         ` Tian, Kevin
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-17 20:25 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, jgg, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On Wed, 17 Apr 2024 07:09:52 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 17, 2024 1:57 AM
> > 
> > On Fri, 12 Apr 2024 01:21:21 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > + */
> > > +struct vfio_device_feature_pasid {
> > > +	__u16 capabilities;
> > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > +	__u8 width;
> > > +	__u8 __reserved;
> > > +};  
> > 
> > Building on Kevin's comment on the cover letter, if we could describe
> > an offset for emulating a PASID capability, this seems like the place
> > we'd do it.  I think we're not doing that because we'd like an in-band
> > mechanism for a device to report unused config space, such as a DVSEC
> > capability, so that it can be implemented on a physical device.  As
> > noted in the commit log here, we'd also prefer not to bloat the kernel
> > with more device quirks.
> > 
> > In an ideal world we might be able to jump start support of that DVSEC
> > option by emulating the DVSEC capability on top of the PASID capability
> > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > DVSEC capability is at least 12 bytes, so we can't implement that
> > generically either.  
> 
> Yeah, that's a problem.
> 
> > 
> > I don't know there's any good solution here or whether there's actually
> > any value to the PASID capability on a PF, but do we need to consider
> > leaving a field+flag here to describe the offset for that scenario?  
> 
> Yes, I prefer to this way.
> 
> > Would we then allow variant drivers to take advantage of it?  Does this
> > then turn into the quirk that we're trying to avoid in the kernel
> > rather than userspace and is that a problem?  Thanks,
> >   
> 
> We don't want to proactively pursue quirks in the kernel.
> 
> But if a variant driver exists for other reasons, I don't see why it 
> should be prohibited from deciding an offset to ease the
> userspace. 😊

At that point we've turned the corner into an arbitrary policy decision
that I can't defend.  A "worthy" variant driver can implement something
through a side channel vfio API, but implementing that side channel
itself is not enough to justify a variant driver?  It doesn't make
sense.

Further, if we have a variant driver, why do we need a side channel for
the purpose of describing available config space when we expect devices
themselves to eventually describe the same through a DVSEC capability?
The purpose of enabling variant drivers is to enhance the functionality
of the device.  Adding an emulated DVSEC capability seems like a valid
enhancement to justify a variant driver to me.

So the more I think about it, it would be easy to add something here
that hints a location for an emulated PASID capability in the VMM, but
it would also be counterproductive to an end goal of having a DVSEC
capability that describes unused config space.  The very narrow scope
where that side-band channel would be useful is an unknown PF device
which doesn't implement a DVSEC capability and without intervention
simply behaves as it always has, without PASID support.

A vendor desiring such support can a) implement DVSEC in the hardware,
b) implement a variant driver emulating a DVSEC capability, or c)
directly modify the VMM to tell it where to place the PASID capability.
I also don't think we should exclude the possibility that b) could turn
into a shared variant driver that knows about multiple devices and has
a table of free config space for each.  Option c) is only the last
resort if there's not already 12 bytes of contiguous, aligned free
space to place a DVSEC capability.  That seems unlikely.

At some point we need to define the format and use of this DVSEC.  Do
we allow (not require) one at every gap in config space that's at least
12-bytes long and adjust the DVSEC Length to describe longer gaps, or do
we use a single DVSEC to describe a table of ranges throughout extended
(maybe even conventional) config space?  The former seems easier,
especially if we expect a device has a large block of free space,
enough for multiple emulated capabilities and described by a single
DVSEC.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17 12:20       ` Jason Gunthorpe
@ 2024-04-17 23:02         ` Alex Williamson
  2024-04-18  0:06           ` Tian, Kevin
  2024-04-19 13:34           ` Jason Gunthorpe
  0 siblings, 2 replies; 72+ messages in thread
From: Alex Williamson @ 2024-04-17 23:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Wed, 17 Apr 2024 09:20:51 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, April 17, 2024 1:50 AM
> > > 
> > > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Friday, April 12, 2024 4:21 PM
> > > > >
> > > > > A userspace VMM is supposed to get the details of the device's PASID
> > > > > capability
> > > > > and assemble a virtual PASID capability in a proper offset in the virtual  
> > > PCI  
> > > > > configuration space. While it is still an open on how to get the available
> > > > > offsets. Devices may have hidden bits that are not in the PCI cap chain.  
> > > For  
> > > > > now, there are two options to get the available offsets.[2]
> > > > >
> > > > > - Report the available offsets via ioctl. This requires device-specific logic
> > > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> > > device  
> > > > >   provide the available offset by DVSEC.
> > > > > - Store the available offsets in a static table in userspace VMM. VMM gets  
> > > the  
> > > > >   empty offsets from this table.
> > > > >  
> > > >
> > > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > > VF just for the purpose of reporting a free range in the PCI config space.
> > > >
> > > > It's easier to do that quirk in userspace.
> > > >
> > > > But I like Alex's original comment that at least for PF there is no reason
> > > > to hide the offset. there could be a flag+field to communicate it. or
> > > > if there will be a new variant VF driver for other purposes e.g. migration
> > > > it can certainly fill the field too.  
> > > 
> > > Yes, since this has been such a sticking point can we get a clean
> > > series that just enables it for PF and then come with a solution for
> > > VF?
> > >   
> > 
> > sure but we at least need to reach consensus on a minimal required
> > uapi covering both PF/VF to move forward so the user doesn't need
> > to touch different contracts for PF vs. VF.  
> 
> Do we? The situation where the VMM needs to wholly make a up a PASID
> capability seems completely new and seperate from just using an
> existing PASID capability as in the PF case.

But we don't actually expose the PASID capability on the PF and as
argued in path 4/ we can't because it would break existing userspace.

> If it needs to make another system call or otherwise to do that then
> that seems fine to do incrementally??

With PASID currently hidden, VF and PF support really seem too similar
not to handle them both at the same time.  What's missing is a
mechanism to describe unused config space where userspace can implement
an emulated PASID capability.  Defining a DVSEC capability to handle
this seemed to be the preferred solution from the LPC discussion.  PF
and VF devices that implement a TBD DVSEC capability could avoid
needing a variant driver.  Until then we need to be careful not to
undermine a future world with that preferred solution by introducing
side-band mechanisms which only work for variant drivers and PF
devices.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17 23:02         ` Alex Williamson
@ 2024-04-18  0:06           ` Tian, Kevin
  2024-04-18  9:03             ` Yi Liu
  2024-04-19 13:34           ` Jason Gunthorpe
  1 sibling, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-18  0:06 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 18, 2024 7:02 AM
> 
> On Wed, 17 Apr 2024 09:20:51 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, April 17, 2024 1:50 AM
> > > >
> > > > On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
> > > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > > Sent: Friday, April 12, 2024 4:21 PM
> > > > > >
> > > > > > A userspace VMM is supposed to get the details of the device's
> PASID
> > > > > > capability
> > > > > > and assemble a virtual PASID capability in a proper offset in the
> virtual
> > > > PCI
> > > > > > configuration space. While it is still an open on how to get the
> available
> > > > > > offsets. Devices may have hidden bits that are not in the PCI cap
> chain.
> > > > For
> > > > > > now, there are two options to get the available offsets.[2]
> > > > > >
> > > > > > - Report the available offsets via ioctl. This requires device-specific
> logic
> > > > > >   to provide available offsets. e.g., vfio-pci variant driver. Or may the
> > > > device
> > > > > >   provide the available offset by DVSEC.
> > > > > > - Store the available offsets in a static table in userspace VMM.
> VMM gets
> > > > the
> > > > > >   empty offsets from this table.
> > > > > >
> > > > >
> > > > > I'm not a fan of requesting a variant driver for every PASID-capable
> > > > > VF just for the purpose of reporting a free range in the PCI config
> space.
> > > > >
> > > > > It's easier to do that quirk in userspace.
> > > > >
> > > > > But I like Alex's original comment that at least for PF there is no
> reason
> > > > > to hide the offset. there could be a flag+field to communicate it. or
> > > > > if there will be a new variant VF driver for other purposes e.g.
> migration
> > > > > it can certainly fill the field too.
> > > >
> > > > Yes, since this has been such a sticking point can we get a clean
> > > > series that just enables it for PF and then come with a solution for
> > > > VF?
> > > >
> > >
> > > sure but we at least need to reach consensus on a minimal required
> > > uapi covering both PF/VF to move forward so the user doesn't need
> > > to touch different contracts for PF vs. VF.
> >
> > Do we? The situation where the VMM needs to wholly make a up a PASID
> > capability seems completely new and seperate from just using an
> > existing PASID capability as in the PF case.
> 
> But we don't actually expose the PASID capability on the PF and as
> argued in path 4/ we can't because it would break existing userspace.

Come back to this statement.

Does 'break' means that legacy Qemu will crash due to a guest write
to the read-only PASID capability, or just a conceptually functional
break i.e. non-faithful emulation due to writes being dropped?

If the latter it's probably not a bad idea to allow exposing the PASID
capability on the PF as a sane guest shouldn't enable the PASID
capability w/o seeing vIOMMU supporting PASID. And there is no
status bit defined in the PASID capability to check back so even
if an insane guest wants to blindly enable PASID it will naturally
write and done. The only niche case is that the enable bits are
defined as RW so ideally reading back those bits should get the
latest written value. But probably this can be tolerated?

With that then should we consider exposing the PASID capability
in PCI config space as the first option? For PF it's simple as how
other caps are exposed. For VF a variant driver can also fake the
PASID capability or emulate a DVSEC capability for unused space
(to motivate the physical implementation so no variant driver is
required in the future)

This allows a staging approach as Jason envisioned. 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-17 20:25       ` Alex Williamson
@ 2024-04-18  0:21         ` Tian, Kevin
  2024-04-18  8:23           ` Yi Liu
  2024-04-18 16:34           ` Alex Williamson
  0 siblings, 2 replies; 72+ messages in thread
From: Tian, Kevin @ 2024-04-18  0:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, jgg, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 18, 2024 4:26 AM
> 
> On Wed, 17 Apr 2024 07:09:52 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, April 17, 2024 1:57 AM
> > >
> > > On Fri, 12 Apr 2024 01:21:21 -0700
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > + */
> > > > +struct vfio_device_feature_pasid {
> > > > +	__u16 capabilities;
> > > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > > +	__u8 width;
> > > > +	__u8 __reserved;
> > > > +};
> > >
> > > Building on Kevin's comment on the cover letter, if we could describe
> > > an offset for emulating a PASID capability, this seems like the place
> > > we'd do it.  I think we're not doing that because we'd like an in-band
> > > mechanism for a device to report unused config space, such as a DVSEC
> > > capability, so that it can be implemented on a physical device.  As
> > > noted in the commit log here, we'd also prefer not to bloat the kernel
> > > with more device quirks.
> > >
> > > In an ideal world we might be able to jump start support of that DVSEC
> > > option by emulating the DVSEC capability on top of the PASID capability
> > > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > > DVSEC capability is at least 12 bytes, so we can't implement that
> > > generically either.
> >
> > Yeah, that's a problem.
> >
> > >
> > > I don't know there's any good solution here or whether there's actually
> > > any value to the PASID capability on a PF, but do we need to consider
> > > leaving a field+flag here to describe the offset for that scenario?
> >
> > Yes, I prefer to this way.
> >
> > > Would we then allow variant drivers to take advantage of it?  Does this
> > > then turn into the quirk that we're trying to avoid in the kernel
> > > rather than userspace and is that a problem?  Thanks,
> > >
> >
> > We don't want to proactively pursue quirks in the kernel.
> >
> > But if a variant driver exists for other reasons, I don't see why it
> > should be prohibited from deciding an offset to ease the
> > userspace. 😊
> 
> At that point we've turned the corner into an arbitrary policy decision
> that I can't defend.  A "worthy" variant driver can implement something
> through a side channel vfio API, but implementing that side channel
> itself is not enough to justify a variant driver?  It doesn't make
> sense.
> 
> Further, if we have a variant driver, why do we need a side channel for
> the purpose of describing available config space when we expect devices
> themselves to eventually describe the same through a DVSEC capability?
> The purpose of enabling variant drivers is to enhance the functionality
> of the device.  Adding an emulated DVSEC capability seems like a valid
> enhancement to justify a variant driver to me.
> 
> So the more I think about it, it would be easy to add something here
> that hints a location for an emulated PASID capability in the VMM, but
> it would also be counterproductive to an end goal of having a DVSEC
> capability that describes unused config space.  The very narrow scope
> where that side-band channel would be useful is an unknown PF device
> which doesn't implement a DVSEC capability and without intervention
> simply behaves as it always has, without PASID support.
> 
> A vendor desiring such support can a) implement DVSEC in the hardware,
> b) implement a variant driver emulating a DVSEC capability, or c)
> directly modify the VMM to tell it where to place the PASID capability.
> I also don't think we should exclude the possibility that b) could turn
> into a shared variant driver that knows about multiple devices and has
> a table of free config space for each.  Option c) is only the last
> resort if there's not already 12 bytes of contiguous, aligned free
> space to place a DVSEC capability.  That seems unlikely.

or b) could be a table in vfio_pci_config.c i.e. kind of making vfio-pci
as the shared variant driver.

> 
> At some point we need to define the format and use of this DVSEC.  Do
> we allow (not require) one at every gap in config space that's at least
> 12-bytes long and adjust the DVSEC Length to describe longer gaps, or do

Does PCI spec allows multiple same-type capabilities co-existing?

> we use a single DVSEC to describe a table of ranges throughout extended
> (maybe even conventional) config space?  The former seems easier,

this might be challenging as the table itself requires a contiguous
large free block.

> especially if we expect a device has a large block of free space,
> enough for multiple emulated capabilities and described by a single
> DVSEC.  Thanks,
> 

yes that sounds simpler.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-16 16:03   ` Alex Williamson
@ 2024-04-18  7:02     ` Yi Liu
  2024-04-18 16:23       ` Alex Williamson
  0 siblings, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-18  7:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/17 00:03, Alex Williamson wrote:
> On Fri, 12 Apr 2024 01:21:18 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> There is no helpers for user to check if a given ID is allocated or not,
>> neither a helper to loop all the allocated IDs in an IDA and do something
>> for cleanup. With the two needs, a helper to get the lowest allocated ID
>> of a range can help to achieve it.
>>
>> Caller can check if a given ID is allocated or not by:
>> 	int id = 200, rc;
>>
>> 	rc = ida_get_lowest(&ida, id, id);
>> 	if (rc == id)
>> 		//id 200 is used
>> 	else
>> 		//id 200 is not used
>>
>> Caller can iterate all allocated IDs by:
>> 	int id = 0;
>>
>> 	while (!ida_is_empty(&pasid_ida)) {
>> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
>> 		if (id < 0)
>> 			break;
>> 		//anything to do with the allocated ID
>> 		ida_free(pasid_ida, pasid);
>> 	}
>>
>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   include/linux/idr.h |  1 +
>>   lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 68 insertions(+)
>>
>> diff --git a/include/linux/idr.h b/include/linux/idr.h
>> index da5f5fa4a3a6..1dae71d4a75d 100644
>> --- a/include/linux/idr.h
>> +++ b/include/linux/idr.h
>> @@ -257,6 +257,7 @@ struct ida {
>>   int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
>>   void ida_free(struct ida *, unsigned int id);
>>   void ida_destroy(struct ida *ida);
>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
>>   
>>   /**
>>    * ida_alloc() - Allocate an unused ID.
>> diff --git a/lib/idr.c b/lib/idr.c
>> index da36054c3ca0..03e461242fe2 100644
>> --- a/lib/idr.c
>> +++ b/lib/idr.c
>> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
>>   }
>>   EXPORT_SYMBOL(ida_alloc_range);
>>   
>> +/**
>> + * ida_get_lowest - Get the lowest used ID.
>> + * @ida: IDA handle.
>> + * @min: Lowest ID to get.
>> + * @max: Highest ID to get.
>> + *
>> + * Get the lowest used ID between @min and @max, inclusive.  The returned
>> + * ID will not exceed %INT_MAX, even if @max is larger.
>> + *
>> + * Context: Any context. Takes and releases the xa_lock.
>> + * Return: The lowest used ID, or errno if no used ID is found.
>> + */
>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
>> +{
>> +	unsigned long index = min / IDA_BITMAP_BITS;
>> +	unsigned int offset = min % IDA_BITMAP_BITS;
>> +	unsigned long *addr, size, bit;
>> +	unsigned long flags;
>> +	void *entry;
>> +	int ret;
>> +
>> +	if (min >= INT_MAX)
>> +		return -EINVAL;
>> +	if (max >= INT_MAX)
>> +		max = INT_MAX;
>> +
> 
> Could these be made consistent with the test in ida_alloc_range(), ie:
> 
> 	if ((int)min < 0)
> 		return -EINVAL;
> 	if ((int)max < 0)
> 		max = INT_MAX;
> 

sure.

>> +	xa_lock_irqsave(&ida->xa, flags);
>> +
>> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
>> +	if (!entry) {
>> +		ret = -ENOTTY;
> 
> -ENOENT?  Same for all below too.

I see.

>> +		goto err_unlock;
>> +	}
>> +
>> +	if (index > min / IDA_BITMAP_BITS)
>> +		offset = 0;
>> +	if (index * IDA_BITMAP_BITS + offset > max) {
>> +		ret = -ENOTTY;
>> +		goto err_unlock;
>> +	}
>> +
>> +	if (xa_is_value(entry)) {
>> +		unsigned long tmp = xa_to_value(entry);
>> +
>> +		addr = &tmp;
>> +		size = BITS_PER_XA_VALUE;
>> +	} else {
>> +		addr = ((struct ida_bitmap *)entry)->bitmap;
>> +		size = IDA_BITMAP_BITS;
>> +	}
>> +
>> +	bit = find_next_bit(addr, size, offset);
>> +
>> +	xa_unlock_irqrestore(&ida->xa, flags);
>> +
>> +	if (bit == size ||
>> +	    index * IDA_BITMAP_BITS + bit > max)
>> +		return -ENOTTY;
>> +
>> +	return index * IDA_BITMAP_BITS + bit;
>> +
>> +err_unlock:
>> +	xa_unlock_irqrestore(&ida->xa, flags);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(ida_get_lowest);
> 
> The API is a bit awkward to me, I wonder if it might be helped with
> some renaming and wrappers...
> 
> int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);

ok.

> bool ida_exists(struct ida *ida, unsigned int id)
> {
> 	return ida_find_first_range(ida, id, id) == id;
> }

this does helps in next patch.

> 
> int ida_find_first(struct ida *ida)
> {
> 	return ida_find_first_range(ida, 0, ~0);
> }
>

perhaps it can be added in future. This series has two usages. One is to
check if a given ID is allocated. This can be covered by your ida_exists().
Another usage is to loop each IDs, do detach and free. This can still use
the ida_find_first_range() like the example in the commit message. The
first loop starts from 0, and next would start from the last found ID.
This may be more efficient than starts from 0 in every loop.


> _min and _max variations of the latter would align with existing
> ida_alloc variants, but maybe no need to add them preemptively.

yes.

> Possibly an ida_for_each() could be useful in the use case of
> disassociating each id, but not required for the brute force iterative
> method.  Thanks,

yep. maybe we can start with the below code, no need for ida_for_each()
today.


  	int id = 0;

  	while (!ida_is_empty(&pasid_ida)) {
  		id = ida_find_first_range(pasid_ida, id, INT_MAX);
  		if (unlikely(WARN_ON(id < 0))
			break;
  		iommufd_device_pasid_detach();
  		ida_free(pasid_ida, pasid);
  	}

> 
>> +
>>   /**
>>    * ida_free() - Release an allocated ID.
>>    * @ida: IDA handle.
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-16  9:47       ` Tian, Kevin
@ 2024-04-18  7:04         ` Yi Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-18  7:04 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun, Matthew Wilcox



On 2024/4/16 17:47, Tian, Kevin wrote:
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Sent: Tuesday, April 16, 2024 5:25 PM
>>
>> On 2024/4/16 17:01, Tian, Kevin wrote:
>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>> Sent: Friday, April 12, 2024 4:21 PM
>>>>
>>>> +
>>>> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
>>>> +	if (rc == pasid)
>>>> +		return iommufd_device_pasid_replace(vdev-
>>>>> iommufd_device,
>>>> +						    pasid, pt_id);
>>>> +
>>>> +	rc = iommufd_device_pasid_attach(vdev->iommufd_device, pasid,
>>>> pt_id);
>>>> +	if (rc)
>>>> +		return rc;
>>>> +
>>>> +	rc = ida_alloc_range(&vdev->pasids, pasid, pasid, GFP_KERNEL);
>>>> +	if (rc < 0) {
>>>> +		iommufd_device_pasid_detach(vdev->iommufd_device,
>>>> pasid);
>>>> +		return rc;
>>>> +	}
>>>
>>> I'd do simple operation (ida_alloc_range()) first before doing attach.
>>>
>>
>> But that means we rely on the ida_alloc_range() to return -ENOSPC to
>> indicate the pasid is allocated, hence this attach is actually a
>> replacement. This is easy to be broken if ida_alloc_range() returns
>> -ENOSPC for other reasons in future.
>>
> 
> ida_alloc_range() could fail for other reasons e.g. -ENOMEM.
> 
> in case I didn't make it clear I just meant to swap the order
> between iommufd_device_pasid_attach() and ida_alloc_range().
> 
> replacement is still checked against ida_get_lowest().

aha, I see.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-18  0:21         ` Tian, Kevin
@ 2024-04-18  8:23           ` Yi Liu
  2024-04-18 16:34           ` Alex Williamson
  1 sibling, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-18  8:23 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: jgg, joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/18 08:21, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Thursday, April 18, 2024 4:26 AM
>>
>> On Wed, 17 Apr 2024 07:09:52 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>> Sent: Wednesday, April 17, 2024 1:57 AM
>>>>
>>>> On Fri, 12 Apr 2024 01:21:21 -0700
>>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>>
>>>>> + */
>>>>> +struct vfio_device_feature_pasid {
>>>>> +	__u16 capabilities;
>>>>> +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
>>>>> +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
>>>>> +	__u8 width;
>>>>> +	__u8 __reserved;
>>>>> +};
>>>>
>>>> Building on Kevin's comment on the cover letter, if we could describe
>>>> an offset for emulating a PASID capability, this seems like the place
>>>> we'd do it.  I think we're not doing that because we'd like an in-band
>>>> mechanism for a device to report unused config space, such as a DVSEC
>>>> capability, so that it can be implemented on a physical device.  As
>>>> noted in the commit log here, we'd also prefer not to bloat the kernel
>>>> with more device quirks.
>>>>
>>>> In an ideal world we might be able to jump start support of that DVSEC
>>>> option by emulating the DVSEC capability on top of the PASID capability
>>>> for PFs, but unfortunately the PASID capability is 8 bytes while the
>>>> DVSEC capability is at least 12 bytes, so we can't implement that
>>>> generically either.
>>>
>>> Yeah, that's a problem.
>>>
>>>>
>>>> I don't know there's any good solution here or whether there's actually
>>>> any value to the PASID capability on a PF, but do we need to consider
>>>> leaving a field+flag here to describe the offset for that scenario?
>>>
>>> Yes, I prefer to this way.
>>>
>>>> Would we then allow variant drivers to take advantage of it?  Does this
>>>> then turn into the quirk that we're trying to avoid in the kernel
>>>> rather than userspace and is that a problem?  Thanks,
>>>>
>>>
>>> We don't want to proactively pursue quirks in the kernel.
>>>
>>> But if a variant driver exists for other reasons, I don't see why it
>>> should be prohibited from deciding an offset to ease the
>>> userspace. 😊
>>
>> At that point we've turned the corner into an arbitrary policy decision
>> that I can't defend.  A "worthy" variant driver can implement something
>> through a side channel vfio API, but implementing that side channel
>> itself is not enough to justify a variant driver?  It doesn't make
>> sense.
>>
>> Further, if we have a variant driver, why do we need a side channel for
>> the purpose of describing available config space when we expect devices
>> themselves to eventually describe the same through a DVSEC capability?
>> The purpose of enabling variant drivers is to enhance the functionality
>> of the device.  Adding an emulated DVSEC capability seems like a valid
>> enhancement to justify a variant driver to me.
>>
>> So the more I think about it, it would be easy to add something here
>> that hints a location for an emulated PASID capability in the VMM, but
>> it would also be counterproductive to an end goal of having a DVSEC
>> capability that describes unused config space.  The very narrow scope
>> where that side-band channel would be useful is an unknown PF device
>> which doesn't implement a DVSEC capability and without intervention
>> simply behaves as it always has, without PASID support.
>>
>> A vendor desiring such support can a) implement DVSEC in the hardware,
>> b) implement a variant driver emulating a DVSEC capability, or c)
>> directly modify the VMM to tell it where to place the PASID capability.
>> I also don't think we should exclude the possibility that b) could turn
>> into a shared variant driver that knows about multiple devices and has
>> a table of free config space for each.  Option c) is only the last
>> resort if there's not already 12 bytes of contiguous, aligned free
>> space to place a DVSEC capability.  That seems unlikely.
> 
> or b) could be a table in vfio_pci_config.c i.e. kind of making vfio-pci
> as the shared variant driver.
> 
>>
>> At some point we need to define the format and use of this DVSEC.  Do
>> we allow (not require) one at every gap in config space that's at least
>> 12-bytes long and adjust the DVSEC Length to describe longer gaps, or do
> 
> Does PCI spec allows multiple same-type capabilities co-existing?

For DVSEC, it allows. Below is the sentence from PCIe spec.

A single PCI Express Function or RCRB is permitted to contain multiple 
DVSEC Capability structures


>> we use a single DVSEC to describe a table of ranges throughout extended
>> (maybe even conventional) config space?  The former seems easier,
> 
> this might be challenging as the table itself requires a contiguous
> large free block.
> 
>> especially if we expect a device has a large block of free space,
>> enough for multiple emulated capabilities and described by a single
>> DVSEC.  Thanks,
>>

this is a good point. The ATS and PRI capability do not exist in VF as
well. They need to be emulated.

> 
> yes that sounds simpler.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18  0:06           ` Tian, Kevin
@ 2024-04-18  9:03             ` Yi Liu
  2024-04-18 20:37               ` Alex Williamson
  0 siblings, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-18  9:03 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Jason Gunthorpe
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/18 08:06, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Thursday, April 18, 2024 7:02 AM
>>
>> On Wed, 17 Apr 2024 09:20:51 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:
>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Sent: Wednesday, April 17, 2024 1:50 AM
>>>>>
>>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:
>>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>>>>>> Sent: Friday, April 12, 2024 4:21 PM
>>>>>>>
>>>>>>> A userspace VMM is supposed to get the details of the device's
>> PASID
>>>>>>> capability
>>>>>>> and assemble a virtual PASID capability in a proper offset in the
>> virtual
>>>>> PCI
>>>>>>> configuration space. While it is still an open on how to get the
>> available
>>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap
>> chain.
>>>>> For
>>>>>>> now, there are two options to get the available offsets.[2]
>>>>>>>
>>>>>>> - Report the available offsets via ioctl. This requires device-specific
>> logic
>>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the
>>>>> device
>>>>>>>    provide the available offset by DVSEC.
>>>>>>> - Store the available offsets in a static table in userspace VMM.
>> VMM gets
>>>>> the
>>>>>>>    empty offsets from this table.
>>>>>>>
>>>>>>
>>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
>>>>>> VF just for the purpose of reporting a free range in the PCI config
>> space.
>>>>>>
>>>>>> It's easier to do that quirk in userspace.
>>>>>>
>>>>>> But I like Alex's original comment that at least for PF there is no
>> reason
>>>>>> to hide the offset. there could be a flag+field to communicate it. or
>>>>>> if there will be a new variant VF driver for other purposes e.g.
>> migration
>>>>>> it can certainly fill the field too.
>>>>>
>>>>> Yes, since this has been such a sticking point can we get a clean
>>>>> series that just enables it for PF and then come with a solution for
>>>>> VF?
>>>>>
>>>>
>>>> sure but we at least need to reach consensus on a minimal required
>>>> uapi covering both PF/VF to move forward so the user doesn't need
>>>> to touch different contracts for PF vs. VF.
>>>
>>> Do we? The situation where the VMM needs to wholly make a up a PASID
>>> capability seems completely new and seperate from just using an
>>> existing PASID capability as in the PF case.
>>
>> But we don't actually expose the PASID capability on the PF and as
>> argued in path 4/ we can't because it would break existing userspace.
> > Come back to this statement.
> 
> Does 'break' means that legacy Qemu will crash due to a guest write
> to the read-only PASID capability, or just a conceptually functional
> break i.e. non-faithful emulation due to writes being dropped?
> 
> If the latter it's probably not a bad idea to allow exposing the PASID
> capability on the PF as a sane guest shouldn't enable the PASID
> capability w/o seeing vIOMMU supporting PASID. And there is no
> status bit defined in the PASID capability to check back so even
> if an insane guest wants to blindly enable PASID it will naturally
> write and done. The only niche case is that the enable bits are
> defined as RW so ideally reading back those bits should get the
> latest written value. But probably this can be tolerated?
> 
> With that then should we consider exposing the PASID capability
> in PCI config space as the first option? For PF it's simple as how
> other caps are exposed. For VF a variant driver can also fake the
> PASID capability or emulate a DVSEC capability for unused space
> (to motivate the physical implementation so no variant driver is
> required in the future)

If kernel exposes pasid cap for PF same as other caps, and in the meantime
the variant driver chooses to emulate a DVSEC cap, then userspace follows
the below steps to expose pasid cap to VM.

1) Check if a pasid cap is already present in the virtual config space
    read from kernel. If no, but user wants pasid, then goto step 2).
2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
    pasid cap. If yes, goto step 3).
3) Userspace gets an available offset via reading the DVSEC cap.
4) Userspace assembles a pasid cap and inserts it to the vconfig space.

For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
This is a bit different from what we planned at the beginning. But sounds
doable if we want to pursue the staging direction.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18  7:02     ` Yi Liu
@ 2024-04-18 16:23       ` Alex Williamson
  2024-04-18 17:12         ` Jason Gunthorpe
  2024-04-19 13:40         ` Yi Liu
  0 siblings, 2 replies; 72+ messages in thread
From: Alex Williamson @ 2024-04-18 16:23 UTC (permalink / raw)
  To: Yi Liu
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Thu, 18 Apr 2024 15:02:46 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/17 00:03, Alex Williamson wrote:
> > On Fri, 12 Apr 2024 01:21:18 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> >> There is no helpers for user to check if a given ID is allocated or not,
> >> neither a helper to loop all the allocated IDs in an IDA and do something
> >> for cleanup. With the two needs, a helper to get the lowest allocated ID
> >> of a range can help to achieve it.
> >>
> >> Caller can check if a given ID is allocated or not by:
> >> 	int id = 200, rc;
> >>
> >> 	rc = ida_get_lowest(&ida, id, id);
> >> 	if (rc == id)
> >> 		//id 200 is used
> >> 	else
> >> 		//id 200 is not used
> >>
> >> Caller can iterate all allocated IDs by:
> >> 	int id = 0;
> >>
> >> 	while (!ida_is_empty(&pasid_ida)) {
> >> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
> >> 		if (id < 0)
> >> 			break;
> >> 		//anything to do with the allocated ID
> >> 		ida_free(pasid_ida, pasid);
> >> 	}
> >>
> >> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> >> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> ---
> >>   include/linux/idr.h |  1 +
> >>   lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
> >>   2 files changed, 68 insertions(+)
> >>
> >> diff --git a/include/linux/idr.h b/include/linux/idr.h
> >> index da5f5fa4a3a6..1dae71d4a75d 100644
> >> --- a/include/linux/idr.h
> >> +++ b/include/linux/idr.h
> >> @@ -257,6 +257,7 @@ struct ida {
> >>   int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
> >>   void ida_free(struct ida *, unsigned int id);
> >>   void ida_destroy(struct ida *ida);
> >> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
> >>   
> >>   /**
> >>    * ida_alloc() - Allocate an unused ID.
> >> diff --git a/lib/idr.c b/lib/idr.c
> >> index da36054c3ca0..03e461242fe2 100644
> >> --- a/lib/idr.c
> >> +++ b/lib/idr.c
> >> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
> >>   }
> >>   EXPORT_SYMBOL(ida_alloc_range);
> >>   
> >> +/**
> >> + * ida_get_lowest - Get the lowest used ID.
> >> + * @ida: IDA handle.
> >> + * @min: Lowest ID to get.
> >> + * @max: Highest ID to get.
> >> + *
> >> + * Get the lowest used ID between @min and @max, inclusive.  The returned
> >> + * ID will not exceed %INT_MAX, even if @max is larger.
> >> + *
> >> + * Context: Any context. Takes and releases the xa_lock.
> >> + * Return: The lowest used ID, or errno if no used ID is found.
> >> + */
> >> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
> >> +{
> >> +	unsigned long index = min / IDA_BITMAP_BITS;
> >> +	unsigned int offset = min % IDA_BITMAP_BITS;
> >> +	unsigned long *addr, size, bit;
> >> +	unsigned long flags;
> >> +	void *entry;
> >> +	int ret;
> >> +
> >> +	if (min >= INT_MAX)
> >> +		return -EINVAL;
> >> +	if (max >= INT_MAX)
> >> +		max = INT_MAX;
> >> +  
> > 
> > Could these be made consistent with the test in ida_alloc_range(), ie:
> > 
> > 	if ((int)min < 0)
> > 		return -EINVAL;
> > 	if ((int)max < 0)
> > 		max = INT_MAX;
> >   
> 
> sure.
> 
> >> +	xa_lock_irqsave(&ida->xa, flags);
> >> +
> >> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
> >> +	if (!entry) {
> >> +		ret = -ENOTTY;  
> > 
> > -ENOENT?  Same for all below too.  
> 
> I see.
> 
> >> +		goto err_unlock;
> >> +	}
> >> +
> >> +	if (index > min / IDA_BITMAP_BITS)
> >> +		offset = 0;
> >> +	if (index * IDA_BITMAP_BITS + offset > max) {
> >> +		ret = -ENOTTY;
> >> +		goto err_unlock;
> >> +	}
> >> +
> >> +	if (xa_is_value(entry)) {
> >> +		unsigned long tmp = xa_to_value(entry);
> >> +
> >> +		addr = &tmp;
> >> +		size = BITS_PER_XA_VALUE;
> >> +	} else {
> >> +		addr = ((struct ida_bitmap *)entry)->bitmap;
> >> +		size = IDA_BITMAP_BITS;
> >> +	}
> >> +
> >> +	bit = find_next_bit(addr, size, offset);
> >> +
> >> +	xa_unlock_irqrestore(&ida->xa, flags);
> >> +
> >> +	if (bit == size ||
> >> +	    index * IDA_BITMAP_BITS + bit > max)
> >> +		return -ENOTTY;
> >> +
> >> +	return index * IDA_BITMAP_BITS + bit;
> >> +
> >> +err_unlock:
> >> +	xa_unlock_irqrestore(&ida->xa, flags);
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL(ida_get_lowest);  
> > 
> > The API is a bit awkward to me, I wonder if it might be helped with
> > some renaming and wrappers...
> > 
> > int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);  
> 
> ok.
> 
> > bool ida_exists(struct ida *ida, unsigned int id)
> > {
> > 	return ida_find_first_range(ida, id, id) == id;
> > }  
> 
> this does helps in next patch.
> 
> > 
> > int ida_find_first(struct ida *ida)
> > {
> > 	return ida_find_first_range(ida, 0, ~0);
> > }
> >  
> 
> perhaps it can be added in future. This series has two usages. One is to
> check if a given ID is allocated. This can be covered by your ida_exists().
> Another usage is to loop each IDs, do detach and free. This can still use
> the ida_find_first_range() like the example in the commit message. The
> first loop starts from 0, and next would start from the last found ID.
> This may be more efficient than starts from 0 in every loop.
> 
> 
> > _min and _max variations of the latter would align with existing
> > ida_alloc variants, but maybe no need to add them preemptively.  
> 
> yes.
> 
> > Possibly an ida_for_each() could be useful in the use case of
> > disassociating each id, but not required for the brute force iterative
> > method.  Thanks,  
> 
> yep. maybe we can start with the below code, no need for ida_for_each()
> today.
> 
> 
>   	int id = 0;
> 
>   	while (!ida_is_empty(&pasid_ida)) {
>   		id = ida_find_first_range(pasid_ida, id, INT_MAX);

You've actually already justified the _min function here:

static inline int ida_find_first_min(struct ida *ida, unsigned int min)
{
	return ida_find_first_range(ida, min, ~0);
}

Thanks,
Alex

>   		if (unlikely(WARN_ON(id < 0))
> 			break;
>   		iommufd_device_pasid_detach();
>   		ida_free(pasid_ida, pasid);
>   	}
> 
> >   
> >> +
> >>   /**
> >>    * ida_free() - Release an allocated ID.
> >>    * @ida: IDA handle.  
> >   
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-18  0:21         ` Tian, Kevin
  2024-04-18  8:23           ` Yi Liu
@ 2024-04-18 16:34           ` Alex Williamson
  1 sibling, 0 replies; 72+ messages in thread
From: Alex Williamson @ 2024-04-18 16:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, jgg, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On Thu, 18 Apr 2024 00:21:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 18, 2024 4:26 AM
> > 
> > On Wed, 17 Apr 2024 07:09:52 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Wednesday, April 17, 2024 1:57 AM
> > > >
> > > > On Fri, 12 Apr 2024 01:21:21 -0700
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > + */
> > > > > +struct vfio_device_feature_pasid {
> > > > > +	__u16 capabilities;
> > > > > +#define VFIO_DEVICE_PASID_CAP_EXEC	(1 << 0)
> > > > > +#define VFIO_DEVICE_PASID_CAP_PRIV	(1 << 1)
> > > > > +	__u8 width;
> > > > > +	__u8 __reserved;
> > > > > +};  
> > > >
> > > > Building on Kevin's comment on the cover letter, if we could describe
> > > > an offset for emulating a PASID capability, this seems like the place
> > > > we'd do it.  I think we're not doing that because we'd like an in-band
> > > > mechanism for a device to report unused config space, such as a DVSEC
> > > > capability, so that it can be implemented on a physical device.  As
> > > > noted in the commit log here, we'd also prefer not to bloat the kernel
> > > > with more device quirks.
> > > >
> > > > In an ideal world we might be able to jump start support of that DVSEC
> > > > option by emulating the DVSEC capability on top of the PASID capability
> > > > for PFs, but unfortunately the PASID capability is 8 bytes while the
> > > > DVSEC capability is at least 12 bytes, so we can't implement that
> > > > generically either.  
> > >
> > > Yeah, that's a problem.
> > >  
> > > >
> > > > I don't know there's any good solution here or whether there's actually
> > > > any value to the PASID capability on a PF, but do we need to consider
> > > > leaving a field+flag here to describe the offset for that scenario?  
> > >
> > > Yes, I prefer to this way.
> > >  
> > > > Would we then allow variant drivers to take advantage of it?  Does this
> > > > then turn into the quirk that we're trying to avoid in the kernel
> > > > rather than userspace and is that a problem?  Thanks,
> > > >  
> > >
> > > We don't want to proactively pursue quirks in the kernel.
> > >
> > > But if a variant driver exists for other reasons, I don't see why it
> > > should be prohibited from deciding an offset to ease the
> > > userspace. 😊  
> > 
> > At that point we've turned the corner into an arbitrary policy decision
> > that I can't defend.  A "worthy" variant driver can implement something
> > through a side channel vfio API, but implementing that side channel
> > itself is not enough to justify a variant driver?  It doesn't make
> > sense.
> > 
> > Further, if we have a variant driver, why do we need a side channel for
> > the purpose of describing available config space when we expect devices
> > themselves to eventually describe the same through a DVSEC capability?
> > The purpose of enabling variant drivers is to enhance the functionality
> > of the device.  Adding an emulated DVSEC capability seems like a valid
> > enhancement to justify a variant driver to me.
> > 
> > So the more I think about it, it would be easy to add something here
> > that hints a location for an emulated PASID capability in the VMM, but
> > it would also be counterproductive to an end goal of having a DVSEC
> > capability that describes unused config space.  The very narrow scope
> > where that side-band channel would be useful is an unknown PF device
> > which doesn't implement a DVSEC capability and without intervention
> > simply behaves as it always has, without PASID support.
> > 
> > A vendor desiring such support can a) implement DVSEC in the hardware,
> > b) implement a variant driver emulating a DVSEC capability, or c)
> > directly modify the VMM to tell it where to place the PASID capability.
> > I also don't think we should exclude the possibility that b) could turn
> > into a shared variant driver that knows about multiple devices and has
> > a table of free config space for each.  Option c) is only the last
> > resort if there's not already 12 bytes of contiguous, aligned free
> > space to place a DVSEC capability.  That seems unlikely.  
> 
> or b) could be a table in vfio_pci_config.c i.e. kind of making vfio-pci
> as the shared variant driver.

We've kind of made the statement that variant drivers should be used
for any device specific quirks rather than further extending vfio-pci.
That's not to say that there couldn't be a shared variant driver that
binds to devices across vendors with device specific knowledge to
insert a DVSEC capability.

> > At some point we need to define the format and use of this DVSEC.  Do
> > we allow (not require) one at every gap in config space that's at least
> > 12-bytes long and adjust the DVSEC Length to describe longer gaps, or do  
> 
> Does PCI spec allows multiple same-type capabilities co-existing?

As Yi notes, yes.

> > we use a single DVSEC to describe a table of ranges throughout extended
> > (maybe even conventional) config space?  The former seems easier,  
> 
> this might be challenging as the table itself requires a contiguous
> large free block.

Yes, but DVSEC is an extended capability, so we have a fair bit of
address space to work with and the table could always collapse to zero
entries to indicate only the DVSEC capability itself is available, so
it's really no different in minimum described size to the other
approach.  Thanks,

Alex

> > especially if we expect a device has a large block of free space,
> > enough for multiple emulated capabilities and described by a single
> > DVSEC.  Thanks,
> >   
> 
> yes that sounds simpler.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18 16:23       ` Alex Williamson
@ 2024-04-18 17:12         ` Jason Gunthorpe
  2024-04-19 13:43           ` Yi Liu
  2024-04-19 13:40         ` Yi Liu
  1 sibling, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-18 17:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, kevin.tian, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:
> > yep. maybe we can start with the below code, no need for ida_for_each()
> > today.
> > 
> > 
> >   	int id = 0;
> > 
> >   	while (!ida_is_empty(&pasid_ida)) {
> >   		id = ida_find_first_range(pasid_ida, id, INT_MAX);
> 
> You've actually already justified the _min function here:
> 
> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> {
> 	return ida_find_first_range(ida, min, ~0);
> }

It should also always start from 0..

Ideally written more like:

while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
  ida_remove(id);
}

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18  9:03             ` Yi Liu
@ 2024-04-18 20:37               ` Alex Williamson
  2024-04-19  5:52                 ` Tian, Kevin
  2024-04-19 13:59                 ` Jason Gunthorpe
  0 siblings, 2 replies; 72+ messages in thread
From: Alex Williamson @ 2024-04-18 20:37 UTC (permalink / raw)
  To: Yi Liu
  Cc: Tian, Kevin, Jason Gunthorpe, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Thu, 18 Apr 2024 17:03:15 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/18 08:06, Tian, Kevin wrote:
> >> From: Alex Williamson <alex.williamson@redhat.com>
> >> Sent: Thursday, April 18, 2024 7:02 AM
> >>
> >> On Wed, 17 Apr 2024 09:20:51 -0300
> >> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>  
> >>> On Wed, Apr 17, 2024 at 07:16:05AM +0000, Tian, Kevin wrote:  
> >>>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>>> Sent: Wednesday, April 17, 2024 1:50 AM
> >>>>>
> >>>>> On Tue, Apr 16, 2024 at 08:38:50AM +0000, Tian, Kevin wrote:  
> >>>>>>> From: Liu, Yi L <yi.l.liu@intel.com>
> >>>>>>> Sent: Friday, April 12, 2024 4:21 PM
> >>>>>>>
> >>>>>>> A userspace VMM is supposed to get the details of the device's  
> >> PASID  
> >>>>>>> capability
> >>>>>>> and assemble a virtual PASID capability in a proper offset in the  
> >> virtual  
> >>>>> PCI  
> >>>>>>> configuration space. While it is still an open on how to get the  
> >> available  
> >>>>>>> offsets. Devices may have hidden bits that are not in the PCI cap  
> >> chain.  
> >>>>> For  
> >>>>>>> now, there are two options to get the available offsets.[2]
> >>>>>>>
> >>>>>>> - Report the available offsets via ioctl. This requires device-specific  
> >> logic  
> >>>>>>>    to provide available offsets. e.g., vfio-pci variant driver. Or may the  
> >>>>> device  
> >>>>>>>    provide the available offset by DVSEC.
> >>>>>>> - Store the available offsets in a static table in userspace VMM.  
> >> VMM gets  
> >>>>> the  
> >>>>>>>    empty offsets from this table.
> >>>>>>>  
> >>>>>>
> >>>>>> I'm not a fan of requesting a variant driver for every PASID-capable
> >>>>>> VF just for the purpose of reporting a free range in the PCI config  
> >> space.  
> >>>>>>
> >>>>>> It's easier to do that quirk in userspace.
> >>>>>>
> >>>>>> But I like Alex's original comment that at least for PF there is no  
> >> reason  
> >>>>>> to hide the offset. there could be a flag+field to communicate it. or
> >>>>>> if there will be a new variant VF driver for other purposes e.g.  
> >> migration  
> >>>>>> it can certainly fill the field too.  
> >>>>>
> >>>>> Yes, since this has been such a sticking point can we get a clean
> >>>>> series that just enables it for PF and then come with a solution for
> >>>>> VF?
> >>>>>  
> >>>>
> >>>> sure but we at least need to reach consensus on a minimal required
> >>>> uapi covering both PF/VF to move forward so the user doesn't need
> >>>> to touch different contracts for PF vs. VF.  
> >>>
> >>> Do we? The situation where the VMM needs to wholly make a up a PASID
> >>> capability seems completely new and seperate from just using an
> >>> existing PASID capability as in the PF case.  
> >>
> >> But we don't actually expose the PASID capability on the PF and as
> >> argued in path 4/ we can't because it would break existing userspace.
> > > Come back to this statement.  
> > 
> > Does 'break' means that legacy Qemu will crash due to a guest write
> > to the read-only PASID capability, or just a conceptually functional
> > break i.e. non-faithful emulation due to writes being dropped?

I expect more the latter.

> > If the latter it's probably not a bad idea to allow exposing the PASID
> > capability on the PF as a sane guest shouldn't enable the PASID
> > capability w/o seeing vIOMMU supporting PASID. And there is no
> > status bit defined in the PASID capability to check back so even
> > if an insane guest wants to blindly enable PASID it will naturally
> > write and done. The only niche case is that the enable bits are
> > defined as RW so ideally reading back those bits should get the
> > latest written value. But probably this can be tolerated?

Some degree of inconsistency is likely tolerated, the guest is unlikely
to check that a RW bit was set or cleared.  How would we virtualize the
control registers for a VF and are they similarly virtualized for a PF
or would we allow the guest to manipulate the physical PASID control
registers?

> > With that then should we consider exposing the PASID capability
> > in PCI config space as the first option? For PF it's simple as how
> > other caps are exposed. For VF a variant driver can also fake the
> > PASID capability or emulate a DVSEC capability for unused space
> > (to motivate the physical implementation so no variant driver is
> > required in the future)  
> 
> If kernel exposes pasid cap for PF same as other caps, and in the meantime
> the variant driver chooses to emulate a DVSEC cap, then userspace follows
> the below steps to expose pasid cap to VM.

If we have a variant driver, why wouldn't it expose an emulated PASID
capability rather than a DVSEC if we're choosing to expose PASID for
PFs?

> 
> 1) Check if a pasid cap is already present in the virtual config space
>     read from kernel. If no, but user wants pasid, then goto step 2).
> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>     pasid cap. If yes, goto step 3).

Why do we need the vfio feature interface if a physical or virtual PASID
capability on the device exposes the same info?

> 3) Userspace gets an available offset via reading the DVSEC cap.

What's the scenario where we'd have a VF wanting to expose PASID
support which doesn't also have a variant driver that could implement a
virtual PASID?

> 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> 
> For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> This is a bit different from what we planned at the beginning. But sounds
> doable if we want to pursue the staging direction.

Seems like if we decide that we can just expose the PASID capability
for a PF then we should just have any VF variant drivers also implement
a virtual PASID capability.  In this case DVSEC would only be used to
provide information for a purely userspace emulation of PASID (in which
case it also wouldn't necessarily need the vfio feature because it
might implicitly know the PASID capabilities of the device).  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18 20:37               ` Alex Williamson
@ 2024-04-19  5:52                 ` Tian, Kevin
  2024-04-19 16:35                   ` Alex Williamson
  2024-04-19 13:59                 ` Jason Gunthorpe
  1 sibling, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-19  5:52 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: Jason Gunthorpe, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 19, 2024 4:38 AM
> 
> On Thu, 18 Apr 2024 17:03:15 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > On 2024/4/18 08:06, Tian, Kevin wrote:
> > >> From: Alex Williamson <alex.williamson@redhat.com>
> > >> Sent: Thursday, April 18, 2024 7:02 AM
> > >>
> > >> But we don't actually expose the PASID capability on the PF and as
> > >> argued in path 4/ we can't because it would break existing userspace.
> > > > Come back to this statement.
> > >
> > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > to the read-only PASID capability, or just a conceptually functional
> > > break i.e. non-faithful emulation due to writes being dropped?
> 
> I expect more the latter.
> 
> > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > capability on the PF as a sane guest shouldn't enable the PASID
> > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > status bit defined in the PASID capability to check back so even
> > > if an insane guest wants to blindly enable PASID it will naturally
> > > write and done. The only niche case is that the enable bits are
> > > defined as RW so ideally reading back those bits should get the
> > > latest written value. But probably this can be tolerated?
> 
> Some degree of inconsistency is likely tolerated, the guest is unlikely
> to check that a RW bit was set or cleared.  How would we virtualize the
> control registers for a VF and are they similarly virtualized for a PF
> or would we allow the guest to manipulate the physical PASID control
> registers?

it's shared so the guest shouldn't be allowed to touch the physical
register.

Even for PF this is virtualized as the physical control is toggled by
the iommu driver today. We discussed before whether there is a
value moving the control to device driver but the conclusion is no.

> 
> > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> >
> > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > This is a bit different from what we planned at the beginning. But sounds
> > doable if we want to pursue the staging direction.
> 
> Seems like if we decide that we can just expose the PASID capability
> for a PF then we should just have any VF variant drivers also implement
> a virtual PASID capability.  In this case DVSEC would only be used to

I'm leaning toward this direction now.

> provide information for a purely userspace emulation of PASID (in which
> case it also wouldn't necessarily need the vfio feature because it
> might implicitly know the PASID capabilities of the device).  Thanks,
> 

that's a good point. Then no new contract is required.

and allowing variant driver to implement a virtual PASID capability
seems also make a room for making a shared variant driver to host
a table of virtual capabilities (both offset and content) for VFs, just
as discussed in patch4 having a shared driver to host a table for DVSEC?

Along this route probably most vendors will just extend the table in
the shared driver, leading to decreased value on DVSEC and question
on its necessity...

then it's back to the quirk-in-kernel approach... but if simple enough
probably not a bad idea to pursue? 😊

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-17 23:02         ` Alex Williamson
  2024-04-18  0:06           ` Tian, Kevin
@ 2024-04-19 13:34           ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 13:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Wed, Apr 17, 2024 at 05:02:16PM -0600, Alex Williamson wrote:

> > > sure but we at least need to reach consensus on a minimal required
> > > uapi covering both PF/VF to move forward so the user doesn't need
> > > to touch different contracts for PF vs. VF.  
> > 
> > Do we? The situation where the VMM needs to wholly make a up a PASID
> > capability seems completely new and seperate from just using an
> > existing PASID capability as in the PF case.
> 
> But we don't actually expose the PASID capability on the PF and as
> argued in path 4/ we can't because it would break existing
> userspace.

Are we sure about that argument? Exposing the PF's PASID cap to qemu
shouldn't break qemu at all - it will just pass it over into a vPCI
function?

Ultimately the guest will observe a vPCI device with a PASID cap. This
is not really so different from plugging in a new PCI device with the
PASID cap into an old HW system that doesn't support PASID.

So does a guest break? I'd expect no - the viommu's created by qemu
should not advertise PASID support already. So the guest can't use
PASID - just like an old HW system.

Is there a known concrete thing that breaks there?

> > If it needs to make another system call or otherwise to do that then
> > that seems fine to do incrementally??
> 
> With PASID currently hidden, VF and PF support really seem too similar
> not to handle them both at the same time.  What's missing is a
> mechanism to describe unused config space where userspace can implement
> an emulated PASID capability. 

Yes, sure the unused config space is a great idea. I thought we were
talking about doing the fixup in userspace, but a kernel solution is
good too.

But also if we are set on the dvsec, in kernel or user, then we can
move ahead with the core PASID enablement immediately?

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18 16:23       ` Alex Williamson
  2024-04-18 17:12         ` Jason Gunthorpe
@ 2024-04-19 13:40         ` Yi Liu
  1 sibling, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-19 13:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: jgg, kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/19 00:23, Alex Williamson wrote:
> On Thu, 18 Apr 2024 15:02:46 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/4/17 00:03, Alex Williamson wrote:
>>> On Fri, 12 Apr 2024 01:21:18 -0700
>>> Yi Liu <yi.l.liu@intel.com> wrote:
>>>    
>>>> There is no helpers for user to check if a given ID is allocated or not,
>>>> neither a helper to loop all the allocated IDs in an IDA and do something
>>>> for cleanup. With the two needs, a helper to get the lowest allocated ID
>>>> of a range can help to achieve it.
>>>>
>>>> Caller can check if a given ID is allocated or not by:
>>>> 	int id = 200, rc;
>>>>
>>>> 	rc = ida_get_lowest(&ida, id, id);
>>>> 	if (rc == id)
>>>> 		//id 200 is used
>>>> 	else
>>>> 		//id 200 is not used
>>>>
>>>> Caller can iterate all allocated IDs by:
>>>> 	int id = 0;
>>>>
>>>> 	while (!ida_is_empty(&pasid_ida)) {
>>>> 		id = ida_get_lowest(pasid_ida, id, INT_MAX);
>>>> 		if (id < 0)
>>>> 			break;
>>>> 		//anything to do with the allocated ID
>>>> 		ida_free(pasid_ida, pasid);
>>>> 	}
>>>>
>>>> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> ---
>>>>    include/linux/idr.h |  1 +
>>>>    lib/idr.c           | 67 +++++++++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 68 insertions(+)
>>>>
>>>> diff --git a/include/linux/idr.h b/include/linux/idr.h
>>>> index da5f5fa4a3a6..1dae71d4a75d 100644
>>>> --- a/include/linux/idr.h
>>>> +++ b/include/linux/idr.h
>>>> @@ -257,6 +257,7 @@ struct ida {
>>>>    int ida_alloc_range(struct ida *, unsigned int min, unsigned int max, gfp_t);
>>>>    void ida_free(struct ida *, unsigned int id);
>>>>    void ida_destroy(struct ida *ida);
>>>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max);
>>>>    
>>>>    /**
>>>>     * ida_alloc() - Allocate an unused ID.
>>>> diff --git a/lib/idr.c b/lib/idr.c
>>>> index da36054c3ca0..03e461242fe2 100644
>>>> --- a/lib/idr.c
>>>> +++ b/lib/idr.c
>>>> @@ -476,6 +476,73 @@ int ida_alloc_range(struct ida *ida, unsigned int min, unsigned int max,
>>>>    }
>>>>    EXPORT_SYMBOL(ida_alloc_range);
>>>>    
>>>> +/**
>>>> + * ida_get_lowest - Get the lowest used ID.
>>>> + * @ida: IDA handle.
>>>> + * @min: Lowest ID to get.
>>>> + * @max: Highest ID to get.
>>>> + *
>>>> + * Get the lowest used ID between @min and @max, inclusive.  The returned
>>>> + * ID will not exceed %INT_MAX, even if @max is larger.
>>>> + *
>>>> + * Context: Any context. Takes and releases the xa_lock.
>>>> + * Return: The lowest used ID, or errno if no used ID is found.
>>>> + */
>>>> +int ida_get_lowest(struct ida *ida, unsigned int min, unsigned int max)
>>>> +{
>>>> +	unsigned long index = min / IDA_BITMAP_BITS;
>>>> +	unsigned int offset = min % IDA_BITMAP_BITS;
>>>> +	unsigned long *addr, size, bit;
>>>> +	unsigned long flags;
>>>> +	void *entry;
>>>> +	int ret;
>>>> +
>>>> +	if (min >= INT_MAX)
>>>> +		return -EINVAL;
>>>> +	if (max >= INT_MAX)
>>>> +		max = INT_MAX;
>>>> +
>>>
>>> Could these be made consistent with the test in ida_alloc_range(), ie:
>>>
>>> 	if ((int)min < 0)
>>> 		return -EINVAL;
>>> 	if ((int)max < 0)
>>> 		max = INT_MAX;
>>>    
>>
>> sure.
>>
>>>> +	xa_lock_irqsave(&ida->xa, flags);
>>>> +
>>>> +	entry = xa_find(&ida->xa, &index, max / IDA_BITMAP_BITS, XA_PRESENT);
>>>> +	if (!entry) {
>>>> +		ret = -ENOTTY;
>>>
>>> -ENOENT?  Same for all below too.
>>
>> I see.
>>
>>>> +		goto err_unlock;
>>>> +	}
>>>> +
>>>> +	if (index > min / IDA_BITMAP_BITS)
>>>> +		offset = 0;
>>>> +	if (index * IDA_BITMAP_BITS + offset > max) {
>>>> +		ret = -ENOTTY;
>>>> +		goto err_unlock;
>>>> +	}
>>>> +
>>>> +	if (xa_is_value(entry)) {
>>>> +		unsigned long tmp = xa_to_value(entry);
>>>> +
>>>> +		addr = &tmp;
>>>> +		size = BITS_PER_XA_VALUE;
>>>> +	} else {
>>>> +		addr = ((struct ida_bitmap *)entry)->bitmap;
>>>> +		size = IDA_BITMAP_BITS;
>>>> +	}
>>>> +
>>>> +	bit = find_next_bit(addr, size, offset);
>>>> +
>>>> +	xa_unlock_irqrestore(&ida->xa, flags);
>>>> +
>>>> +	if (bit == size ||
>>>> +	    index * IDA_BITMAP_BITS + bit > max)
>>>> +		return -ENOTTY;
>>>> +
>>>> +	return index * IDA_BITMAP_BITS + bit;
>>>> +
>>>> +err_unlock:
>>>> +	xa_unlock_irqrestore(&ida->xa, flags);
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(ida_get_lowest);
>>>
>>> The API is a bit awkward to me, I wonder if it might be helped with
>>> some renaming and wrappers...
>>>
>>> int ida_find_first_range(struct ida *ida, unsigned int min, unsigned int max);
>>
>> ok.
>>
>>> bool ida_exists(struct ida *ida, unsigned int id)
>>> {
>>> 	return ida_find_first_range(ida, id, id) == id;
>>> }
>>
>> this does helps in next patch.
>>
>>>
>>> int ida_find_first(struct ida *ida)
>>> {
>>> 	return ida_find_first_range(ida, 0, ~0);
>>> }
>>>   
>>
>> perhaps it can be added in future. This series has two usages. One is to
>> check if a given ID is allocated. This can be covered by your ida_exists().
>> Another usage is to loop each IDs, do detach and free. This can still use
>> the ida_find_first_range() like the example in the commit message. The
>> first loop starts from 0, and next would start from the last found ID.
>> This may be more efficient than starts from 0 in every loop.
>>
>>
>>> _min and _max variations of the latter would align with existing
>>> ida_alloc variants, but maybe no need to add them preemptively.
>>
>> yes.
>>
>>> Possibly an ida_for_each() could be useful in the use case of
>>> disassociating each id, but not required for the brute force iterative
>>> method.  Thanks,
>>
>> yep. maybe we can start with the below code, no need for ida_for_each()
>> today.
>>
>>
>>    	int id = 0;
>>
>>    	while (!ida_is_empty(&pasid_ida)) {
>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);
> 
> You've actually already justified the _min function here:
> 
> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> {
> 	return ida_find_first_range(ida, min, ~0);
> }

aha, I see. :)

> 
>>    		if (unlikely(WARN_ON(id < 0))
>> 			break;
>>    		iommufd_device_pasid_detach();
>>    		ida_free(pasid_ida, pasid);
>>    	}
>>
>>>    
>>>> +
>>>>    /**
>>>>     * ida_free() - Release an allocated ID.
>>>>     * @ida: IDA handle.
>>>    
>>
> 
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-18 17:12         ` Jason Gunthorpe
@ 2024-04-19 13:43           ` Yi Liu
  2024-04-19 13:55             ` Alex Williamson
  0 siblings, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-19 13:43 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/19 01:12, Jason Gunthorpe wrote:
> On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:
>>> yep. maybe we can start with the below code, no need for ida_for_each()
>>> today.
>>>
>>>
>>>    	int id = 0;
>>>
>>>    	while (!ida_is_empty(&pasid_ida)) {
>>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);
>>
>> You've actually already justified the _min function here:
>>
>> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
>> {
>> 	return ida_find_first_range(ida, min, ~0);
>> }
> 
> It should also always start from 0..

any special reason to always start from 0? Here we want to loop all the
IDs, and remove them. In this usage, it should be more efficient if we
start from the last found ID.

> Ideally written more like:
> 
> while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
>    ida_remove(id);
> }

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-19 13:43           ` Yi Liu
@ 2024-04-19 13:55             ` Alex Williamson
  2024-04-19 14:00               ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-19 13:55 UTC (permalink / raw)
  To: Yi Liu
  Cc: Jason Gunthorpe, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan, Matthew Wilcox

On Fri, 19 Apr 2024 21:43:17 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/19 01:12, Jason Gunthorpe wrote:
> > On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:  
> >>> yep. maybe we can start with the below code, no need for ida_for_each()
> >>> today.
> >>>
> >>>
> >>>    	int id = 0;
> >>>
> >>>    	while (!ida_is_empty(&pasid_ida)) {
> >>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);  
> >>
> >> You've actually already justified the _min function here:
> >>
> >> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> >> {
> >> 	return ida_find_first_range(ida, min, ~0);
> >> }  
> > 
> > It should also always start from 0..  
> 
> any special reason to always start from 0? Here we want to loop all the
> IDs, and remove them. In this usage, it should be more efficient if we
> start from the last found ID.

In the above version, there's a possibility of an infinite loop, in the
below there's not.  I don't think the infinite loop is actually
reachable, but given the xarray backend to ida I'm not sure you're
gaining much to restart after the previously found id either.  Thanks,

Alex

> > Ideally written more like:
> > 
> > while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
> >    ida_remove(id);
> > }  
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-18 20:37               ` Alex Williamson
  2024-04-19  5:52                 ` Tian, Kevin
@ 2024-04-19 13:59                 ` Jason Gunthorpe
  2024-04-23  7:58                   ` Yi Liu
  1 sibling, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 13:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, Tian, Kevin, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:

> Some degree of inconsistency is likely tolerated, the guest is unlikely
> to check that a RW bit was set or cleared.  How would we virtualize the
> control registers for a VF and are they similarly virtualized for a PF
> or would we allow the guest to manipulate the physical PASID control
> registers?

No, the OS owns the physical PASID control. If the platform IOMMU
knows how to parse PASID then PASID support is turned on and left on
at boot time.

There should be no guest visible difference to not supporting global
PASID disable, and we can't even implement it for VFs anyhow.

Same sort of argument for ATS/etc

> > If kernel exposes pasid cap for PF same as other caps, and in the meantime
> > the variant driver chooses to emulate a DVSEC cap, then userspace follows
> > the below steps to expose pasid cap to VM.
> 
> If we have a variant driver, why wouldn't it expose an emulated PASID
> capability rather than a DVSEC if we're choosing to expose PASID for
> PFs?

Indeed, also an option. Supplying the DVSEC is probably simpler and
addresses other synthesized capability blocks in future. VMM is a
better place to build various synthetic blocks in general, IMHO.

New VMM's could parse the PF PASID cap and add it to its list of "free
space"

We may also be overdoing it here..

Maybe if the VMM wants to enable PASID we should flip the logic and
the VMM should assume that unused config space is safe to use. Only
devices that violate that rule need to join an ID list and provide a
DVSEC/free space list/etc.

I'm guessing that list will be pretty small and hopefully will not
grow. It is easy and better for future devices to wrap their hidden
registers in a private DVSEC.

Then I'd suggest just writing the special list in a text file and
leaving it in the VMM side.. Users can adjust the text file right away
if they have old and troublesome devices and all VMMs can share it.

> > 1) Check if a pasid cap is already present in the virtual config space
> >     read from kernel. If no, but user wants pasid, then goto step 2).
> > 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
> >     pasid cap. If yes, goto step 3).
> 
> Why do we need the vfio feature interface if a physical or virtual PASID
> capability on the device exposes the same info?

Still need to check if the platform, os, iommu, etc are all OK with
enabling PASID support before the viommu advertises it.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-19 13:55             ` Alex Williamson
@ 2024-04-19 14:00               ` Jason Gunthorpe
  2024-04-23  7:19                 ` Yi Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 14:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Yi Liu, kevin.tian, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On Fri, Apr 19, 2024 at 07:55:04AM -0600, Alex Williamson wrote:
> On Fri, 19 Apr 2024 21:43:17 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > On 2024/4/19 01:12, Jason Gunthorpe wrote:
> > > On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:  
> > >>> yep. maybe we can start with the below code, no need for ida_for_each()
> > >>> today.
> > >>>
> > >>>
> > >>>    	int id = 0;
> > >>>
> > >>>    	while (!ida_is_empty(&pasid_ida)) {
> > >>>    		id = ida_find_first_range(pasid_ida, id, INT_MAX);  
> > >>
> > >> You've actually already justified the _min function here:
> > >>
> > >> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
> > >> {
> > >> 	return ida_find_first_range(ida, min, ~0);
> > >> }  
> > > 
> > > It should also always start from 0..  
> > 
> > any special reason to always start from 0? Here we want to loop all the
> > IDs, and remove them. In this usage, it should be more efficient if we
> > start from the last found ID.
> 
> In the above version, there's a possibility of an infinite loop, in the
> below there's not.  I don't think the infinite loop is actually
> reachable, but given the xarray backend to ida I'm not sure you're
> gaining much to restart after the previously found id either.  Thanks,

Right, there is no performance win on xarray and it only risks an
infinite loop compared to:

> > > while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
> > >    ida_remove(id);
> > > }  

Which does not by construction

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-19  5:52                 ` Tian, Kevin
@ 2024-04-19 16:35                   ` Alex Williamson
  2024-04-23  7:43                     ` Tian, Kevin
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-19 16:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, Jason Gunthorpe, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Fri, 19 Apr 2024 05:52:01 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 19, 2024 4:38 AM
> > 
> > On Thu, 18 Apr 2024 17:03:15 +0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > On 2024/4/18 08:06, Tian, Kevin wrote:  
> > > >> From: Alex Williamson <alex.williamson@redhat.com>
> > > >> Sent: Thursday, April 18, 2024 7:02 AM
> > > >>
> > > >> But we don't actually expose the PASID capability on the PF and as
> > > >> argued in path 4/ we can't because it would break existing userspace.
> > > > > Come back to this statement.  
> > > >
> > > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > > to the read-only PASID capability, or just a conceptually functional
> > > > break i.e. non-faithful emulation due to writes being dropped?  
> > 
> > I expect more the latter.
> >   
> > > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > > capability on the PF as a sane guest shouldn't enable the PASID
> > > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > > status bit defined in the PASID capability to check back so even
> > > > if an insane guest wants to blindly enable PASID it will naturally
> > > > write and done. The only niche case is that the enable bits are
> > > > defined as RW so ideally reading back those bits should get the
> > > > latest written value. But probably this can be tolerated?  
> > 
> > Some degree of inconsistency is likely tolerated, the guest is unlikely
> > to check that a RW bit was set or cleared.  How would we virtualize the
> > control registers for a VF and are they similarly virtualized for a PF
> > or would we allow the guest to manipulate the physical PASID control
> > registers?  
> 
> it's shared so the guest shouldn't be allowed to touch the physical
> register.
> 
> Even for PF this is virtualized as the physical control is toggled by
> the iommu driver today. We discussed before whether there is a
> value moving the control to device driver but the conclusion is no.

So in both cases we virtualize the PASID bits in the vfio variant
driver in order to maintain spec compliant behavior of the register
(ie. the control bits are RW with no underlying hardware effect and
capability bits only reflect the features enabled by the host in the
control register)?

> > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> > >
> > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > > This is a bit different from what we planned at the beginning. But sounds
> > > doable if we want to pursue the staging direction.  
> > 
> > Seems like if we decide that we can just expose the PASID capability
> > for a PF then we should just have any VF variant drivers also implement
> > a virtual PASID capability.  In this case DVSEC would only be used to  
> 
> I'm leaning toward this direction now.
> 
> > provide information for a purely userspace emulation of PASID (in which
> > case it also wouldn't necessarily need the vfio feature because it
> > might implicitly know the PASID capabilities of the device).  Thanks,
> >   
> 
> that's a good point. Then no new contract is required.
> 
> and allowing variant driver to implement a virtual PASID capability
> seems also make a room for making a shared variant driver to host
> a table of virtual capabilities (both offset and content) for VFs, just
> as discussed in patch4 having a shared driver to host a table for DVSEC?

Yes, vfio-pci-core would support virtualizing the PF PASID capability
mapped 1:1 at the physical PASID location.  We should architect that
support to be easily reused for a driver provided offset for the VF use
case and then we'd need to decide if a lookup table to associate an
offset to a VF vendor:device warrants a variant driver (which could be
shared by multiple devices) or if we'd accept that into vfio-pci-core.

> Along this route probably most vendors will just extend the table in
> the shared driver, leading to decreased value on DVSEC and question
> on its necessity...
> 
> then it's back to the quirk-in-kernel approach... but if simple enough
> probably not a bad idea to pursue? 😊

A DVSEC to express unused config space could still support a generic
vfio-pci-core or variant driver implementation of PASID virtualization.
The table lookup would provide a device-specific quirk to a base
implementation of carving it from DVSEC reported free space.

The question of whether it should be in kernel or userspace is
difficult.  There are certainly other capabilities where vfio-pci
exposes RW registers as read-only and we rely on the userspace VMM to
emulate them.  We could consider this one of those cases so long as the
change of exposing PASID as a read-only capability is tolerated for old
QEMU, new kernel.

Then come VFs.  AFAIK, it would not be possible for an unprivileged
QEMU to inspect the PASID state for the PF.  Therefore I think vfio
needs to provide that information either in-band (ie. emulated PASID) or
out-of-band (device feature).  At that point, there's some kernel code
regardless, which leans me towards virtualizing in the kernel.  I'd
welcome a complete, coherent proposal that could be done in userspace
though.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/4] ida: Add ida_get_lowest()
  2024-04-19 14:00               ` Jason Gunthorpe
@ 2024-04-23  7:19                 ` Yi Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-23  7:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: kevin.tian, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, zhenzhong.duan, jacob.jun.pan,
	Matthew Wilcox

On 2024/4/19 22:00, Jason Gunthorpe wrote:
> On Fri, Apr 19, 2024 at 07:55:04AM -0600, Alex Williamson wrote:
>> On Fri, 19 Apr 2024 21:43:17 +0800
>> Yi Liu <yi.l.liu@intel.com> wrote:
>>
>>> On 2024/4/19 01:12, Jason Gunthorpe wrote:
>>>> On Thu, Apr 18, 2024 at 10:23:14AM -0600, Alex Williamson wrote:
>>>>>> yep. maybe we can start with the below code, no need for ida_for_each()
>>>>>> today.
>>>>>>
>>>>>>
>>>>>>     	int id = 0;
>>>>>>
>>>>>>     	while (!ida_is_empty(&pasid_ida)) {
>>>>>>     		id = ida_find_first_range(pasid_ida, id, INT_MAX);
>>>>>
>>>>> You've actually already justified the _min function here:
>>>>>
>>>>> static inline int ida_find_first_min(struct ida *ida, unsigned int min)
>>>>> {
>>>>> 	return ida_find_first_range(ida, min, ~0);
>>>>> }
>>>>
>>>> It should also always start from 0..
>>>
>>> any special reason to always start from 0? Here we want to loop all the
>>> IDs, and remove them. In this usage, it should be more efficient if we
>>> start from the last found ID.
>>
>> In the above version, there's a possibility of an infinite loop, in the
>> below there's not.  I don't think the infinite loop is actually
>> reachable, but given the xarray backend to ida I'm not sure you're
>> gaining much to restart after the previously found id either.  Thanks,
> 
> Right, there is no performance win on xarray and it only risks an
> infinite loop compared to:
> 
>>>> while ((id = ida_find_first(pasid_ida)) != EMPTY_IDA) {
>>>>     ida_remove(id);
>>>> }
> 
> Which does not by construction

thanks, got you two. :) Let's go with below. < 0 should mean
no ID found.

while ((id = ida_find_first(pasid_ida)) < 0) {
     ida_free(id);
}

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-19 16:35                   ` Alex Williamson
@ 2024-04-23  7:43                     ` Tian, Kevin
  2024-04-23 12:01                       ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-23  7:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Liu, Yi L, Jason Gunthorpe, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 20, 2024 12:36 AM
> 
> On Fri, 19 Apr 2024 05:52:01 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 19, 2024 4:38 AM
> > >
> > > On Thu, 18 Apr 2024 17:03:15 +0800
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > On 2024/4/18 08:06, Tian, Kevin wrote:
> > > > >> From: Alex Williamson <alex.williamson@redhat.com>
> > > > >> Sent: Thursday, April 18, 2024 7:02 AM
> > > > >>
> > > > >> But we don't actually expose the PASID capability on the PF and as
> > > > >> argued in path 4/ we can't because it would break existing userspace.
> > > > > > Come back to this statement.
> > > > >
> > > > > Does 'break' means that legacy Qemu will crash due to a guest write
> > > > > to the read-only PASID capability, or just a conceptually functional
> > > > > break i.e. non-faithful emulation due to writes being dropped?
> > >
> > > I expect more the latter.
> > >
> > > > > If the latter it's probably not a bad idea to allow exposing the PASID
> > > > > capability on the PF as a sane guest shouldn't enable the PASID
> > > > > capability w/o seeing vIOMMU supporting PASID. And there is no
> > > > > status bit defined in the PASID capability to check back so even
> > > > > if an insane guest wants to blindly enable PASID it will naturally
> > > > > write and done. The only niche case is that the enable bits are
> > > > > defined as RW so ideally reading back those bits should get the
> > > > > latest written value. But probably this can be tolerated?
> > >
> > > Some degree of inconsistency is likely tolerated, the guest is unlikely
> > > to check that a RW bit was set or cleared.  How would we virtualize the
> > > control registers for a VF and are they similarly virtualized for a PF
> > > or would we allow the guest to manipulate the physical PASID control
> > > registers?
> >
> > it's shared so the guest shouldn't be allowed to touch the physical
> > register.
> >
> > Even for PF this is virtualized as the physical control is toggled by
> > the iommu driver today. We discussed before whether there is a
> > value moving the control to device driver but the conclusion is no.
> 
> So in both cases we virtualize the PASID bits in the vfio variant
> driver in order to maintain spec compliant behavior of the register
> (ie. the control bits are RW with no underlying hardware effect and
> capability bits only reflect the features enabled by the host in the
> control register)?

yes

> 
> > > > 4) Userspace assembles a pasid cap and inserts it to the vconfig space.
> > > >
> > > > For PF, step 1) is enough. For VF, it needs to go through all the 4 steps.
> > > > This is a bit different from what we planned at the beginning. But
> sounds
> > > > doable if we want to pursue the staging direction.
> > >
> > > Seems like if we decide that we can just expose the PASID capability
> > > for a PF then we should just have any VF variant drivers also implement
> > > a virtual PASID capability.  In this case DVSEC would only be used to
> >
> > I'm leaning toward this direction now.
> >
> > > provide information for a purely userspace emulation of PASID (in which
> > > case it also wouldn't necessarily need the vfio feature because it
> > > might implicitly know the PASID capabilities of the device).  Thanks,
> > >
> >
> > that's a good point. Then no new contract is required.
> >
> > and allowing variant driver to implement a virtual PASID capability
> > seems also make a room for making a shared variant driver to host
> > a table of virtual capabilities (both offset and content) for VFs, just
> > as discussed in patch4 having a shared driver to host a table for DVSEC?
> 
> Yes, vfio-pci-core would support virtualizing the PF PASID capability
> mapped 1:1 at the physical PASID location.  We should architect that
> support to be easily reused for a driver provided offset for the VF use
> case and then we'd need to decide if a lookup table to associate an
> offset to a VF vendor:device warrants a variant driver (which could be
> shared by multiple devices) or if we'd accept that into vfio-pci-core.

yes

> 
> > Along this route probably most vendors will just extend the table in
> > the shared driver, leading to decreased value on DVSEC and question
> > on its necessity...
> >
> > then it's back to the quirk-in-kernel approach... but if simple enough
> > probably not a bad idea to pursue? 😊
> 
> A DVSEC to express unused config space could still support a generic
> vfio-pci-core or variant driver implementation of PASID virtualization.
> The table lookup would provide a device-specific quirk to a base
> implementation of carving it from DVSEC reported free space.
> 
> The question of whether it should be in kernel or userspace is
> difficult.  There are certainly other capabilities where vfio-pci
> exposes RW registers as read-only and we rely on the userspace VMM to
> emulate them.  We could consider this one of those cases so long as the
> change of exposing PASID as a read-only capability is tolerated for old
> QEMU, new kernel.

old QEMU/new kernel seems OK according to discussions in this thread.

> 
> Then come VFs.  AFAIK, it would not be possible for an unprivileged
> QEMU to inspect the PASID state for the PF.  Therefore I think vfio
> needs to provide that information either in-band (ie. emulated PASID) or
> out-of-band (device feature).  At that point, there's some kernel code
> regardless, which leans me towards virtualizing in the kernel.  I'd
> welcome a complete, coherent proposal that could be done in userspace
> though.  Thanks,
> 

I'm not sure how userspace can fully handle this w/o certain assistance
from the kernel.

So I kind of agree that emulated PASID capability is probably the only
contract which the kernel should provide:
  - mapped 1:1 at the physical location, or
  - constructed at an offset according to DVSEC, or
  - constructed at an offset according to a look-up table

The VMM always scans the vfio pci config space to expose vPASID.

Then the remaining open is what VMM could do when a VF supports
PASID but unfortunately it's not reported by vfio. W/o the capability
of inspecting the PASID state of PF, probably the only feasible option
is to maintain a look-up table in VMM itself and assumes the kernel
always enables the PASID cap on PF.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-19 13:59                 ` Jason Gunthorpe
@ 2024-04-23  7:58                   ` Yi Liu
  2024-04-23 12:05                     ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-23  7:58 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Tian, Kevin, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/19 21:59, Jason Gunthorpe wrote:
> On Thu, Apr 18, 2024 at 02:37:47PM -0600, Alex Williamson wrote:
> 
>> Some degree of inconsistency is likely tolerated, the guest is unlikely
>> to check that a RW bit was set or cleared.  How would we virtualize the
>> control registers for a VF and are they similarly virtualized for a PF
>> or would we allow the guest to manipulate the physical PASID control
>> registers?
> 
> No, the OS owns the physical PASID control. If the platform IOMMU
> knows how to parse PASID then PASID support is turned on and left on
> at boot time.

I think you mean host os. right?

> There should be no guest visible difference to not supporting global
> PASID disable, and we can't even implement it for VFs anyhow.
> 
> Same sort of argument for ATS/etc
> 
>>> If kernel exposes pasid cap for PF same as other caps, and in the meantime
>>> the variant driver chooses to emulate a DVSEC cap, then userspace follows
>>> the below steps to expose pasid cap to VM.
>>
>> If we have a variant driver, why wouldn't it expose an emulated PASID
>> capability rather than a DVSEC if we're choosing to expose PASID for
>> PFs?
> 
> Indeed, also an option. Supplying the DVSEC is probably simpler and
> addresses other synthesized capability blocks in future. VMM is a
> better place to build various synthetic blocks in general, IMHO.
> 
> New VMM's could parse the PF PASID cap and add it to its list of "free
> space"
> 
> We may also be overdoing it here..
> 
> Maybe if the VMM wants to enable PASID we should flip the logic and
> the VMM should assume that unused config space is safe to use. Only
> devices that violate that rule need to join an ID list and provide a
> DVSEC/free space list/etc.

So, if the kernel decides to hide a specific physical capability, the
space of this capability would be considered as free to use as well.
is it?

> I'm guessing that list will be pretty small and hopefully will not
> grow.

any channel to collect this kind of info? :)

> It is easy and better for future devices to wrap their hidden
> registers in a private DVSEC.

hmmm, do you mean include the registers a DVSEC hence userspace can
work out the free space by iterating the cap chain? or still mean
indicating the free spaces by DVSEC? I guess the prior one.

> Then I'd suggest just writing the special list in a text file and
> leaving it in the VMM side.. Users can adjust the text file right away
> if they have old and troublesome devices and all VMMs can share it.

So for the existing devices that have both pasid cap and hidden registers,
userspace should add them in the special list, and work out the free space
by referring the file. While for the devices that only have pasid cap, or
have the hidden register in a DVSEC, userspace finds a free space by
iterating the cap chain. This seems to be general for today and future.

>>> 1) Check if a pasid cap is already present in the virtual config space
>>>      read from kernel. If no, but user wants pasid, then goto step 2).
>>> 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
>>>      pasid cap. If yes, goto step 3).
>>
>> Why do we need the vfio feature interface if a physical or virtual PASID
>> capability on the device exposes the same info?
> 
> Still need to check if the platform, os, iommu, etc are all OK with
> enabling PASID support before the viommu advertises it.

This means we don't expose physical or virtual PASID cap, is it? Otherwise,
host kernel could check if pasid is enabled before exposing the PASID cap.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23  7:43                     ` Tian, Kevin
@ 2024-04-23 12:01                       ` Jason Gunthorpe
  2024-04-23 23:47                         ` Tian, Kevin
  0 siblings, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:01 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> I'm not sure how userspace can fully handle this w/o certain assistance
> from the kernel.
> 
> So I kind of agree that emulated PASID capability is probably the only
> contract which the kernel should provide:
>   - mapped 1:1 at the physical location, or
>   - constructed at an offset according to DVSEC, or
>   - constructed at an offset according to a look-up table
> 
> The VMM always scans the vfio pci config space to expose vPASID.
> 
> Then the remaining open is what VMM could do when a VF supports
> PASID but unfortunately it's not reported by vfio. W/o the capability
> of inspecting the PASID state of PF, probably the only feasible option
> is to maintain a look-up table in VMM itself and assumes the kernel
> always enables the PASID cap on PF.

I'm still not sure I like doing this in the kernel - we need to do the
same sort of thing for ATS too, right?

It feels simpler if the indicates if PASID and ATS can be supported
and userspace builds the capability blocks.

There are migration considerations too - the blocks need to be
migrated over and end up in the same place as well..

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23  7:58                   ` Yi Liu
@ 2024-04-23 12:05                     ` Jason Gunthorpe
  0 siblings, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:05 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Tian, Kevin, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Tue, Apr 23, 2024 at 03:58:17PM +0800, Yi Liu wrote:
> > > > 1) Check if a pasid cap is already present in the virtual config space
> > > >      read from kernel. If no, but user wants pasid, then goto step 2).
> > > > 2) Userspace invokes VFIO_DEVICE_FETURE to check if the device support
> > > >      pasid cap. If yes, goto step 3).
> > > 
> > > Why do we need the vfio feature interface if a physical or virtual PASID
> > > capability on the device exposes the same info?
> > 
> > Still need to check if the platform, os, iommu, etc are all OK with
> > enabling PASID support before the viommu advertises it.
> 
> This means we don't expose physical or virtual PASID cap, is it?

Yeah keep hiding both still works. Some kind of test to see if PASID
and ATS are supportable on the device. Probably via
IOMMUFD_CMD_GET_HW_INFO

If they are supported then the VMM will find empty space and generate
the missing caps. The VMM will have means to fix the location during
migration.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
  2024-04-16  9:40   ` Tian, Kevin
  2024-04-16 17:57   ` Alex Williamson
@ 2024-04-23 12:39   ` Jason Gunthorpe
  2024-04-24  0:24     ` Tian, Kevin
  2 siblings, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:39 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan

On Fri, Apr 12, 2024 at 01:21:21AM -0700, Yi Liu wrote:
> Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> other PCI capabilities, PASID capability is going to be reported to user by
> VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by it.
> This is a bit different from the other capabilities which are reported to
> userspace when the user reads the device's PCI configuration space. There
> are two reasons for this.

I'm thinking this probably does not belong in VFIO, iommufd should
report what the device, driver and OS is able to do with this
device. PASID support is at least 50% an iommu property too.

This is a seperate issue to forming the config space.

I didn't notice anything about SIOV in this, are we tackling it later?

IIRC we need the vIOMMU to specify a vPASID during attach and somehow
that gets mapped into a pPASID and synchronized with the KVM ENQCMD
translation?

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
  2024-04-16  9:01   ` Tian, Kevin
@ 2024-04-23 12:43   ` Jason Gunthorpe
  2024-04-24  0:33     ` Tian, Kevin
  2024-04-24  4:48     ` Yi Liu
  1 sibling, 2 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:43 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan, Matthew Wilcox

On Fri, Apr 12, 2024 at 01:21:19AM -0700, Yi Liu wrote:
> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
> +					    u32 pasid, u32 *pt_id)
> +{
> +	int rc;
> +
> +	lockdep_assert_held(&vdev->dev_set->lock);
> +
> +	if (WARN_ON(!vdev->iommufd_device))
> +		return -EINVAL;
> +
> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);

A helper inline

    bool ida_is_allocate(&ida, id)

Would be nicer for that

> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index cb5b7f865d58..e0198851ffd2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
>  	.unbind_iommufd	= vfio_iommufd_physical_unbind,
>  	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
>  	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
> +	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
> +	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
>  };

This should be copied into mlx5 and nvgrace-gpu at least as well

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT
  2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
  2024-04-16  9:13   ` Tian, Kevin
@ 2024-04-23 12:45   ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-23 12:45 UTC (permalink / raw)
  To: Yi Liu
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan

On Fri, Apr 12, 2024 at 01:21:20AM -0700, Yi Liu wrote:
> This adds ioctls for the userspace to attach/detach a given pasid of a
> vfio device to/from an IOAS/HWPT.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/device_cdev.c | 51 +++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio.h        |  4 +++
>  drivers/vfio/vfio_main.c   |  8 ++++++
>  include/uapi/linux/vfio.h  | 55 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 118 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23 12:01                       ` Jason Gunthorpe
@ 2024-04-23 23:47                         ` Tian, Kevin
  2024-04-24  0:12                           ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-23 23:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:02 PM
> 
> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > I'm not sure how userspace can fully handle this w/o certain assistance
> > from the kernel.
> >
> > So I kind of agree that emulated PASID capability is probably the only
> > contract which the kernel should provide:
> >   - mapped 1:1 at the physical location, or
> >   - constructed at an offset according to DVSEC, or
> >   - constructed at an offset according to a look-up table
> >
> > The VMM always scans the vfio pci config space to expose vPASID.
> >
> > Then the remaining open is what VMM could do when a VF supports
> > PASID but unfortunately it's not reported by vfio. W/o the capability
> > of inspecting the PASID state of PF, probably the only feasible option
> > is to maintain a look-up table in VMM itself and assumes the kernel
> > always enables the PASID cap on PF.
> 
> I'm still not sure I like doing this in the kernel - we need to do the
> same sort of thing for ATS too, right?

VF is allowed to implement ATS.

PRI has the same problem as PASID.

> 
> It feels simpler if the indicates if PASID and ATS can be supported
> and userspace builds the capability blocks.

this routes back to Alex's original question about using different
interfaces (a device feature vs. PCI PASID cap) for VF and PF.

Are we OK with that divergence?

> 
> There are migration considerations too - the blocks need to be
> migrated over and end up in the same place as well..
> 

Can you elaborate what is the problem with the kernel emulating
the PASID cap in this consideration?

Does it talk about a case where the devices between src/dest are
different versions (but backward compatible) with different unused
space layout and the kernel approach may pick up different offsets
while the VMM can guarantee the same offset?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-23 23:47                         ` Tian, Kevin
@ 2024-04-24  0:12                           ` Jason Gunthorpe
  2024-04-24  2:57                             ` Tian, Kevin
                                               ` (2 more replies)
  0 siblings, 3 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-24  0:12 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 23, 2024 8:02 PM
> > 
> > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > from the kernel.
> > >
> > > So I kind of agree that emulated PASID capability is probably the only
> > > contract which the kernel should provide:
> > >   - mapped 1:1 at the physical location, or
> > >   - constructed at an offset according to DVSEC, or
> > >   - constructed at an offset according to a look-up table
> > >
> > > The VMM always scans the vfio pci config space to expose vPASID.
> > >
> > > Then the remaining open is what VMM could do when a VF supports
> > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > of inspecting the PASID state of PF, probably the only feasible option
> > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > always enables the PASID cap on PF.
> > 
> > I'm still not sure I like doing this in the kernel - we need to do the
> > same sort of thing for ATS too, right?
> 
> VF is allowed to implement ATS.
> 
> PRI has the same problem as PASID.

I'm surprised by this, I would have guessed ATS would be the device
global one, PRI not being per-VF seems problematic??? How do you
disable PRI generation to get a clean shutdown?

> > It feels simpler if the indicates if PASID and ATS can be supported
> > and userspace builds the capability blocks.
> 
> this routes back to Alex's original question about using different
> interfaces (a device feature vs. PCI PASID cap) for VF and PF.

I'm not sure it is different interfaces..

The only reason to pass the PF's PASID cap is to give free space to
the VMM. If we are saying that gaps are free space (excluding a list
of bad devices) then we don't acutally need to do that anymore.

VMM will always create a synthetic PASID cap and kernel will always
suppress a real one.

An iommufd query will indicate if the vIOMMU can support vPASID on
that device.

Same for all the troublesome non-physical caps.

> > There are migration considerations too - the blocks need to be
> > migrated over and end up in the same place as well..
> 
> Can you elaborate what is the problem with the kernel emulating
> the PASID cap in this consideration?

If the kernel changes the algorithm, say it wants to do PASID, PRI,
something_new then it might change the layout

We can't just have the kernel decide without also providing a way for
userspace to say what the right layout actually is. :\

> Does it talk about a case where the devices between src/dest are
> different versions (but backward compatible) with different unused
> space layout and the kernel approach may pick up different offsets
> while the VMM can guarantee the same offset?

That is also a concern where the PCI cap layout may change a bit but
they are still migration compatible, but my bigger worry is that the
kernel just lays out the fake caps in a different way because the
kernel changes.

At least if the VMM is doing this then the VMM can include the
information in its migration scheme and use it to recreate the PCI
layout withotu having to create a bunch of uAPI to do so.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-23 12:39   ` Jason Gunthorpe
@ 2024-04-24  0:24     ` Tian, Kevin
  2024-04-24 13:59       ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-24  0:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:40 PM
> 
> On Fri, Apr 12, 2024 at 01:21:21AM -0700, Yi Liu wrote:
> > Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> > other PCI capabilities, PASID capability is going to be reported to user by
> > VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by
> it.
> > This is a bit different from the other capabilities which are reported to
> > userspace when the user reads the device's PCI configuration space. There
> > are two reasons for this.
> 
> I'm thinking this probably does not belong in VFIO, iommufd should
> report what the device, driver and OS is able to do with this
> device. PASID support is at least 50% an iommu property too.

We have PASID capability in both device side and iommu side.

VFIO is for the former and iommufd is for the latter.

both should report the capability only if that cap exists and is
enabled by OS.

> 
> This is a seperate issue to forming the config space.
> 
> I didn't notice anything about SIOV in this, are we tackling it later?

yes.

> 
> IIRC we need the vIOMMU to specify a vPASID during attach and somehow
> that gets mapped into a pPASID and synchronized with the KVM ENQCMD
> translation?
> 

yes, that is the original plan. More accurately the vfio attach uAPI
is always about a pPASID. The mapping will be added separately to
iommufd and synced with KVM.

But internally we are evaluating whether there is enough value
to justify adding this complexity to the kernel. It's the main burden
in SIOVr1. Given the limited usages very likely we'll only do the
basic SIOV support w/o the vPASID cap...

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-23 12:43   ` Jason Gunthorpe
@ 2024-04-24  0:33     ` Tian, Kevin
  2024-04-24  4:48     ` Yi Liu
  1 sibling, 0 replies; 72+ messages in thread
From: Tian, Kevin @ 2024-04-24  0:33 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: alex.williamson, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun,
	Matthew Wilcox

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 23, 2024 8:44 PM
> 
> On Fri, Apr 12, 2024 at 01:21:19AM -0700, Yi Liu wrote:
> > +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
> > +					    u32 pasid, u32 *pt_id)
> > +{
> > +	int rc;
> > +
> > +	lockdep_assert_held(&vdev->dev_set->lock);
> > +
> > +	if (WARN_ON(!vdev->iommufd_device))
> > +		return -EINVAL;
> > +
> > +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> 
> A helper inline
> 
>     bool ida_is_allocate(&ida, id)
> 
> Would be nicer for that
> 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index cb5b7f865d58..e0198851ffd2 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
> >  	.unbind_iommufd	= vfio_iommufd_physical_unbind,
> >  	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
> >  	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
> > +	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
> > +	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
> >  };
> 
> This should be copied into mlx5 and nvgrace-gpu at least as well
> 

I'd prefer to the driver owners to add them separately. They know
their hardware and can do proper test.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  0:12                           ` Jason Gunthorpe
@ 2024-04-24  2:57                             ` Tian, Kevin
  2024-04-24 12:29                               ` Baolu Lu
  2024-04-24 14:04                               ` Jason Gunthorpe
  2024-04-24  5:19                             ` Tian, Kevin
  2024-04-24 18:24                             ` Alex Williamson
  2 siblings, 2 replies; 72+ messages in thread
From: Tian, Kevin @ 2024-04-24  2:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 8:12 AM
> 
> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > >
> > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
> > > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > > from the kernel.
> > > >
> > > > So I kind of agree that emulated PASID capability is probably the only
> > > > contract which the kernel should provide:
> > > >   - mapped 1:1 at the physical location, or
> > > >   - constructed at an offset according to DVSEC, or
> > > >   - constructed at an offset according to a look-up table
> > > >
> > > > The VMM always scans the vfio pci config space to expose vPASID.
> > > >
> > > > Then the remaining open is what VMM could do when a VF supports
> > > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > > of inspecting the PASID state of PF, probably the only feasible option
> > > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > > always enables the PASID cap on PF.
> > >
> > > I'm still not sure I like doing this in the kernel - we need to do the
> > > same sort of thing for ATS too, right?
> >
> > VF is allowed to implement ATS.
> >
> > PRI has the same problem as PASID.
> 
> I'm surprised by this, I would have guessed ATS would be the device
> global one, PRI not being per-VF seems problematic??? How do you
> disable PRI generation to get a clean shutdown?

Here is what the PCIe spec says:

  For SR-IOV devices, a single Page Request Interface is permitted for
  the PF and is shared between the PF and its associated VFs, in which
  case the PF implements this capability and its VFs must not.

I'll let Baolu chime in for the potential impact to his PRI cleanup
effort, e.g. whether disabling PRI generation is mandatory if the
IOMMU side is already put in a mode auto-responding error to
new PRI request instead of reporting to sw.

But I do see another problem for shared capabilities between PF/VFs.

Now those shared capabilities are enabled/disabled when the PF is
attached to/detached from a domain, w/o counting the shared usage
from VFs.

Looks we have a gap here.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
  2024-04-23 12:43   ` Jason Gunthorpe
  2024-04-24  0:33     ` Tian, Kevin
@ 2024-04-24  4:48     ` Yi Liu
  1 sibling, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-24  4:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: alex.williamson, kevin.tian, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, zhenzhong.duan,
	jacob.jun.pan, Matthew Wilcox



On 2024/4/23 20:43, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 01:21:19AM -0700, Yi Liu wrote:
>> +int vfio_iommufd_physical_pasid_attach_ioas(struct vfio_device *vdev,
>> +					    u32 pasid, u32 *pt_id)
>> +{
>> +	int rc;
>> +
>> +	lockdep_assert_held(&vdev->dev_set->lock);
>> +
>> +	if (WARN_ON(!vdev->iommufd_device))
>> +		return -EINVAL;
>> +
>> +	rc = ida_get_lowest(&vdev->pasids, pasid, pasid);
> 
> A helper inline
> 
>      bool ida_is_allocate(&ida, id)
> 
> Would be nicer for that

ok.

>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index cb5b7f865d58..e0198851ffd2 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -142,6 +142,8 @@ static const struct vfio_device_ops vfio_pci_ops = {
>>   	.unbind_iommufd	= vfio_iommufd_physical_unbind,
>>   	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
>>   	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
>> +	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
>> +	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
>>   };
> 
> This should be copied into mlx5 and nvgrace-gpu at least as well

looks like Kevin has a different idea on it.


-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  0:12                           ` Jason Gunthorpe
  2024-04-24  2:57                             ` Tian, Kevin
@ 2024-04-24  5:19                             ` Tian, Kevin
  2024-04-24 14:15                               ` Jason Gunthorpe
  2024-04-24 18:24                             ` Alex Williamson
  2 siblings, 1 reply; 72+ messages in thread
From: Tian, Kevin @ 2024-04-24  5:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 24, 2024 8:12 AM
> 
> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > >
> > > It feels simpler if the indicates if PASID and ATS can be supported
> > > and userspace builds the capability blocks.
> >
> > this routes back to Alex's original question about using different
> > interfaces (a device feature vs. PCI PASID cap) for VF and PF.
> 
> I'm not sure it is different interfaces..
> 
> The only reason to pass the PF's PASID cap is to give free space to
> the VMM. If we are saying that gaps are free space (excluding a list
> of bad devices) then we don't acutally need to do that anymore.
> 
> VMM will always create a synthetic PASID cap and kernel will always
> suppress a real one.

oh you suggest that there won't even be a 1:1 map for PF!

kind of continue with the device_feature method as this series does.
and it could include all VMM-emulated capabilities which are not
enumerated properly from vfio pci config space.

this interface only reports the availability/features of a capability
but never includes information about offset.

If a device implements DVSEC it will be exposed to the VMM.

Then suppose the VMM will introduce a new cmd parameter to
turn on the emulation of the PASID capability. It's default off so
legacy usages can still work.

Once the parameter is on then the VMM will emulate the PASID
capability by:

  - Locating a free range according to DVSEC, or,
  - Locating a free range from gaps between PCI caps,

If a device is found not working properly then add a fixed offset for 
this device.
 
> 
> An iommufd query will indicate if the vIOMMU can support vPASID on
> that device.
> 
> Same for all the troublesome non-physical caps.
> 
> > > There are migration considerations too - the blocks need to be
> > > migrated over and end up in the same place as well..
> >
> > Can you elaborate what is the problem with the kernel emulating
> > the PASID cap in this consideration?
> 
> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> something_new then it might change the layout
> 
> We can't just have the kernel decide without also providing a way for
> userspace to say what the right layout actually is. :\
> 

emm that's a good point.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  2:57                             ` Tian, Kevin
@ 2024-04-24 12:29                               ` Baolu Lu
  2024-04-24 14:04                               ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Baolu Lu @ 2024-04-24 12:29 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe
  Cc: baolu.lu, Alex Williamson, Liu, Yi L, joro, robin.murphy,
	eric.auger, nicolinc, kvm, chao.p.peng, iommu, Duan, Zhenzhong,
	Pan, Jacob jun

On 2024/4/24 10:57, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, April 24, 2024 8:12 AM
>>
>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>
>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>> from the kernel.
>>>>>
>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>> contract which the kernel should provide:
>>>>>    - mapped 1:1 at the physical location, or
>>>>>    - constructed at an offset according to DVSEC, or
>>>>>    - constructed at an offset according to a look-up table
>>>>>
>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>
>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>> always enables the PASID cap on PF.
>>>>
>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>> same sort of thing for ATS too, right?
>>>
>>> VF is allowed to implement ATS.
>>>
>>> PRI has the same problem as PASID.
>>
>> I'm surprised by this, I would have guessed ATS would be the device
>> global one, PRI not being per-VF seems problematic??? How do you
>> disable PRI generation to get a clean shutdown?
> 
> Here is what the PCIe spec says:
> 
>    For SR-IOV devices, a single Page Request Interface is permitted for
>    the PF and is shared between the PF and its associated VFs, in which
>    case the PF implements this capability and its VFs must not.
> 
> I'll let Baolu chime in for the potential impact to his PRI cleanup
> effort, e.g. whether disabling PRI generation is mandatory if the
> IOMMU side is already put in a mode auto-responding error to
> new PRI request instead of reporting to sw.

The PRI cleanup steps are defined like this:

  * - Disable new PRI reception: Turn off PRI generation in the IOMMU 
hardware
  *   and flush any hardware page request queues. This should be done before
  *   calling into this helper.
  * - Acknowledge all outstanding PRQs to the device: Respond to all 
outstanding
  *   page requests with IOMMU_PAGE_RESP_INVALID, indicating the device 
should
  *   not retry. This helper function handles this.
  * - Disable PRI on the device: After calling this helper, the caller could
  *   then disable PRI on the device.

Disabling PRI on the device is the last step and optional because the
IOMMU is required to support a PRI blocking state and has already been
put in that state at the first step.

For the VF case, it probably is a no-op except for maintaining a
reference count. Once PRI is disabled on all PFs and VFs, it can then be
physically disabled on the PF.

> 
> But I do see another problem for shared capabilities between PF/VFs.
> 
> Now those shared capabilities are enabled/disabled when the PF is
> attached to/detached from a domain, w/o counting the shared usage
> from VFs.
> 
> Looks we have a gap here.

Yes, there's a gap at least for the Intel IOMMU driver. I'll soon fix
this by moving the handling of ATS out of the driver, especially from
the default domain attach/detach paths.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl
  2024-04-24  0:24     ` Tian, Kevin
@ 2024-04-24 13:59       ` Jason Gunthorpe
  0 siblings, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 13:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Liu, Yi L, alex.williamson, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Wed, Apr 24, 2024 at 12:24:19AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, April 23, 2024 8:40 PM
> > 
> > On Fri, Apr 12, 2024 at 01:21:21AM -0700, Yi Liu wrote:
> > > Today, vfio-pci hides the PASID capability of devices from userspace. Unlike
> > > other PCI capabilities, PASID capability is going to be reported to user by
> > > VFIO_DEVICE_FEATURE. Hence userspace could probe PASID capability by
> > it.
> > > This is a bit different from the other capabilities which are reported to
> > > userspace when the user reads the device's PCI configuration space. There
> > > are two reasons for this.
> > 
> > I'm thinking this probably does not belong in VFIO, iommufd should
> > report what the device, driver and OS is able to do with this
> > device. PASID support is at least 50% an iommu property too.
> 
> We have PASID capability in both device side and iommu side.
> 
> VFIO is for the former and iommufd is for the latter.

iommu can do the device side too, we have a device info ioctl after
all.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  2:57                             ` Tian, Kevin
  2024-04-24 12:29                               ` Baolu Lu
@ 2024-04-24 14:04                               ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 14:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Wed, Apr 24, 2024 at 02:57:38AM +0000, Tian, Kevin wrote:
> Now those shared capabilities are enabled/disabled when the PF is
> attached to/detached from a domain, w/o counting the shared usage
> from VFs.

Urh..

This looks broken in the kernel right now?

If you attach a SVA PASID to a VF then the iommu driver will call 
pci_enable_pri(vf) which will always fail:

	/*
	 * VFs must not implement the PRI Capability.  If their PF
	 * implements PRI, it is shared by the VFs, so if the PF PRI is
	 * enabled, it is also enabled for the VF.
	 */
	if (pdev->is_virtfn) {
		if (pci_physfn(pdev)->pri_enabled)
			return 0;
		return -EINVAL;
	}

More to fix :(

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  5:19                             ` Tian, Kevin
@ 2024-04-24 14:15                               ` Jason Gunthorpe
  2024-04-24 18:38                                 ` Alex Williamson
  0 siblings, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 14:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 24, 2024 8:12 AM
> > 
> > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > >
> > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > and userspace builds the capability blocks.
> > >
> > > this routes back to Alex's original question about using different
> > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.
> > 
> > I'm not sure it is different interfaces..
> > 
> > The only reason to pass the PF's PASID cap is to give free space to
> > the VMM. If we are saying that gaps are free space (excluding a list
> > of bad devices) then we don't acutally need to do that anymore.
> > 
> > VMM will always create a synthetic PASID cap and kernel will always
> > suppress a real one.
> 
> oh you suggest that there won't even be a 1:1 map for PF!

Right. No real need..

> kind of continue with the device_feature method as this series does.
> and it could include all VMM-emulated capabilities which are not
> enumerated properly from vfio pci config space.

1) VFIO creates the iommufd idev
2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
   etc is supported
3) VMM locates empty space in the config space
4) VMM figures out where and what cap blocks to create (considering
   migration needs/etc)
5) VMM synthesizes the blocks and ties emulation to other iommufd things

This works generically for any synthetic vPCI function including a
non-vfio-pci one.

Most likely due to migration needs the exact layout of the PCI config
space should be configured to the VMM, including the location of any
blocks copied from physical and any blocks synthezied. This is the
only way to be sure the config space is actually 100% consistent.

For non migration cases to make it automatic we can check the free
space via gaps. The broken devices that have problems with this can
either be told to use the explicit approach above,the VMM could
consult some text file, or vPASID/etc can be left disabled. IMHO the
support of PASID is so rare today this is probably fine.

Vendors should be *strongly encouraged* to wrap their special used
config space areas in DVSEC and not hide them in free space.

We may also want a DVSEC to indicate free space - but if vendors are
going to change their devices I'd rather them change to mark the used
space with DVSEC then mark the free space :)

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24  0:12                           ` Jason Gunthorpe
  2024-04-24  2:57                             ` Tian, Kevin
  2024-04-24  5:19                             ` Tian, Kevin
@ 2024-04-24 18:24                             ` Alex Williamson
  2024-04-24 18:36                               ` Jason Gunthorpe
  2024-04-25  9:26                               ` Yi Liu
  2 siblings, 2 replies; 72+ messages in thread
From: Alex Williamson @ 2024-04-24 18:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Tue, 23 Apr 2024 21:12:21 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > 
> > > On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:  
> > > > I'm not sure how userspace can fully handle this w/o certain assistance
> > > > from the kernel.
> > > >
> > > > So I kind of agree that emulated PASID capability is probably the only
> > > > contract which the kernel should provide:
> > > >   - mapped 1:1 at the physical location, or
> > > >   - constructed at an offset according to DVSEC, or
> > > >   - constructed at an offset according to a look-up table
> > > >
> > > > The VMM always scans the vfio pci config space to expose vPASID.
> > > >
> > > > Then the remaining open is what VMM could do when a VF supports
> > > > PASID but unfortunately it's not reported by vfio. W/o the capability
> > > > of inspecting the PASID state of PF, probably the only feasible option
> > > > is to maintain a look-up table in VMM itself and assumes the kernel
> > > > always enables the PASID cap on PF.  
> > > 
> > > I'm still not sure I like doing this in the kernel - we need to do the
> > > same sort of thing for ATS too, right?  
> > 
> > VF is allowed to implement ATS.
> > 
> > PRI has the same problem as PASID.  
> 
> I'm surprised by this, I would have guessed ATS would be the device
> global one, PRI not being per-VF seems problematic??? How do you
> disable PRI generation to get a clean shutdown?
> 
> > > It feels simpler if the indicates if PASID and ATS can be supported
> > > and userspace builds the capability blocks.  
> > 
> > this routes back to Alex's original question about using different
> > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> 
> I'm not sure it is different interfaces..
> 
> The only reason to pass the PF's PASID cap is to give free space to
> the VMM. If we are saying that gaps are free space (excluding a list
> of bad devices) then we don't acutally need to do that anymore.

Are we saying that now??  That's new.

> VMM will always create a synthetic PASID cap and kernel will always
> suppress a real one.
> 
> An iommufd query will indicate if the vIOMMU can support vPASID on
> that device.
> 
> Same for all the troublesome non-physical caps.
> 
> > > There are migration considerations too - the blocks need to be
> > > migrated over and end up in the same place as well..  
> > 
> > Can you elaborate what is the problem with the kernel emulating
> > the PASID cap in this consideration?  
> 
> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> something_new then it might change the layout
> 
> We can't just have the kernel decide without also providing a way for
> userspace to say what the right layout actually is. :\

The capability layout is only relevant to migration, right?  A variant
driver that supports migration is a prerequisite and would also be
responsible for exposing the PASID capability.  This isn't as disjoint
as it's being portrayed.

> > Does it talk about a case where the devices between src/dest are
> > different versions (but backward compatible) with different unused
> > space layout and the kernel approach may pick up different offsets
> > while the VMM can guarantee the same offset?  
> 
> That is also a concern where the PCI cap layout may change a bit but
> they are still migration compatible, but my bigger worry is that the
> kernel just lays out the fake caps in a different way because the
> kernel changes.

Outside of migration, what does it matter if the cap layout is
different?  A driver should never hard code the address for a
capability.
 
> At least if the VMM is doing this then the VMM can include the
> information in its migration scheme and use it to recreate the PCI
> layout withotu having to create a bunch of uAPI to do so.

We're again back to migration compatibility, where again the capability
layout would be governed by the migration support in the in-kernel
variant driver.  Once migration is involved the location of a PASID
shouldn't be arbitrary, whether it's provided by the kernel or the VMM.

Regardless, the VMM ultimately has the authority what the guest
sees in config space.  The VMM is not bound to expose the PASID at the
offset provided by the kernel, or bound to expose it at all.  The
kernel exposed PASID can simply provide an available location and set
of enabled capabilities.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:24                             ` Alex Williamson
@ 2024-04-24 18:36                               ` Jason Gunthorpe
  2024-04-24 20:13                                 ` Alex Williamson
  2024-04-25  9:26                               ` Yi Liu
  1 sibling, 1 reply; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 18:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > The only reason to pass the PF's PASID cap is to give free space to
> > the VMM. If we are saying that gaps are free space (excluding a list
> > of bad devices) then we don't acutally need to do that anymore.
> 
> Are we saying that now??  That's new.

I suggested it a few times

> 
> > VMM will always create a synthetic PASID cap and kernel will always
> > suppress a real one.
> > 
> > An iommufd query will indicate if the vIOMMU can support vPASID on
> > that device.
> > 
> > Same for all the troublesome non-physical caps.
> > 
> > > > There are migration considerations too - the blocks need to be
> > > > migrated over and end up in the same place as well..  
> > > 
> > > Can you elaborate what is the problem with the kernel emulating
> > > the PASID cap in this consideration?  
> > 
> > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > something_new then it might change the layout
> > 
> > We can't just have the kernel decide without also providing a way for
> > userspace to say what the right layout actually is. :\
> 
> The capability layout is only relevant to migration, right?  

Yes, proabbly

> A variant
> driver that supports migration is a prerequisite and would also be
> responsible for exposing the PASID capability.  This isn't as disjoint
> as it's being portrayed.

I guess..  But also not quite. We still have the problem that kernel
migration driver V1 could legitimately create a different config space
that migration driver V2

And now you are saying that the migration driver has to parse the
migration stream and readjust its own layout

And every driver needs to do this?

We can, it is a quite big bit of infrastructure I think, but sure..

I fear the VMM still has to be involved somehow because it still has
to know if the source VMM has removed any kernel created caps.

> Outside of migration, what does it matter if the cap layout is
> different?  A driver should never hard code the address for a
> capability.

Yes, talking about migration here - migration is the hardest case it
seems.
 
> > At least if the VMM is doing this then the VMM can include the
> > information in its migration scheme and use it to recreate the PCI
> > layout withotu having to create a bunch of uAPI to do so.
> 
> We're again back to migration compatibility, where again the capability
> layout would be governed by the migration support in the in-kernel
> variant driver.  Once migration is involved the location of a PASID
> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.

I wasn't going in this direction. I was thinking to make the VMM
create the config space layout that is approriate and hold it stable
as a migration ABI.

I think in practice many VMMs are going to do this anyhow unless we
put full support for config space synthesis, stable versions, and
version selection in the kernel directly. I was thinking to avoid
doing that.
 
> Regardless, the VMM ultimately has the authority what the guest
> sees in config space.  The VMM is not bound to expose the PASID at the
> offset provided by the kernel, or bound to expose it at all.  The
> kernel exposed PASID can simply provide an available location and set
> of enabled capabilities. 

And if the VMM is going to ignore the kernel layout then why do so
much work in the kernel to create it?

I think we need to decide, either only the VMM or only the kernel
should do this.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 14:15                               ` Jason Gunthorpe
@ 2024-04-24 18:38                                 ` Alex Williamson
  2024-04-24 18:45                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-24 18:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Wed, 24 Apr 2024 11:15:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, April 24, 2024 8:12 AM
> > > 
> > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > > >
> > > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > > and userspace builds the capability blocks.  
> > > >
> > > > this routes back to Alex's original question about using different
> > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> > > 
> > > I'm not sure it is different interfaces..
> > > 
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't acutally need to do that anymore.
> > > 
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.  
> > 
> > oh you suggest that there won't even be a 1:1 map for PF!  
> 
> Right. No real need..
> 
> > kind of continue with the device_feature method as this series does.
> > and it could include all VMM-emulated capabilities which are not
> > enumerated properly from vfio pci config space.  
> 
> 1) VFIO creates the iommufd idev
> 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
>    etc is supported
> 3) VMM locates empty space in the config space
> 4) VMM figures out where and what cap blocks to create (considering
>    migration needs/etc)
> 5) VMM synthesizes the blocks and ties emulation to other iommufd things
> 
> This works generically for any synthetic vPCI function including a
> non-vfio-pci one.

Maybe this is the actual value in implementing this in the VMM, one
implementation can support multiple device interfaces.

> Most likely due to migration needs the exact layout of the PCI config
> space should be configured to the VMM, including the location of any
> blocks copied from physical and any blocks synthezied. This is the
> only way to be sure the config space is actually 100% consistent.

Where is this concern about config space arbitrarily changing coming
from?  It's possible, yes, but vfio-pci-core or a variant driver are
going to have some sort of reasoning for exposing a capability at a
given offset.  A variant driver is necessary for supporting migration,
that variant driver should be aware that capability offsets are part of
the migration contract, and QEMU will enforce identical config space
unless we introduce exceptions.

> For non migration cases to make it automatic we can check the free
> space via gaps. The broken devices that have problems with this can
> either be told to use the explicit approach above,the VMM could
> consult some text file, or vPASID/etc can be left disabled. IMHO the
> support of PASID is so rare today this is probably fine.
> 
> Vendors should be *strongly encouraged* to wrap their special used
> config space areas in DVSEC and not hide them in free space.

As in my previous reply, this is a new approach that I had thought we
weren't comfortable making, and I'm still not very comfortable with.
The fact is random registers exist outside of capabilities today and
creating a general policy that we're going to deal with that as issues
arise from a generic "find a gap" algorithm is concerning.

> We may also want a DVSEC to indicate free space - but if vendors are
> going to change their devices I'd rather them change to mark the used
> space with DVSEC then mark the free space :)

Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be
great.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:38                                 ` Alex Williamson
@ 2024-04-24 18:45                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 18:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun

On Wed, Apr 24, 2024 at 12:38:51PM -0600, Alex Williamson wrote:
> On Wed, 24 Apr 2024 11:15:25 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 05:19:31AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, April 24, 2024 8:12 AM
> > > > 
> > > > On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> > > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > > Sent: Tuesday, April 23, 2024 8:02 PM
> > > > > >
> > > > > > It feels simpler if the indicates if PASID and ATS can be supported
> > > > > > and userspace builds the capability blocks.  
> > > > >
> > > > > this routes back to Alex's original question about using different
> > > > > interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> > > > 
> > > > I'm not sure it is different interfaces..
> > > > 
> > > > The only reason to pass the PF's PASID cap is to give free space to
> > > > the VMM. If we are saying that gaps are free space (excluding a list
> > > > of bad devices) then we don't acutally need to do that anymore.
> > > > 
> > > > VMM will always create a synthetic PASID cap and kernel will always
> > > > suppress a real one.  
> > > 
> > > oh you suggest that there won't even be a 1:1 map for PF!  
> > 
> > Right. No real need..
> > 
> > > kind of continue with the device_feature method as this series does.
> > > and it could include all VMM-emulated capabilities which are not
> > > enumerated properly from vfio pci config space.  
> > 
> > 1) VFIO creates the iommufd idev
> > 2) VMM queries IOMMUFD_CMD_GET_HW_INFO to learn if PASID, PRI, etc,
> >    etc is supported
> > 3) VMM locates empty space in the config space
> > 4) VMM figures out where and what cap blocks to create (considering
> >    migration needs/etc)
> > 5) VMM synthesizes the blocks and ties emulation to other iommufd things
> > 
> > This works generically for any synthetic vPCI function including a
> > non-vfio-pci one.
> 
> Maybe this is the actual value in implementing this in the VMM, one
> implementation can support multiple device interfaces.
> 
> > Most likely due to migration needs the exact layout of the PCI config
> > space should be configured to the VMM, including the location of any
> > blocks copied from physical and any blocks synthezied. This is the
> > only way to be sure the config space is actually 100% consistent.
> 
> Where is this concern about config space arbitrarily changing coming
> from?

It is important for migration.

Today with the drivers we have the devices have to take care of their
own config space layout only in HW. We are not expecting migration
drivers to worry about any SW created config space. SW config space is
a new thing.

> > We may also want a DVSEC to indicate free space - but if vendors are
> > going to change their devices I'd rather them change to mark the used
> > space with DVSEC then mark the free space :)
> 
> Sure, had we proposed this and had vendor buy-in 10+yrs ago, that'd be
> great

We are applying this going forward to PASID, and PRI, which barely
exist on devices at all today. We don't need to worry about those
10+yr old devices that don't support PASID/PRI in the first place.

I think this makes the problem much smaller.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:36                               ` Jason Gunthorpe
@ 2024-04-24 20:13                                 ` Alex Williamson
  2024-04-26 14:11                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-24 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun, Cédric Le Goater

On Wed, 24 Apr 2024 15:36:26 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 12:24:37PM -0600, Alex Williamson wrote:
> > > The only reason to pass the PF's PASID cap is to give free space to
> > > the VMM. If we are saying that gaps are free space (excluding a list
> > > of bad devices) then we don't acutally need to do that anymore.  
> > 
> > Are we saying that now??  That's new.  
> 
> I suggested it a few times
> 
> >   
> > > VMM will always create a synthetic PASID cap and kernel will always
> > > suppress a real one.
> > > 
> > > An iommufd query will indicate if the vIOMMU can support vPASID on
> > > that device.
> > > 
> > > Same for all the troublesome non-physical caps.
> > >   
> > > > > There are migration considerations too - the blocks need to be
> > > > > migrated over and end up in the same place as well..    
> > > > 
> > > > Can you elaborate what is the problem with the kernel emulating
> > > > the PASID cap in this consideration?    
> > > 
> > > If the kernel changes the algorithm, say it wants to do PASID, PRI,
> > > something_new then it might change the layout
> > > 
> > > We can't just have the kernel decide without also providing a way for
> > > userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?    
> 
> Yes, proabbly
> 
> > A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.  
> 
> I guess..  But also not quite. We still have the problem that kernel
> migration driver V1 could legitimately create a different config space
> that migration driver V2
> 
> And now you are saying that the migration driver has to parse the
> migration stream and readjust its own layout
> 
> And every driver needs to do this?
> 
> We can, it is a quite big bit of infrastructure I think, but sure..
> 
> I fear the VMM still has to be involved somehow because it still has
> to know if the source VMM has removed any kernel created caps.

This is kind of an absurd example to portray as a ubiquitous problem.
Typically the config space layout is a reflection of hardware whether
the device supports migration or not.  If a driver were to insert a
virtual capability, then yes it would want to be consistent about it if
it also cares about migration.  If the driver needs to change the
location of a virtual capability, problems will arise, but that's also
not something that every driver needs to do.

Also, how exactly does emulating the capability in the VMM solve this
problem?  Currently QEMU migration simply applies state to an identical
VM on the target.  QEMU doesn't modify the target VM to conform to the
data stream.  So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.

> > Outside of migration, what does it matter if the cap layout is
> > different?  A driver should never hard code the address for a
> > capability.  
> 
> Yes, talking about migration here - migration is the hardest case it
> seems.
>  
> > > At least if the VMM is doing this then the VMM can include the
> > > information in its migration scheme and use it to recreate the PCI
> > > layout withotu having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.  
> 
> I wasn't going in this direction. I was thinking to make the VMM
> create the config space layout that is approriate and hold it stable
> as a migration ABI.
> 
> I think in practice many VMMs are going to do this anyhow unless we
> put full support for config space synthesis, stable versions, and
> version selection in the kernel directly. I was thinking to avoid
> doing that.

Currently QEMU replies on determinism that a given command line results
in an identical machine configuration and identical devices.  State of
that target VM is then populated, not defined by, the migration stream.

> > Regardless, the VMM ultimately has the authority what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.   
> 
> And if the VMM is going to ignore the kernel layout then why do so
> much work in the kernel to create it?

Ok, let's not ignore it ;)

> I think we need to decide, either only the VMM or only the kernel
> should do this.

What are you actually proposing?  Are you suggesting that if a device
supports the Power Management capability it will be virtualized at
offset 0x60, if the device supports the MSI capability it will be
virtualized at 0x68,... if a device supports PASID it will be
virtualized at offset 0x300, etc...?

That's not only impractical because we can't layout all the capabilities
within the available space, but also because we will run into masking
hidden registers and devices where the driver hard codes a capability
offset.

If the VMM implements the "find a gap" solution then it's just as
subject to config space changes in hardware or provided by the variant
driver.

If not either of those, are we hard coding a device specific config
space map into the VMM or providing one on the command line?  I thought
we were using vfio-pci variant drivers and a defined vfio migration API
in order to prevent modifying the VMM for every device we want to
support migration.  Also I'd wonder if the driver itself shouldn't be
configured to provide a compatible type.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 18:24                             ` Alex Williamson
  2024-04-24 18:36                               ` Jason Gunthorpe
@ 2024-04-25  9:26                               ` Yi Liu
  2024-04-25 12:58                                 ` Alex Williamson
  1 sibling, 1 reply; 72+ messages in thread
From: Yi Liu @ 2024-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Tian, Kevin, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun

On 2024/4/25 02:24, Alex Williamson wrote:
> On Tue, 23 Apr 2024 21:12:21 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>
>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>> from the kernel.
>>>>>
>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>> contract which the kernel should provide:
>>>>>    - mapped 1:1 at the physical location, or
>>>>>    - constructed at an offset according to DVSEC, or
>>>>>    - constructed at an offset according to a look-up table
>>>>>
>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>
>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>> always enables the PASID cap on PF.
>>>>
>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>> same sort of thing for ATS too, right?
>>>
>>> VF is allowed to implement ATS.
>>>
>>> PRI has the same problem as PASID.
>>
>> I'm surprised by this, I would have guessed ATS would be the device
>> global one, PRI not being per-VF seems problematic??? How do you
>> disable PRI generation to get a clean shutdown?
>>
>>>> It feels simpler if the indicates if PASID and ATS can be supported
>>>> and userspace builds the capability blocks.
>>>
>>> this routes back to Alex's original question about using different
>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.
>>
>> I'm not sure it is different interfaces..
>>
>> The only reason to pass the PF's PASID cap is to give free space to
>> the VMM. If we are saying that gaps are free space (excluding a list
>> of bad devices) then we don't acutally need to do that anymore.
> 
> Are we saying that now??  That's new.
> 
>> VMM will always create a synthetic PASID cap and kernel will always
>> suppress a real one.
>>
>> An iommufd query will indicate if the vIOMMU can support vPASID on
>> that device.
>>
>> Same for all the troublesome non-physical caps.
>>
>>>> There are migration considerations too - the blocks need to be
>>>> migrated over and end up in the same place as well..
>>>
>>> Can you elaborate what is the problem with the kernel emulating
>>> the PASID cap in this consideration?
>>
>> If the kernel changes the algorithm, say it wants to do PASID, PRI,
>> something_new then it might change the layout
>>
>> We can't just have the kernel decide without also providing a way for
>> userspace to say what the right layout actually is. :\
> 
> The capability layout is only relevant to migration, right?  A variant
> driver that supports migration is a prerequisite and would also be
> responsible for exposing the PASID capability.  This isn't as disjoint
> as it's being portrayed.
> 
>>> Does it talk about a case where the devices between src/dest are
>>> different versions (but backward compatible) with different unused
>>> space layout and the kernel approach may pick up different offsets
>>> while the VMM can guarantee the same offset?
>>
>> That is also a concern where the PCI cap layout may change a bit but
>> they are still migration compatible, but my bigger worry is that the
>> kernel just lays out the fake caps in a different way because the
>> kernel changes.
> 
> Outside of migration, what does it matter if the cap layout is
> different?  A driver should never hard code the address for a
> capability.
>   

But it may store the offset of capability to make next cap access more
convenient. I noticted struct pci_dev stores the offset of PRI and PASID
cap. So if the layout of config space changed between src and dst, it may
result in problem in guest when guest driver uses the offsets to access
PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
msix). So there is already a problem even put aside the PRI and PASID cap.

#ifdef CONFIG_PCI_PRI
	u16		pri_cap;	/* PRI Capability offset */
	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
	unsigned int	pasid_required:1; /* PRG Response PASID Required */
#endif
#ifdef CONFIG_PCI_PASID
	u16		pasid_cap;	/* PASID Capability offset */
	u16		pasid_features;
#endif
#ifdef CONFIG_PCI_P2PDMA
	struct pci_p2pdma __rcu *p2pdma;
#endif
#ifdef CONFIG_PCI_DOE
	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
#endif
	u16		acs_cap;	/* ACS Capability offset */

https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350

>> At least if the VMM is doing this then the VMM can include the
>> information in its migration scheme and use it to recreate the PCI
>> layout withotu having to create a bunch of uAPI to do so.
> 
> We're again back to migration compatibility, where again the capability
> layout would be governed by the migration support in the in-kernel
> variant driver.  Once migration is involved the location of a PASID
> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
> 
> Regardless, the VMM ultimately has the authority what the guest
> sees in config space.  The VMM is not bound to expose the PASID at the
> offset provided by the kernel, or bound to expose it at all.  The
> kernel exposed PASID can simply provide an available location and set
> of enabled capabilities.  Thanks,
> 
> Alex
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-25  9:26                               ` Yi Liu
@ 2024-04-25 12:58                                 ` Alex Williamson
  2024-04-26  9:01                                   ` Yi Liu
  0 siblings, 1 reply; 72+ messages in thread
From: Alex Williamson @ 2024-04-25 12:58 UTC (permalink / raw)
  To: Yi Liu
  Cc: Jason Gunthorpe, Tian, Kevin, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On Thu, 25 Apr 2024 17:26:54 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2024/4/25 02:24, Alex Williamson wrote:
> > On Tue, 23 Apr 2024 21:12:21 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:  
> >>>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>>> Sent: Tuesday, April 23, 2024 8:02 PM
> >>>>
> >>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:  
> >>>>> I'm not sure how userspace can fully handle this w/o certain assistance
> >>>>> from the kernel.
> >>>>>
> >>>>> So I kind of agree that emulated PASID capability is probably the only
> >>>>> contract which the kernel should provide:
> >>>>>    - mapped 1:1 at the physical location, or
> >>>>>    - constructed at an offset according to DVSEC, or
> >>>>>    - constructed at an offset according to a look-up table
> >>>>>
> >>>>> The VMM always scans the vfio pci config space to expose vPASID.
> >>>>>
> >>>>> Then the remaining open is what VMM could do when a VF supports
> >>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
> >>>>> of inspecting the PASID state of PF, probably the only feasible option
> >>>>> is to maintain a look-up table in VMM itself and assumes the kernel
> >>>>> always enables the PASID cap on PF.  
> >>>>
> >>>> I'm still not sure I like doing this in the kernel - we need to do the
> >>>> same sort of thing for ATS too, right?  
> >>>
> >>> VF is allowed to implement ATS.
> >>>
> >>> PRI has the same problem as PASID.  
> >>
> >> I'm surprised by this, I would have guessed ATS would be the device
> >> global one, PRI not being per-VF seems problematic??? How do you
> >> disable PRI generation to get a clean shutdown?
> >>  
> >>>> It feels simpler if the indicates if PASID and ATS can be supported
> >>>> and userspace builds the capability blocks.  
> >>>
> >>> this routes back to Alex's original question about using different
> >>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.  
> >>
> >> I'm not sure it is different interfaces..
> >>
> >> The only reason to pass the PF's PASID cap is to give free space to
> >> the VMM. If we are saying that gaps are free space (excluding a list
> >> of bad devices) then we don't acutally need to do that anymore.  
> > 
> > Are we saying that now??  That's new.
> >   
> >> VMM will always create a synthetic PASID cap and kernel will always
> >> suppress a real one.
> >>
> >> An iommufd query will indicate if the vIOMMU can support vPASID on
> >> that device.
> >>
> >> Same for all the troublesome non-physical caps.
> >>  
> >>>> There are migration considerations too - the blocks need to be
> >>>> migrated over and end up in the same place as well..  
> >>>
> >>> Can you elaborate what is the problem with the kernel emulating
> >>> the PASID cap in this consideration?  
> >>
> >> If the kernel changes the algorithm, say it wants to do PASID, PRI,
> >> something_new then it might change the layout
> >>
> >> We can't just have the kernel decide without also providing a way for
> >> userspace to say what the right layout actually is. :\  
> > 
> > The capability layout is only relevant to migration, right?  A variant
> > driver that supports migration is a prerequisite and would also be
> > responsible for exposing the PASID capability.  This isn't as disjoint
> > as it's being portrayed.
> >   
> >>> Does it talk about a case where the devices between src/dest are
> >>> different versions (but backward compatible) with different unused
> >>> space layout and the kernel approach may pick up different offsets
> >>> while the VMM can guarantee the same offset?  
> >>
> >> That is also a concern where the PCI cap layout may change a bit but
> >> they are still migration compatible, but my bigger worry is that the
> >> kernel just lays out the fake caps in a different way because the
> >> kernel changes.  
> > 
> > Outside of migration, what does it matter if the cap layout is
    ^^^^^^^^^^^^^^^^^^^^
> > different?  A driver should never hard code the address for a
> > capability.
> >     
> 
> But it may store the offset of capability to make next cap access more
> convenient. I noticted struct pci_dev stores the offset of PRI and PASID
> cap. So if the layout of config space changed between src and dst, it may
> result in problem in guest when guest driver uses the offsets to access
> PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
> msix). So there is already a problem even put aside the PRI and PASID cap.

Yes, I had noted "outside of migration" above.  Config space must be
consistent to a running VM.  But the possibility of config space
changing like this only exists in the case where the driver supports
migration, so I think we're inventing an unrealistic concern that a
driver that supports migration would arbitrarily modify the config space
layout in order to make an argument for VMM managed layout.  Thanks,

Alex

> #ifdef CONFIG_PCI_PRI
> 	u16		pri_cap;	/* PRI Capability offset */
> 	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
> #endif
> #ifdef CONFIG_PCI_PASID
> 	u16		pasid_cap;	/* PASID Capability offset */
> 	u16		pasid_features;
> #endif
> #ifdef CONFIG_PCI_P2PDMA
> 	struct pci_p2pdma __rcu *p2pdma;
> #endif
> #ifdef CONFIG_PCI_DOE
> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
> #endif
> 	u16		acs_cap;	/* ACS Capability offset */
> 
> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
> 
> >> At least if the VMM is doing this then the VMM can include the
> >> information in its migration scheme and use it to recreate the PCI
> >> layout withotu having to create a bunch of uAPI to do so.  
> > 
> > We're again back to migration compatibility, where again the capability
> > layout would be governed by the migration support in the in-kernel
> > variant driver.  Once migration is involved the location of a PASID
> > shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
> > 
> > Regardless, the VMM ultimately has the authority what the guest
> > sees in config space.  The VMM is not bound to expose the PASID at the
> > offset provided by the kernel, or bound to expose it at all.  The
> > kernel exposed PASID can simply provide an available location and set
> > of enabled capabilities.  Thanks,
> > 
> > Alex
> >   
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-25 12:58                                 ` Alex Williamson
@ 2024-04-26  9:01                                   ` Yi Liu
  0 siblings, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-26  9:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Tian, Kevin, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun

On 2024/4/25 20:58, Alex Williamson wrote:
> On Thu, 25 Apr 2024 17:26:54 +0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
>> On 2024/4/25 02:24, Alex Williamson wrote:
>>> On Tue, 23 Apr 2024 21:12:21 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>    
>>>> On Tue, Apr 23, 2024 at 11:47:50PM +0000, Tian, Kevin wrote:
>>>>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>>>>> Sent: Tuesday, April 23, 2024 8:02 PM
>>>>>>
>>>>>> On Tue, Apr 23, 2024 at 07:43:27AM +0000, Tian, Kevin wrote:
>>>>>>> I'm not sure how userspace can fully handle this w/o certain assistance
>>>>>>> from the kernel.
>>>>>>>
>>>>>>> So I kind of agree that emulated PASID capability is probably the only
>>>>>>> contract which the kernel should provide:
>>>>>>>     - mapped 1:1 at the physical location, or
>>>>>>>     - constructed at an offset according to DVSEC, or
>>>>>>>     - constructed at an offset according to a look-up table
>>>>>>>
>>>>>>> The VMM always scans the vfio pci config space to expose vPASID.
>>>>>>>
>>>>>>> Then the remaining open is what VMM could do when a VF supports
>>>>>>> PASID but unfortunately it's not reported by vfio. W/o the capability
>>>>>>> of inspecting the PASID state of PF, probably the only feasible option
>>>>>>> is to maintain a look-up table in VMM itself and assumes the kernel
>>>>>>> always enables the PASID cap on PF.
>>>>>>
>>>>>> I'm still not sure I like doing this in the kernel - we need to do the
>>>>>> same sort of thing for ATS too, right?
>>>>>
>>>>> VF is allowed to implement ATS.
>>>>>
>>>>> PRI has the same problem as PASID.
>>>>
>>>> I'm surprised by this, I would have guessed ATS would be the device
>>>> global one, PRI not being per-VF seems problematic??? How do you
>>>> disable PRI generation to get a clean shutdown?
>>>>   
>>>>>> It feels simpler if the indicates if PASID and ATS can be supported
>>>>>> and userspace builds the capability blocks.
>>>>>
>>>>> this routes back to Alex's original question about using different
>>>>> interfaces (a device feature vs. PCI PASID cap) for VF and PF.
>>>>
>>>> I'm not sure it is different interfaces..
>>>>
>>>> The only reason to pass the PF's PASID cap is to give free space to
>>>> the VMM. If we are saying that gaps are free space (excluding a list
>>>> of bad devices) then we don't acutally need to do that anymore.
>>>
>>> Are we saying that now??  That's new.
>>>    
>>>> VMM will always create a synthetic PASID cap and kernel will always
>>>> suppress a real one.
>>>>
>>>> An iommufd query will indicate if the vIOMMU can support vPASID on
>>>> that device.
>>>>
>>>> Same for all the troublesome non-physical caps.
>>>>   
>>>>>> There are migration considerations too - the blocks need to be
>>>>>> migrated over and end up in the same place as well..
>>>>>
>>>>> Can you elaborate what is the problem with the kernel emulating
>>>>> the PASID cap in this consideration?
>>>>
>>>> If the kernel changes the algorithm, say it wants to do PASID, PRI,
>>>> something_new then it might change the layout
>>>>
>>>> We can't just have the kernel decide without also providing a way for
>>>> userspace to say what the right layout actually is. :\
>>>
>>> The capability layout is only relevant to migration, right?  A variant
>>> driver that supports migration is a prerequisite and would also be
>>> responsible for exposing the PASID capability.  This isn't as disjoint
>>> as it's being portrayed.
>>>    
>>>>> Does it talk about a case where the devices between src/dest are
>>>>> different versions (but backward compatible) with different unused
>>>>> space layout and the kernel approach may pick up different offsets
>>>>> while the VMM can guarantee the same offset?
>>>>
>>>> That is also a concern where the PCI cap layout may change a bit but
>>>> they are still migration compatible, but my bigger worry is that the
>>>> kernel just lays out the fake caps in a different way because the
>>>> kernel changes.
>>>
>>> Outside of migration, what does it matter if the cap layout is
>      ^^^^^^^^^^^^^^^^^^^^
>>> different?  A driver should never hard code the address for a
>>> capability.
>>>      
>>
>> But it may store the offset of capability to make next cap access more
>> convenient. I noticted struct pci_dev stores the offset of PRI and PASID
>> cap. So if the layout of config space changed between src and dst, it may
>> result in problem in guest when guest driver uses the offsets to access
>> PRI/PASID cap. I can see pci_dev stores offsets of other caps (acs, msi,
>> msix). So there is already a problem even put aside the PRI and PASID cap.
> 
> Yes, I had noted "outside of migration" above.  Config space must be
> consistent to a running VM.  But the possibility of config space
> changing like this only exists in the case where the driver supports
> migration, so I think we're inventing an unrealistic concern that a
> driver that supports migration would arbitrarily modify the config space
> layout in order to make an argument for VMM managed layout.  Thanks,

I was considering a case in which the src VM has a v1 hw device, while the
dst VM has a v2 hw device. Due to whatever reasons, the config space
layouts are different between v1 and v2 hw. Since the current vfio copies
the physical layout (except for the hidden caps) to VMM, the layout between
the src and dst VM would be different. This would result in problem since 
the cap offsets are stale. It seems to be a problem you mentioned in
another email of this thread[1]. :)

[1] "So in either case, the problem might be more along the
lines of how to make a V1 device from a V2 driver, which is more the
device type/flavor/persona problem.
" 
https://lore.kernel.org/linux-iommu/20240424141349.376bdbf9.alex.williamson@redhat.com/


> Alex
> 
>> #ifdef CONFIG_PCI_PRI
>> 	u16		pri_cap;	/* PRI Capability offset */
>> 	u32		pri_reqs_alloc; /* Number of PRI requests allocated */
>> 	unsigned int	pasid_required:1; /* PRG Response PASID Required */
>> #endif
>> #ifdef CONFIG_PCI_PASID
>> 	u16		pasid_cap;	/* PASID Capability offset */
>> 	u16		pasid_features;
>> #endif
>> #ifdef CONFIG_PCI_P2PDMA
>> 	struct pci_p2pdma __rcu *p2pdma;
>> #endif
>> #ifdef CONFIG_PCI_DOE
>> 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
>> #endif
>> 	u16		acs_cap;	/* ACS Capability offset */
>>
>> https://github.com/torvalds/linux/blob/master/include/linux/pci.h#L350
>>
>>>> At least if the VMM is doing this then the VMM can include the
>>>> information in its migration scheme and use it to recreate the PCI
>>>> layout withotu having to create a bunch of uAPI to do so.
>>>
>>> We're again back to migration compatibility, where again the capability
>>> layout would be governed by the migration support in the in-kernel
>>> variant driver.  Once migration is involved the location of a PASID
>>> shouldn't be arbitrary, whether it's provided by the kernel or the VMM.
>>>
>>> Regardless, the VMM ultimately has the authority what the guest
>>> sees in config space.  The VMM is not bound to expose the PASID at the
>>> offset provided by the kernel, or bound to expose it at all.  The
>>> kernel exposed PASID can simply provide an available location and set
>>> of enabled capabilities.  Thanks,
>>>
>>> Alex
>>>    
>>
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-24 20:13                                 ` Alex Williamson
@ 2024-04-26 14:11                                   ` Jason Gunthorpe
  2024-04-26 20:13                                     ` Alex Williamson
  2024-04-27  5:05                                     ` Christoph Hellwig
  0 siblings, 2 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-26 14:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun, Cédric Le Goater

On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:

> This is kind of an absurd example to portray as a ubiquitous problem.
> Typically the config space layout is a reflection of hardware whether
> the device supports migration or not.

Er, all our HW has FW constructed config space. It changes with FW
upgrades. We change it during the life of the product. This has to be
considered..

> If a driver were to insert a
> virtual capability, then yes it would want to be consistent about it if
> it also cares about migration.  If the driver needs to change the
> location of a virtual capability, problems will arise, but that's also
> not something that every driver needs to do.

Well, mlx5 has to cope with this. It supports so many devices with so
many config space layouts :( I don't know if we can just hard wire an
offset to stick in a PASID cap and expect that to work...

> Also, how exactly does emulating the capability in the VMM solve this
> problem?  Currently QEMU migration simply applies state to an identical
> VM on the target.  QEMU doesn't modify the target VM to conform to the
> data stream.  So in either case, the problem might be more along the
> lines of how to make a V1 device from a V2 driver, which is more the
> device type/flavor/persona problem.

Yes, it doesn't solve anything, it just puts the responsibility for
something that is very complicated in userspace where there are more
options to configure and customize it to the environment.

> Currently QEMU replies on determinism that a given command line results
> in an identical machine configuration and identical devices.  State of
> that target VM is then populated, not defined by, the migration stream.

But that won't be true if the kernel is making decisions. The config
space layout depends now on the kernel driver version too.

> > I think we need to decide, either only the VMM or only the kernel
> > should do this.
> 
> What are you actually proposing?

Okay, what I'm thinking about is a text file that describes the vPCI
function configuration space to create. The community will standardize
this and VMMs will have to implement to get PASID/etc. Maybe the
community will provide a BSD licensed library to do this job.

The text file allows the operator to specify exactly the configuration
space the VFIO function should have. It would not be derived
automatically from physical. AFAIK qemu does not have this capability
currently.

This reflects my observation and discussions around the live migration
standardization. I belive we are fast reaching a point where this is
required.

Consider standards based migration between wildly different
devices. The devices will not standardize their physical config space,
but an operator could generate a consistent vPCI config space that
works with all the devices in their fleet.

Consider the usual working model of the large operators - they define
instance types with some regularity. But an instance type is fixed in
concrete once it is specified, things like the vPCI config space are
fixed.

Running Instance A on newer hardware with a changed physical config
space should continue to present Instance A's vPCI config layout
regardless. Ie Instance A might not support PASID but Instance B can
run on newer HW that does. The config space layout depends on the
requested Instance Type, not the physical layout.

The auto-configuration of the config layout from physical is a nice
feature and is excellent for development/small scale, but it shouldn't
be the only way to work.

So - if we accept that text file configuration should be something the
VMM supports then let's reconsider how to solve the PASID problem.

I'd say the way to solve it should be via a text file specifying a
full config space layout that includes the PASID cap. From the VMM
perspective this works fine, and it ports to every VMM directly via
processing the text file.

The autoconfiguration use case can be done by making a tool build the
text file by deriving it from physical, much like today. The single
instance of that tool could have device specific knowledge to avoid
quirks. This way the smarts can still be shared by all the VMMs
without going into the kernel. Special devices with hidden config
space could get special quirks or special reference text files into
the tool repo.

Serious operators doing production SRIOV/etc would negotiate the text
file with the HW vendors when they define their Instance Type. Ideally
these reference text files would be contributed to the tool repo
above. I think there would be some nice idea to define fully open
source Instance Types that include VFIO devices too.

Is it too much of a fantasy?

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 14:11                                   ` Jason Gunthorpe
@ 2024-04-26 20:13                                     ` Alex Williamson
  2024-04-28  6:19                                       ` Tian, Kevin
  2024-04-29 17:44                                       ` Jason Gunthorpe
  2024-04-27  5:05                                     ` Christoph Hellwig
  1 sibling, 2 replies; 72+ messages in thread
From: Alex Williamson @ 2024-04-26 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun, Cédric Le Goater

On Fri, 26 Apr 2024 11:11:17 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.  
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

So as I understand it, the concern is that you have firmware that
supports migration, but it also openly hostile to the fundamental
aspects of exposing a stable device ABI in support of migration.

> > If a driver were to insert a
> > virtual capability, then yes it would want to be consistent about it if
> > it also cares about migration.  If the driver needs to change the
> > location of a virtual capability, problems will arise, but that's also
> > not something that every driver needs to do.  
> 
> Well, mlx5 has to cope with this. It supports so many devices with so
> many config space layouts :( I don't know if we can just hard wire an
> offset to stick in a PASID cap and expect that to work...
> 
> > Also, how exactly does emulating the capability in the VMM solve this
> > problem?  Currently QEMU migration simply applies state to an identical
> > VM on the target.  QEMU doesn't modify the target VM to conform to the
> > data stream.  So in either case, the problem might be more along the
> > lines of how to make a V1 device from a V2 driver, which is more the
> > device type/flavor/persona problem.  
> 
> Yes, it doesn't solve anything, it just puts the responsibility for
> something that is very complicated in userspace where there are more
> options to configure and customize it to the environment.
> 
> > Currently QEMU replies on determinism that a given command line results
> > in an identical machine configuration and identical devices.  State of
> > that target VM is then populated, not defined by, the migration stream.  
> 
> But that won't be true if the kernel is making decisions. The config
> space layout depends now on the kernel driver version too.

But in the cases where we support migration there's a device specific
variant driver that supports that migration.  It's the job of that
variant driver to not only export and import the device state, but also
to provide a consistent ABI to the user, which includes the config
space layout.  I don't understand why we'd say the device programming
ABI itself falls within the purview of the device/variant driver, but
PCI config space is defined by device specific code at a higher level.

> > > I think we need to decide, either only the VMM or only the kernel
> > > should do this.  
> > 
> > What are you actually proposing?  
> 
> Okay, what I'm thinking about is a text file that describes the vPCI
> function configuration space to create. The community will standardize
> this and VMMs will have to implement to get PASID/etc. Maybe the
> community will provide a BSD licensed library to do this job.
> 
> The text file allows the operator to specify exactly the configuration
> space the VFIO function should have. It would not be derived
> automatically from physical. AFAIK qemu does not have this capability
> currently.
> 
> This reflects my observation and discussions around the live migration
> standardization. I belive we are fast reaching a point where this is
> required.
> 
> Consider standards based migration between wildly different
> devices. The devices will not standardize their physical config space,
> but an operator could generate a consistent vPCI config space that
> works with all the devices in their fleet.
> 
> Consider the usual working model of the large operators - they define
> instance types with some regularity. But an instance type is fixed in
> concrete once it is specified, things like the vPCI config space are
> fixed.
> 
> Running Instance A on newer hardware with a changed physical config
> space should continue to present Instance A's vPCI config layout
> regardless. Ie Instance A might not support PASID but Instance B can
> run on newer HW that does. The config space layout depends on the
> requested Instance Type, not the physical layout.
> 
> The auto-configuration of the config layout from physical is a nice
> feature and is excellent for development/small scale, but it shouldn't
> be the only way to work.
> 
> So - if we accept that text file configuration should be something the
> VMM supports then let's reconsider how to solve the PASID problem.
> 
> I'd say the way to solve it should be via a text file specifying a
> full config space layout that includes the PASID cap. From the VMM
> perspective this works fine, and it ports to every VMM directly via
> processing the text file.
> 
> The autoconfiguration use case can be done by making a tool build the
> text file by deriving it from physical, much like today. The single
> instance of that tool could have device specific knowledge to avoid
> quirks. This way the smarts can still be shared by all the VMMs
> without going into the kernel. Special devices with hidden config
> space could get special quirks or special reference text files into
> the tool repo.
> 
> Serious operators doing production SRIOV/etc would negotiate the text
> file with the HW vendors when they define their Instance Type. Ideally
> these reference text files would be contributed to the tool repo
> above. I think there would be some nice idea to define fully open
> source Instance Types that include VFIO devices too.

Regarding "if we accept that text file configuration should be
something the VMM supports", I'm not on board with this yet, so
applying it to PASID discussion seems premature.

We've developed variant drivers specifically to host the device specific
aspects of migration support.  The requirement of a consistent config
space layout is a problem that only exists relative to migration.  This
is an issue that I would have considered the responsibility of the
variant driver, which would likely expect a consistent interface from
the hardware/firmware.  Why does hostile firmware suddenly make it the
VMM's problem to provide a consistent ABI to the config space of the
device rather than the variant driver?

Obviously config maps are something that a VMM could do, but it also
seems to impose a non-trivial burden that every VMM requires an
implementation of a config space map and integration for each device
rather than simply expecting the exposed config space of the device to
be part of the migration ABI.  Also this solution specifically only
addresses config space compatibility without considering the more
generic issue that a variant driver can expose different device
personas.  A versioned persona and config space virtualization in the
variant driver is a much more flexible solution.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 14:11                                   ` Jason Gunthorpe
  2024-04-26 20:13                                     ` Alex Williamson
@ 2024-04-27  5:05                                     ` Christoph Hellwig
  1 sibling, 0 replies; 72+ messages in thread
From: Christoph Hellwig @ 2024-04-27  5:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Tian, Kevin, Liu, Yi L, joro, robin.murphy,
	eric.auger, nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan,
	Zhenzhong, Pan, Jacob jun, Cédric Le Goater

On Fri, Apr 26, 2024 at 11:11:17AM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> 
> > This is kind of an absurd example to portray as a ubiquitous problem.
> > Typically the config space layout is a reflection of hardware whether
> > the device supports migration or not.
> 
> Er, all our HW has FW constructed config space. It changes with FW
> upgrades. We change it during the life of the product. This has to be
> considered..

FYI, the same is true for just about any storage device I know.  Maybe
not all of the config space, but definitely parts of it.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 20:13                                     ` Alex Williamson
@ 2024-04-28  6:19                                       ` Tian, Kevin
  2024-04-29  7:43                                         ` Yi Liu
  2024-04-29 17:15                                         ` Jason Gunthorpe
  2024-04-29 17:44                                       ` Jason Gunthorpe
  1 sibling, 2 replies; 72+ messages in thread
From: Tian, Kevin @ 2024-04-28  6:19 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc, kvm,
	chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 27, 2024 4:14 AM
> 
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> >
> > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > Typically the config space layout is a reflection of hardware whether
> > > the device supports migration or not.
> >
> > Er, all our HW has FW constructed config space. It changes with FW
> > upgrades. We change it during the life of the product. This has to be
> > considered..
> 
> So as I understand it, the concern is that you have firmware that
> supports migration, but it also openly hostile to the fundamental
> aspects of exposing a stable device ABI in support of migration.
> 
> > > If a driver were to insert a
> > > virtual capability, then yes it would want to be consistent about it if
> > > it also cares about migration.  If the driver needs to change the
> > > location of a virtual capability, problems will arise, but that's also
> > > not something that every driver needs to do.
> >
> > Well, mlx5 has to cope with this. It supports so many devices with so
> > many config space layouts :( I don't know if we can just hard wire an
> > offset to stick in a PASID cap and expect that to work...

Are those config space layout differences usually also coming with
mmio-side interface change? If yes there are more to handle for
running V1 instance on V2 device and it'd make sense to manage
everything about compatibility in one place.

If we pursue the direction deciding the vconfig layout in VMM, does
it imply that anything related to mmio layout would also be put in
VMM too?

e.g. it's not unusual to see a device mmio layout as:

	REG_BASE:  base addr of a register block in a BAR
	REG_FEAT1_OFF: feature1 regs offset in the register block
	REG_FEAT2_OFF: feature2 regs offset in the register block
	...

Driver accesses registers according to those read-only offsets.

A FW upgrade may lead to change of offsets but functions stay the
same. An instance created on an old version can be migrated to 
the new version as long as accesses to old offsets are trapped
and routed to the new offsets.

Do we envision things like above in the variant driver or in VMM?

> > >
> > > What are you actually proposing?
> >
> > Okay, what I'm thinking about is a text file that describes the vPCI
> > function configuration space to create. The community will standardize
> > this and VMMs will have to implement to get PASID/etc. Maybe the
> > community will provide a BSD licensed library to do this job.
> >
> > The text file allows the operator to specify exactly the configuration
> > space the VFIO function should have. It would not be derived
> > automatically from physical. AFAIK qemu does not have this capability
> > currently.
> >
> > This reflects my observation and discussions around the live migration
> > standardization. I belive we are fast reaching a point where this is
> > required.
> >
> > Consider standards based migration between wildly different
> > devices. The devices will not standardize their physical config space,
> > but an operator could generate a consistent vPCI config space that
> > works with all the devices in their fleet.

It's hard to believe that 'wildly different' devices only have difference
in the layout of vPCI config space. 

> >
> > Consider the usual working model of the large operators - they define
> > instance types with some regularity. But an instance type is fixed in
> > concrete once it is specified, things like the vPCI config space are
> > fixed.
> >
> > Running Instance A on newer hardware with a changed physical config
> > space should continue to present Instance A's vPCI config layout
> > regardless. Ie Instance A might not support PASID but Instance B can
> > run on newer HW that does. The config space layout depends on the
> > requested Instance Type, not the physical layout.
> >
> > The auto-configuration of the config layout from physical is a nice
> > feature and is excellent for development/small scale, but it shouldn't
> > be the only way to work.
> >
> > So - if we accept that text file configuration should be something the
> > VMM supports then let's reconsider how to solve the PASID problem.
> >
> > I'd say the way to solve it should be via a text file specifying a
> > full config space layout that includes the PASID cap. From the VMM
> > perspective this works fine, and it ports to every VMM directly via
> > processing the text file.
> >
> > The autoconfiguration use case can be done by making a tool build the
> > text file by deriving it from physical, much like today. The single
> > instance of that tool could have device specific knowledge to avoid
> > quirks. This way the smarts can still be shared by all the VMMs
> > without going into the kernel. Special devices with hidden config
> > space could get special quirks or special reference text files into
> > the tool repo.
> >
> > Serious operators doing production SRIOV/etc would negotiate the text
> > file with the HW vendors when they define their Instance Type. Ideally
> > these reference text files would be contributed to the tool repo
> > above. I think there would be some nice idea to define fully open
> > source Instance Types that include VFIO devices too.
> 
> Regarding "if we accept that text file configuration should be
> something the VMM supports", I'm not on board with this yet, so
> applying it to PASID discussion seems premature.
> 
> We've developed variant drivers specifically to host the device specific
> aspects of migration support.  The requirement of a consistent config
> space layout is a problem that only exists relative to migration.  This
> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?
> 
> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  Also this solution specifically only
> addresses config space compatibility without considering the more
> generic issue that a variant driver can expose different device
> personas.  A versioned persona and config space virtualization in the
> variant driver is a much more flexible solution.  Thanks,
> 

and looks this community lacks of a clear criteria on what burden
should be put in the kernel vs. in the VMM.

e.g. in earlier nvgrace-gpu discussion a major open was whether
the PCI bar emulation should be done by the variant driver or
by the VMM (with variant driver providing a device feature).

It ends up to be in the variant driver with one major argument
that doing so avoids the burden in various VMMs.

But now seems the 'text-file' proposal heads the opposite direction?

btw while this discussion may continue some time, I wonder whether
this vPASID reporting open can be handled separately from the
pasid attach/detach series so we can move the ball and merge
something already in agreement. anyway it's just a read-only cap so
won't affect how VFIO/IOMMUFD handles the pasid related requests.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-28  6:19                                       ` Tian, Kevin
@ 2024-04-29  7:43                                         ` Yi Liu
  2024-04-29 17:15                                         ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Yi Liu @ 2024-04-29  7:43 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Jason Gunthorpe
  Cc: joro, robin.murphy, eric.auger, nicolinc, kvm, chao.p.peng,
	iommu, baolu.lu, Duan, Zhenzhong, Pan, Jacob jun,
	Cédric Le Goater

On 2024/4/28 14:19, Tian, Kevin wrote:
>> From: Alex Williamson <alex.williamson@redhat.com>
>> Sent: Saturday, April 27, 2024 4:14 AM
>>
>> On Fri, 26 Apr 2024 11:11:17 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
>>>
>>>> This is kind of an absurd example to portray as a ubiquitous problem.
>>>> Typically the config space layout is a reflection of hardware whether
>>>> the device supports migration or not.
>>>
>>> Er, all our HW has FW constructed config space. It changes with FW
>>> upgrades. We change it during the life of the product. This has to be
>>> considered..
>>
>> So as I understand it, the concern is that you have firmware that
>> supports migration, but it also openly hostile to the fundamental
>> aspects of exposing a stable device ABI in support of migration.
>>
>>>> If a driver were to insert a
>>>> virtual capability, then yes it would want to be consistent about it if
>>>> it also cares about migration.  If the driver needs to change the
>>>> location of a virtual capability, problems will arise, but that's also
>>>> not something that every driver needs to do.
>>>
>>> Well, mlx5 has to cope with this. It supports so many devices with so
>>> many config space layouts :( I don't know if we can just hard wire an
>>> offset to stick in a PASID cap and expect that to work...
> 
> Are those config space layout differences usually also coming with
> mmio-side interface change? If yes there are more to handle for
> running V1 instance on V2 device and it'd make sense to manage
> everything about compatibility in one place.
> 
> If we pursue the direction deciding the vconfig layout in VMM, does
> it imply that anything related to mmio layout would also be put in
> VMM too?
> 
> e.g. it's not unusual to see a device mmio layout as:
> 
> 	REG_BASE:  base addr of a register block in a BAR
> 	REG_FEAT1_OFF: feature1 regs offset in the register block
> 	REG_FEAT2_OFF: feature2 regs offset in the register block
> 	...

Should we also consider the possibility of vBAR address difference between
the src and dst? It should be impossible that the guest driver mmaps vBAR
again if it has already been loaded.

> Driver accesses registers according to those read-only offsets.
> 
> A FW upgrade may lead to change of offsets but functions stay the
> same. An instance created on an old version can be migrated to
> the new version as long as accesses to old offsets are trapped
> and routed to the new offsets.
> 
> Do we envision things like above in the variant driver or in VMM?
> 
>>>>
>>>> What are you actually proposing?
>>>
>>> Okay, what I'm thinking about is a text file that describes the vPCI
>>> function configuration space to create. The community will standardize
>>> this and VMMs will have to implement to get PASID/etc. Maybe the
>>> community will provide a BSD licensed library to do this job.
>>>
>>> The text file allows the operator to specify exactly the configuration
>>> space the VFIO function should have. It would not be derived
>>> automatically from physical. AFAIK qemu does not have this capability
>>> currently.
>>>
>>> This reflects my observation and discussions around the live migration
>>> standardization. I belive we are fast reaching a point where this is
>>> required.
>>>
>>> Consider standards based migration between wildly different
>>> devices. The devices will not standardize their physical config space,
>>> but an operator could generate a consistent vPCI config space that
>>> works with all the devices in their fleet.
> 
> It's hard to believe that 'wildly different' devices only have difference
> in the layout of vPCI config space.
> 
>>>
>>> Consider the usual working model of the large operators - they define
>>> instance types with some regularity. But an instance type is fixed in
>>> concrete once it is specified, things like the vPCI config space are
>>> fixed.
>>>
>>> Running Instance A on newer hardware with a changed physical config
>>> space should continue to present Instance A's vPCI config layout
>>> regardless. Ie Instance A might not support PASID but Instance B can
>>> run on newer HW that does. The config space layout depends on the
>>> requested Instance Type, not the physical layout.
>>>
>>> The auto-configuration of the config layout from physical is a nice
>>> feature and is excellent for development/small scale, but it shouldn't
>>> be the only way to work.
>>>
>>> So - if we accept that text file configuration should be something the
>>> VMM supports then let's reconsider how to solve the PASID problem.
>>>
>>> I'd say the way to solve it should be via a text file specifying a
>>> full config space layout that includes the PASID cap. From the VMM
>>> perspective this works fine, and it ports to every VMM directly via
>>> processing the text file.
>>>
>>> The autoconfiguration use case can be done by making a tool build the
>>> text file by deriving it from physical, much like today. The single
>>> instance of that tool could have device specific knowledge to avoid
>>> quirks. This way the smarts can still be shared by all the VMMs
>>> without going into the kernel. Special devices with hidden config
>>> space could get special quirks or special reference text files into
>>> the tool repo.
>>>
>>> Serious operators doing production SRIOV/etc would negotiate the text
>>> file with the HW vendors when they define their Instance Type. Ideally
>>> these reference text files would be contributed to the tool repo
>>> above. I think there would be some nice idea to define fully open
>>> source Instance Types that include VFIO devices too.
>>
>> Regarding "if we accept that text file configuration should be
>> something the VMM supports", I'm not on board with this yet, so
>> applying it to PASID discussion seems premature.
>>
>> We've developed variant drivers specifically to host the device specific
>> aspects of migration support.  The requirement of a consistent config
>> space layout is a problem that only exists relative to migration.  This
>> is an issue that I would have considered the responsibility of the
>> variant driver, which would likely expect a consistent interface from
>> the hardware/firmware.  Why does hostile firmware suddenly make it the
>> VMM's problem to provide a consistent ABI to the config space of the
>> device rather than the variant driver?
>>
>> Obviously config maps are something that a VMM could do, but it also
>> seems to impose a non-trivial burden that every VMM requires an
>> implementation of a config space map and integration for each device
>> rather than simply expecting the exposed config space of the device to
>> be part of the migration ABI.  Also this solution specifically only
>> addresses config space compatibility without considering the more
>> generic issue that a variant driver can expose different device
>> personas.  A versioned persona and config space virtualization in the
>> variant driver is a much more flexible solution.  Thanks,
>>
> 
> and looks this community lacks of a clear criteria on what burden
> should be put in the kernel vs. in the VMM.
> 
> e.g. in earlier nvgrace-gpu discussion a major open was whether
> the PCI bar emulation should be done by the variant driver or
> by the VMM (with variant driver providing a device feature).
> 
> It ends up to be in the variant driver with one major argument
> that doing so avoids the burden in various VMMs.
> 
> But now seems the 'text-file' proposal heads the opposite direction?
> 
> btw while this discussion may continue some time, I wonder whether
> this vPASID reporting open can be handled separately from the
> pasid attach/detach series so we can move the ball and merge
> something already in agreement. anyway it's just a read-only cap so
> won't affect how VFIO/IOMMUFD handles the pasid related requests.
> 
> Thanks
> Kevin
> 

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-28  6:19                                       ` Tian, Kevin
  2024-04-29  7:43                                         ` Yi Liu
@ 2024-04-29 17:15                                         ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-29 17:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Liu, Yi L, joro, robin.murphy, eric.auger,
	nicolinc, kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong,
	Pan, Jacob jun, Cédric Le Goater

On Sun, Apr 28, 2024 at 06:19:29AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 27, 2024 4:14 AM
> > 
> > On Fri, 26 Apr 2024 11:11:17 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > >
> > > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > > Typically the config space layout is a reflection of hardware whether
> > > > the device supports migration or not.
> > >
> > > Er, all our HW has FW constructed config space. It changes with FW
> > > upgrades. We change it during the life of the product. This has to be
> > > considered..
> > 
> > So as I understand it, the concern is that you have firmware that
> > supports migration, but it also openly hostile to the fundamental
> > aspects of exposing a stable device ABI in support of migration.
> > 
> > > > If a driver were to insert a
> > > > virtual capability, then yes it would want to be consistent about it if
> > > > it also cares about migration.  If the driver needs to change the
> > > > location of a virtual capability, problems will arise, but that's also
> > > > not something that every driver needs to do.
> > >
> > > Well, mlx5 has to cope with this. It supports so many devices with so
> > > many config space layouts :( I don't know if we can just hard wire an
> > > offset to stick in a PASID cap and expect that to work...
> 
> Are those config space layout differences usually also coming with
> mmio-side interface change? 

Not really

> If yes there are more to handle for
> running V1 instance on V2 device and it'd make sense to manage
> everything about compatibility in one place.

It's complicated :|

The config space layout can't change once the device is discovered by
the OS and I have a feeling it can't differ between VFs in a SRIOV.

So even if we did say that a device HW/FW will change it's personality
on the fly, I'm not sure we actually can and still conform to the
PCI specs?

> If we pursue the direction deciding the vconfig layout in VMM, does
> it imply that anything related to mmio layout would also be put in
> VMM too?

I'd guess no, MMIO doesn't have the OS and standards entanglements. If
the FW can reprogram the MMIO layout then when it loads the migration
state, or provisions the device, it should put the MMIO to the right
configuration.

> Do we envision things like above in the variant driver or in VMM?

In the device itself. I think it would be an extreme case for someone
to make a device that had a different MMIO layout that could be
backwards compatible to an old layout and NOT provide a way for the
device HW to go back to the old layout directly. If so that device
would have to mediate MMIO in the VMM or kernel. Depends on the
complexity I suppose where to do it.

> > > Consider standards based migration between wildly different
> > > devices. The devices will not standardize their physical config space,
> > > but an operator could generate a consistent vPCI config space that
> > > works with all the devices in their fleet.
> 
> It's hard to believe that 'wildly different' devices only have difference
> in the layout of vPCI config space. 

Well, I mean a device from vendor A and another device from vendor
B. They don't have anything in common except implementing the
standard.

The standard perscribes the MMIO layout. Both devices have a way to
accept a migration stream defined by a standard. The MMIO will be
adjusted as the standard requires after loading the migration stream.

Config space primiarily remains a problem in this imagined world.

My best view is that it needs to be solved by config space synthesis
in the hypervisor and not via HW/FW on the physical device.

> and looks this community lacks of a clear criteria on what burden
> should be put in the kernel vs. in the VMM.

Right, we've always discussed this. I've fairly consistently wanted to
push things to userspace, principally for security.

> e.g. in earlier nvgrace-gpu discussion a major open was whether
> the PCI bar emulation should be done by the variant driver or
> by the VMM (with variant driver providing a device feature).

Sort of, this was actually also about config space synthesis. If we
had this kind of scheme then perhaps the argument would be different.

> It ends up to be in the variant driver with one major argument
> that doing so avoids the burden in various VMMs.

Yes, that has always been the argument that has won out.

> btw while this discussion may continue some time, I wonder whether
> this vPASID reporting open can be handled separately from the
> pasid attach/detach series so we can move the ball and merge
> something already in agreement. anyway it's just a read-only cap so
> won't affect how VFIO/IOMMUFD handles the pasid related requests.

Yes, this makes sense to me. Ultimately we all agree the config space
will have to be synthesized.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 0/4] vfio-pci support pasid attach/detach
  2024-04-26 20:13                                     ` Alex Williamson
  2024-04-28  6:19                                       ` Tian, Kevin
@ 2024-04-29 17:44                                       ` Jason Gunthorpe
  1 sibling, 0 replies; 72+ messages in thread
From: Jason Gunthorpe @ 2024-04-29 17:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Liu, Yi L, joro, robin.murphy, eric.auger, nicolinc,
	kvm, chao.p.peng, iommu, baolu.lu, Duan, Zhenzhong, Pan,
	Jacob jun, Cédric Le Goater

On Fri, Apr 26, 2024 at 02:13:54PM -0600, Alex Williamson wrote:
> On Fri, 26 Apr 2024 11:11:17 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 24, 2024 at 02:13:49PM -0600, Alex Williamson wrote:
> > 
> > > This is kind of an absurd example to portray as a ubiquitous problem.
> > > Typically the config space layout is a reflection of hardware whether
> > > the device supports migration or not.  
> > 
> > Er, all our HW has FW constructed config space. It changes with FW
> > upgrades. We change it during the life of the product. This has to be
> > considered..
> 
> So as I understand it, the concern is that you have firmware that
> supports migration, but it also openly hostile to the fundamental
> aspects of exposing a stable device ABI in support of migration.

Well, that makes it sound rude, but yes that is part of it.

mlx5 is tremendously FW defined. The FW can only cope with migration
in some limited cases today. Making that compatability bigger is
ongoing work.

Config space is one of the areas that has not been addressed.
Currently things are such that the FW won't support migration in
combinations that have different physical config space - so it is not
a problem.

But, in principle, it is an issue. AFAIK, the only complete solution
is for the hypervisor to fully synthesize a stable config space.

So, if we keep this in the kernel then I'd imagine the kernel will
need to grow some shared infrastructure to fully synthezise the config
space - not text file based, but basically the same as what I
described for the VMM.

> > But that won't be true if the kernel is making decisions. The config
> > space layout depends now on the kernel driver version too.
> 
> But in the cases where we support migration there's a device specific
> variant driver that supports that migration.  It's the job of that
> variant driver to not only export and import the device state, but also
> to provide a consistent ABI to the user, which includes the config
> space layout. 

Yes, we could do that, but I'm not sure how it will work in all cases.

> I don't understand why we'd say the device programming ABI itself
> falls within the purview of the device/variant driver, but PCI
> config space is defined by device specific code at a higher level.

The "device programming ABI" doesn't contain any *policy*. The layout
of the config space is like 50% policy. Especially when we start to
talk about standards defined migration. The standards will set the
"device programming ABI" and maybe even specify the migration
stream. They will not, and arguably can not, specify the config space.

Config space layout is substantially policy of the instance type. Even
little things like the vendor IDs can be meaningfully replaced in VMs.

> Regarding "if we accept that text file configuration should be
> something the VMM supports", I'm not on board with this yet, so
> applying it to PASID discussion seems premature.

Sure, I'm just explaining a way this could all fit together.

> We've developed variant drivers specifically to host the device specific
> aspects of migration support.  The requirement of a consistent config
> space layout is a problem that only exists relative to migration.  

Well, I wouldn't go quite so far. Arguably even non-migritable
instance types may want to adjust thier config space. Eg if I'm using
a DPU and I get a NVMe/Virtio PCI function I may want to scrub out
details from the config space to make it more general. Even without
migration.

This already happens today in places like VDPA which completely
replace the underlying config space in some cases.

I see it as a difference from a world of highly constrained "instance
types" and a more ad hoc world.

> is an issue that I would have considered the responsibility of the
> variant driver, which would likely expect a consistent interface from
> the hardware/firmware.  Why does hostile firmware suddenly make it the
> VMM's problem to provide a consistent ABI to the config space of the
> device rather than the variant driver?

It is not "hostile firmware"! It accepting that a significant aspect
of the config layout is actually policy.

Plus the standards limitations that mean we can't change the config
space on the fly make it pretty much impossible for the device to
acutally do anything to help here. Software must fix the config space.

> Obviously config maps are something that a VMM could do, but it also
> seems to impose a non-trivial burden that every VMM requires an
> implementation of a config space map and integration for each device
> rather than simply expecting the exposed config space of the device to
> be part of the migration ABI.  

Well, the flip is true to, it is alot of burden on every variant
device driver implement and on the kernel in general to manage config
space policy on behalf of the VMM.

My point is if the VMM is already going to be forced to manage config
space policy for other good reasons, are we sure we want to put a
bunch of stuff in the kernel that sometimes won't be used?

> Also this solution specifically only addresses config space
> compatibility without considering the more generic issue that a
> variant driver can expose different device personas.  A versioned
> persona and config space virtualization in the variant driver is a
> much more flexible solution.

It is addressed, the different personas would have their own text file
maps. The target VMM would have to load the right map. Shared common
code across all the variant drivers.

Jason

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2024-04-29 17:44 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-12  8:21 [PATCH v2 0/4] vfio-pci support pasid attach/detach Yi Liu
2024-04-12  8:21 ` [PATCH v2 1/4] ida: Add ida_get_lowest() Yi Liu
2024-04-16 16:03   ` Alex Williamson
2024-04-18  7:02     ` Yi Liu
2024-04-18 16:23       ` Alex Williamson
2024-04-18 17:12         ` Jason Gunthorpe
2024-04-19 13:43           ` Yi Liu
2024-04-19 13:55             ` Alex Williamson
2024-04-19 14:00               ` Jason Gunthorpe
2024-04-23  7:19                 ` Yi Liu
2024-04-19 13:40         ` Yi Liu
2024-04-12  8:21 ` [PATCH v2 2/4] vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices Yi Liu
2024-04-16  9:01   ` Tian, Kevin
2024-04-16  9:24     ` Yi Liu
2024-04-16  9:47       ` Tian, Kevin
2024-04-18  7:04         ` Yi Liu
2024-04-23 12:43   ` Jason Gunthorpe
2024-04-24  0:33     ` Tian, Kevin
2024-04-24  4:48     ` Yi Liu
2024-04-12  8:21 ` [PATCH v2 3/4] vfio: Add VFIO_DEVICE_PASID_[AT|DE]TACH_IOMMUFD_PT Yi Liu
2024-04-16  9:13   ` Tian, Kevin
2024-04-16  9:36     ` Yi Liu
2024-04-23 12:45   ` Jason Gunthorpe
2024-04-12  8:21 ` [PATCH v2 4/4] vfio: Report PASID capability via VFIO_DEVICE_FEATURE ioctl Yi Liu
2024-04-16  9:40   ` Tian, Kevin
2024-04-16 17:57   ` Alex Williamson
2024-04-17  7:09     ` Tian, Kevin
2024-04-17 20:25       ` Alex Williamson
2024-04-18  0:21         ` Tian, Kevin
2024-04-18  8:23           ` Yi Liu
2024-04-18 16:34           ` Alex Williamson
2024-04-23 12:39   ` Jason Gunthorpe
2024-04-24  0:24     ` Tian, Kevin
2024-04-24 13:59       ` Jason Gunthorpe
2024-04-16  8:38 ` [PATCH v2 0/4] vfio-pci support pasid attach/detach Tian, Kevin
2024-04-16 17:50   ` Jason Gunthorpe
2024-04-17  7:16     ` Tian, Kevin
2024-04-17 12:20       ` Jason Gunthorpe
2024-04-17 23:02         ` Alex Williamson
2024-04-18  0:06           ` Tian, Kevin
2024-04-18  9:03             ` Yi Liu
2024-04-18 20:37               ` Alex Williamson
2024-04-19  5:52                 ` Tian, Kevin
2024-04-19 16:35                   ` Alex Williamson
2024-04-23  7:43                     ` Tian, Kevin
2024-04-23 12:01                       ` Jason Gunthorpe
2024-04-23 23:47                         ` Tian, Kevin
2024-04-24  0:12                           ` Jason Gunthorpe
2024-04-24  2:57                             ` Tian, Kevin
2024-04-24 12:29                               ` Baolu Lu
2024-04-24 14:04                               ` Jason Gunthorpe
2024-04-24  5:19                             ` Tian, Kevin
2024-04-24 14:15                               ` Jason Gunthorpe
2024-04-24 18:38                                 ` Alex Williamson
2024-04-24 18:45                                   ` Jason Gunthorpe
2024-04-24 18:24                             ` Alex Williamson
2024-04-24 18:36                               ` Jason Gunthorpe
2024-04-24 20:13                                 ` Alex Williamson
2024-04-26 14:11                                   ` Jason Gunthorpe
2024-04-26 20:13                                     ` Alex Williamson
2024-04-28  6:19                                       ` Tian, Kevin
2024-04-29  7:43                                         ` Yi Liu
2024-04-29 17:15                                         ` Jason Gunthorpe
2024-04-29 17:44                                       ` Jason Gunthorpe
2024-04-27  5:05                                     ` Christoph Hellwig
2024-04-25  9:26                               ` Yi Liu
2024-04-25 12:58                                 ` Alex Williamson
2024-04-26  9:01                                   ` Yi Liu
2024-04-19 13:59                 ` Jason Gunthorpe
2024-04-23  7:58                   ` Yi Liu
2024-04-23 12:05                     ` Jason Gunthorpe
2024-04-19 13:34           ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).