intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset
@ 2023-04-01 14:44 Yi Liu
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
                   ` (12 more replies)
  0 siblings, 13 replies; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

VFIO_DEVICE_PCI_HOT_RESET requires the user to pass an array of group fds
to prove that it owns all devices affected by resetting the calling
device. This series introduces several extensions that align the ownership
check with iommufd and the upcoming vfio device cdev support.

First, resetting an unopened device is always safe given that nobody is
using it. So relax the check so that such devices need not be covered by
the group fd array. [1]

When iommufd is used, we can simply verify that all affected devices are
bound to the same iommufd, so there is no need for the user to provide
extra fd information. This is enabled by the user passing a zero-length
fd array and, moving forward, this should be the preferred way to do hot
reset. [2]

However, the iommufd method has difficulty working with noiommu devices,
since those devices don't have a valid iommufd. The exception is a noiommu
device in a singleton dev_set, where no ownership check is required. [3]

For noiommu backward compatibility, a third method is introduced that
allows the user to pass an array of device fds to prove ownership. [4]
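
As a rough userspace illustration of how the fd array and its zero-length
form look to a caller, the sketch below builds the ioctl argument for
either case. It assumes the struct vfio_pci_hot_reset layout from
<linux/vfio.h>; the helper names hot_reset_req_alloc() and
hot_reset_issue() are made up for this example, and the zero-length form
only carries the meaning described above on a kernel with this series
applied.

```c
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: allocate a VFIO_DEVICE_PCI_HOT_RESET argument that
 * carries 'count' fds. count == 0 yields the zero-length form, which this
 * series interprets as "check ownership via the bound iommufd".
 */
static struct vfio_pci_hot_reset *hot_reset_req_alloc(const int *fds,
						      unsigned int count)
{
	size_t sz = sizeof(struct vfio_pci_hot_reset) + count * sizeof(__s32);
	struct vfio_pci_hot_reset *req = calloc(1, sz);
	unsigned int i;

	if (!req)
		return NULL;

	req->argsz = sz;
	req->flags = 0;		/* the ioctl rejects non-zero flags */
	req->count = count;
	for (i = 0; i < count; i++)
		req->group_fds[i] = fds[i];
	return req;
}

/* Hypothetical helper: issue the reset on an already opened device fd. */
static int hot_reset_issue(int device_fd, struct vfio_pci_hot_reset *req)
{
	return ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, req);
}
```

A caller would pass its group fds (or, with this series, device fds) to
hot_reset_req_alloc(), or pass count 0 when all affected devices are bound
to one iommufd, and free() the request after hot_reset_issue() returns.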

As suggested by Jason [5], this series introduces the above to vfio PCI
hot reset. Per the discussion in [6] [7], at the end of this series
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO is extended to report the devid for
devices opened as cdev. This is going to support the device fd passing
usage.

The new hot reset method and the updated _INFO ioctl are tested with two
test commits in the qemu branch below:

https://github.com/yiliu1765/qemu/commits/iommufd_rfcv3
(testing requires a cdev kernel)

[1] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/
[2] https://lore.kernel.org/kvm/Y%2FZOOClu8nXy2toX@nvidia.com/#t
[3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/
[4] https://lore.kernel.org/kvm/DS0PR11MB7529BE88460582BD599DC1F7C3B19@DS0PR11MB7529.namprd11.prod.outlook.com/#t
[5] https://lore.kernel.org/kvm/ZAcvzvhkt9QhCmdi@nvidia.com/
[6] https://lore.kernel.org/kvm/ZBoYgNq60eDpV9Un@nvidia.com/
[7] https://lore.kernel.org/kvm/20230327132619.5ab15440.alex.williamson@redhat.com/

Change log:

v3:
 - Remove the new _INFO ioctl of v2, extend the existing _INFO ioctl to
   report devid (Alex)
 - Add r-b from Jason
 - Add t-b from Terrence Xu and Yanting Jiang (mainly regression test)

v2: https://lore.kernel.org/kvm/20230327093458.44939-1-yi.l.liu@intel.com/
 - Split patch 03 of v1 into 03, 04 and 05 of v2 (Jason)
 - Add r-b from Kevin and Jason
 - Add patch 10 to introduce a new _INFO ioctl for device fd passing
   in the cdev path (Jason, Alex)

v1: https://lore.kernel.org/kvm/20230316124156.12064-1-yi.l.liu@intel.com/

Regards,
	Yi Liu

Yi Liu (12):
  vfio/pci: Update comment around group_fd get in
    vfio_pci_ioctl_pci_hot_reset()
  vfio/pci: Only check ownership of opened devices in hot reset
  vfio/pci: Move the existing hot reset logic to be a helper
  vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for
    vfio_device
  vfio/pci: Allow passing zero-length fd array in
    VFIO_DEVICE_PCI_HOT_RESET
  vfio: Refine vfio file kAPIs for vfio PCI hot reset
  vfio: Accpet device file from vfio PCI hot reset path
  vfio/pci: Renaming for accepting device fd in hot reset path
  vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  vfio: Mark cdev usage in vfio_device
  iommufd: Define IOMMUFD_INVALID_ID in uapi
  vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO

 drivers/iommu/iommufd/device.c   |  12 ++
 drivers/vfio/group.c             |  32 +++--
 drivers/vfio/iommufd.c           |  14 +++
 drivers/vfio/pci/vfio_pci_core.c | 204 ++++++++++++++++++++++---------
 drivers/vfio/vfio.h              |   2 +
 drivers/vfio/vfio_main.c         |  44 +++++++
 include/linux/iommufd.h          |   3 +
 include/linux/vfio.h             |  21 ++++
 include/uapi/linux/iommufd.h     |   3 +
 include/uapi/linux/vfio.h        |  42 ++++++-
 10 files changed, 301 insertions(+), 76 deletions(-)

-- 
2.34.1



* [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset()
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 13:59   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset Yi Liu
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

This better matches what the code does.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..65bbef562268 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1308,9 +1308,8 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	/*
-	 * For each group_fd, get the group through the vfio external user
-	 * interface and store the group and iommu ID.  This ensures the group
-	 * is held across the reset.
+	 * Get the group file for each fd to ensure the group held across
+	 * the reset
 	 */
 	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
 		struct file *file = fget(group_fds[file_idx]);
-- 
2.34.1



* [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 13:59   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

If the affected device is not opened by any user, it's safe to reset it
given it's not in use.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 14 +++++++++++---
 include/uapi/linux/vfio.h        |  8 ++++++++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 65bbef562268..5d745c9abf05 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2429,10 +2429,18 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
-		 * Test whether all the affected devices are contained by the
-		 * set of groups provided by the user.
+		 * Test whether all the affected devices can be reset by the
+		 * user.
+		 *
+		 * Resetting an unused device (not opened) is safe, because
+		 * dev_set->lock is held in hot reset path so this device
+		 * cannot race being opened by another user simultaneously.
+		 *
+		 * Otherwise all opened devices in the dev_set must be
+		 * contained by the set of groups provided by the user.
 		 */
-		if (!vfio_dev_in_groups(cur_vma, groups)) {
+		if (cur_vma->vdev.open_count &&
+		    !vfio_dev_in_groups(cur_vma, groups)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0552e8dcf0cb..f96e5689cffc 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)
  *
+ * Userspace requests hot reset for the devices it uses.  Due to the
+ * underlying topology, multiple devices can be affected in the reset
+ * while some might be opened by another user.  To avoid interference
+ * the calling user must ensure all affected devices, if opened, are
+ * owned by itself.
+ *
+ * The ownership is proved by an array of group fds.
+ *
  * Return: 0 on success, -errno on failure.
  */
 struct vfio_pci_hot_reset {
-- 
2.34.1



* [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 13:59   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

This prepares for adding another method for hot reset. The main hot reset
logic is moved to vfio_pci_ioctl_pci_hot_reset_groups().

No functional change is intended.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 56 +++++++++++++++++++-------------
 1 file changed, 33 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 5d745c9abf05..3696b8e58445 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1255,29 +1255,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	return ret;
 }
 
-static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
-					struct vfio_pci_hot_reset __user *arg)
+static int
+vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
+				    struct vfio_pci_hot_reset *hdr,
+				    bool slot,
+				    struct vfio_pci_hot_reset __user *arg)
 {
-	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
-	struct vfio_pci_hot_reset hdr;
 	int32_t *group_fds;
 	struct file **files;
 	struct vfio_pci_group_info info;
-	bool slot = false;
 	int file_idx, count = 0, ret = 0;
 
-	if (copy_from_user(&hdr, arg, minsz))
-		return -EFAULT;
-
-	if (hdr.argsz < minsz || hdr.flags)
-		return -EINVAL;
-
-	/* Can we do a slot or bus reset or neither? */
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return -ENODEV;
-
 	/*
 	 * We can't let userspace give us an arbitrarily large buffer to copy,
 	 * so verify how many we think there could be.  Note groups can have
@@ -1289,11 +1277,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return ret;
 
 	/* Somewhere between 1 and count is OK */
-	if (!hdr.count || hdr.count > count)
+	if (!hdr->count || hdr->count > count)
 		return -EINVAL;
 
-	group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
-	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
+	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
+	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
 	if (!group_fds || !files) {
 		kfree(group_fds);
 		kfree(files);
@@ -1301,7 +1289,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	if (copy_from_user(group_fds, arg->group_fds,
-			   hdr.count * sizeof(*group_fds))) {
+			   hdr->count * sizeof(*group_fds))) {
 		kfree(group_fds);
 		kfree(files);
 		return -EFAULT;
@@ -1311,7 +1299,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	 * Get the group file for each fd to ensure the group held across
 	 * the reset
 	 */
-	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
+	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
 		struct file *file = fget(group_fds[file_idx]);
 
 		if (!file) {
@@ -1335,7 +1323,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	if (ret)
 		goto hot_reset_release;
 
-	info.count = hdr.count;
+	info.count = hdr->count;
 	info.files = files;
 
 	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
@@ -1348,6 +1336,28 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
+	struct vfio_pci_hot_reset hdr;
+	bool slot = false;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz || hdr.flags)
+		return -EINVAL;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
+}
+
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				    struct vfio_device_ioeventfd __user *arg)
 {
-- 
2.34.1



* [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (2 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 15:28   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

This is needed by the vfio-pci driver to report affected devices in the
hot reset for a given device.
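
One detail of the diff below is that vfio_iommufd_physical_devid() only
writes *id when the device has an iommufd_device, so a caller presumably
has to initialise the value first. The following is a userspace model of
that convention, not kernel code: the struct definitions are stand-ins
invented for the sketch, and MODEL_INVALID_ID is an assumed placeholder
value, not something taken from this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel types; only the fields used here are modelled. */
struct model_iommufd_device {
	uint32_t id;
};

struct model_vfio_device {
	struct model_iommufd_device *iommufd_device;
};

#define MODEL_INVALID_ID 0u	/* assumption: placeholder for "no devid" */

/* Mirrors the helper's shape: leave *id untouched if nothing is bound. */
static void model_physical_devid(const struct model_vfio_device *vdev,
				 uint32_t *id)
{
	if (vdev->iommufd_device)
		*id = vdev->iommufd_device->id;
}

/* Caller pattern: initialise first, then let the helper overwrite. */
static uint32_t model_report_devid(const struct model_vfio_device *vdev)
{
	uint32_t id = MODEL_INVALID_ID;

	model_physical_devid(vdev, &id);
	return id;
}
```

The out-parameter form keeps the !CONFIG_IOMMUFD stub trivial (an empty
inline), at the cost of relying on callers to pick a sensible default.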

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/iommu/iommufd/device.c | 12 ++++++++++++
 drivers/vfio/iommufd.c         | 14 ++++++++++++++
 include/linux/iommufd.h        |  3 +++
 include/linux/vfio.h           | 13 +++++++++++++
 4 files changed, 42 insertions(+)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 25115d401d8f..04a57aa1ae2c 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -131,6 +131,18 @@ void iommufd_device_unbind(struct iommufd_device *idev)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
 
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
+{
+	return idev->ictx;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
+
+u32 iommufd_device_to_id(struct iommufd_device *idev)
+{
+	return idev->obj.id;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, IOMMUFD);
+
 static int iommufd_device_setup_msi(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt,
 				    phys_addr_t sw_msi_start)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 88b00c501015..809f2dd73b9e 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -66,6 +66,20 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
 		vdev->ops->unbind_iommufd(vdev);
 }
 
+struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev)
+{
+	if (!vdev->iommufd_device)
+		return NULL;
+	return iommufd_device_to_ictx(vdev->iommufd_device);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ictx);
+
+void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
+{
+	if (vdev->iommufd_device)
+		*id = iommufd_device_to_id(vdev->iommufd_device);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);
 /*
  * The physical standard ops mean that the iommufd_device is bound to the
  * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 1129a36a74c4..ac96df406833 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -24,6 +24,9 @@ void iommufd_device_unbind(struct iommufd_device *idev);
 int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
 void iommufd_device_detach(struct iommufd_device *idev);
 
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
+u32 iommufd_device_to_id(struct iommufd_device *idev);
+
 struct iommufd_access_ops {
 	u8 needs_pin_pages : 1;
 	void (*unmap)(void *data, unsigned long iova, unsigned long length);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 3188d8a374bd..97a1174b922f 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -113,6 +113,8 @@ struct vfio_device_ops {
 };
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
+struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev);
+void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id);
 int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
@@ -122,6 +124,17 @@ int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
 int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 #else
+static inline struct iommufd_ctx *
+vfio_iommufd_physical_ictx(struct vfio_device *vdev)
+{
+	return NULL;
+}
+
+static inline void
+vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
+{
+}
+
 #define vfio_iommufd_physical_bind                                      \
 	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
 		  u32 *out_device_id)) NULL)
-- 
2.34.1



* [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (3 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 16:54   ` Eric Auger
  2023-04-04 20:18   ` Alex Williamson
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset Yi Liu
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

This is an alternative method for the ownership check when iommufd is
used. In this case all opened devices in the affected dev_set are verified
to be bound to the same valid iommufd to allow the reset. It's simpler
and faster, as the user does not need to pass a set of fds and the kernel
does not need to search for the device within the given fds.

A device in noiommu mode doesn't have a valid iommufd, so this method
should not be used in a dev_set which contains multiple devices where one
of them is in noiommu mode. The only allowed noiommu scenario is that the
calling device is noiommu and it's in a singleton dev_set.
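
To make the combined check easier to follow, here is a small userspace
model of the per-device condition that vfio_pci_dev_set_hot_reset() ends
up with after this patch. It is only a sketch: struct model_dev and the
model_* functions are invented for illustration, and the group lookup is
reduced to a boolean.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for one device in a dev_set (not a kernel type). */
struct model_dev {
	unsigned int open_count;	/* 0 means nobody has it open */
	bool in_user_groups;		/* covered by the passed group fds */
	const void *iommufd;		/* bound iommufd, NULL for noiommu */
};

/* Mirrors vfio_dev_in_iommufd_ctx(): a NULL iommufd never matches. */
static bool model_in_iommufd_ctx(const struct model_dev *dev,
				 const void *caller_iommufd)
{
	return dev->iommufd && dev->iommufd == caller_iommufd;
}

/* Returns true if 'dev' does not block a hot reset of the whole set. */
static bool model_dev_resettable(const struct model_dev *dev,
				 const void *caller_iommufd,
				 unsigned int device_count)
{
	if (!dev->open_count)
		return true;		/* unopened: always safe */
	if (dev->in_user_groups)
		return true;		/* proved via group fd array */
	if (model_in_iommufd_ctx(dev, caller_iommufd))
		return true;		/* proved via shared iommufd */
	return device_count <= 1;	/* singleton dev_set fallback */
}
```

The reset proceeds only if every device in the dev_set passes this test.
Note that two noiommu devices never match each other through the iommufd
path, because a NULL iommufd is treated as "not bound" rather than as a
shared value.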

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++++++++-----
 include/uapi/linux/vfio.h        |  9 ++++++-
 2 files changed, 44 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 3696b8e58445..b68fcba67a4b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 struct vfio_pci_group_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups);
+				      struct vfio_pci_group_info *groups,
+				      struct iommufd_ctx *iommufd_ctx);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1277,7 +1278,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 		return ret;
 
 	/* Somewhere between 1 and count is OK */
-	if (!hdr->count || hdr->count > count)
+	if (hdr->count > count)
 		return -EINVAL;
 
 	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
@@ -1326,7 +1327,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 	info.count = hdr->count;
 	info.files = files;
 
-	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
+	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
 
 hot_reset_release:
 	for (file_idx--; file_idx >= 0; file_idx--)
@@ -1341,6 +1342,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 {
 	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
 	struct vfio_pci_hot_reset hdr;
+	struct iommufd_ctx *iommufd;
 	bool slot = false;
 
 	if (copy_from_user(&hdr, arg, minsz))
@@ -1355,7 +1357,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	else if (pci_probe_reset_bus(vdev->pdev->bus))
 		return -ENODEV;
 
-	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
+	if (hdr.count)
+		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
+
+	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+
+	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
 }
 
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
@@ -2327,6 +2334,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
 {
 	unsigned int i;
 
+	if (!groups)
+		return false;
+
 	for (i = 0; i < groups->count; i++)
 		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
 			return true;
@@ -2402,13 +2412,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
+static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
+				    struct iommufd_ctx *iommufd_ctx)
+{
+	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+
+	if (!iommufd)
+		return false;
+
+	return iommufd == iommufd_ctx;
+}
+
 /*
  * We need to get memory_lock for each device, but devices can share mmap_lock,
  * therefore we need to zap and hold the vma_lock for each device, and only then
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups)
+				      struct vfio_pci_group_info *groups,
+				      struct iommufd_ctx *iommufd_ctx)
 {
 	struct vfio_pci_core_device *cur_mem;
 	struct vfio_pci_core_device *cur_vma;
@@ -2448,9 +2470,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 *
 		 * Otherwise all opened devices in the dev_set must be
 		 * contained by the set of groups provided by the user.
+		 *
+		 * If user provides a zero-length array, then all the
+		 * opened devices must be bound to a same iommufd_ctx.
+		 *
+		 * If all above checks are failed, reset is allowed only if
+		 * the calling device is in a singleton dev_set.
 		 */
 		if (cur_vma->vdev.open_count &&
-		    !vfio_dev_in_groups(cur_vma, groups)) {
+		    !vfio_dev_in_groups(cur_vma, groups) &&
+		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
+		    (dev_set->device_count > 1)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index f96e5689cffc..17aa5d09db41 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -679,7 +679,14 @@ struct vfio_pci_hot_reset_info {
  * the calling user must ensure all affected devices, if opened, are
  * owned by itself.
  *
- * The ownership is proved by an array of group fds.
+ * The ownership can be proved by:
+ *   - An array of group fds
+ *   - A zero-length array
+ *
+ * In the last case all affected devices which are opened by this user
+ * must have been bound to a same iommufd. If the calling device is in
+ * noiommu mode (no valid iommufd) then it can be reset only if the reset
+ * doesn't affect other devices.
  *
  * Return: 0 on success, -errno on failure.
  */
-- 
2.34.1



* [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (4 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-05  8:27   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path Yi Liu
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

This prepares the vfio core to accept a vfio device file from the vfio PCI
hot reset path. vfio_file_is_group() is still kept for KVM usage.
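
The core of the refactoring is vfio_group_from_file(), which identifies a
group file by comparing its f_op pointer before trusting private_data.
Below is a userspace model of that pattern, written as a sketch with
invented model_* types rather than the real struct file and struct
vfio_group.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel structures (illustration only). */
struct model_fops {
	int unused;
};

struct model_group {
	int id;
};

struct model_file {
	const struct model_fops *f_op;
	void *private_data;
};

/* Plays the role of vfio_group_fops: its address is the type tag. */
static const struct model_fops model_group_fops = { 0 };

/* Only trust private_data once f_op proves this is a group file. */
static struct model_group *model_group_from_file(const struct model_file *file)
{
	if (file->f_op != &model_group_fops)
		return NULL;
	return file->private_data;
}

static bool model_file_is_group(const struct model_file *file)
{
	return model_group_from_file(file) != NULL;
}
```

Funnelling every lookup through one helper means a later patch can add a
second branch for device files without touching each caller.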

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/group.c             | 32 ++++++++++++++------------------
 drivers/vfio/pci/vfio_pci_core.c |  4 ++--
 drivers/vfio/vfio.h              |  2 ++
 drivers/vfio/vfio_main.c         | 29 +++++++++++++++++++++++++++++
 include/linux/vfio.h             |  1 +
 5 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 27d5ba7cf9dc..d0c95d033605 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -745,6 +745,15 @@ bool vfio_device_has_container(struct vfio_device *device)
 	return device->group->container;
 }
 
+struct vfio_group *vfio_group_from_file(struct file *file)
+{
+	struct vfio_group *group = file->private_data;
+
+	if (file->f_op != &vfio_group_fops)
+		return NULL;
+	return group;
+}
+
 /**
  * vfio_file_iommu_group - Return the struct iommu_group for the vfio group file
  * @file: VFIO group file
@@ -755,13 +764,13 @@ bool vfio_device_has_container(struct vfio_device *device)
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file)
 {
-	struct vfio_group *group = file->private_data;
+	struct vfio_group *group = vfio_group_from_file(file);
 	struct iommu_group *iommu_group = NULL;
 
 	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
 		return NULL;
 
-	if (!vfio_file_is_group(file))
+	if (!group)
 		return NULL;
 
 	mutex_lock(&group->group_lock);
@@ -775,12 +784,12 @@ struct iommu_group *vfio_file_iommu_group(struct file *file)
 EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
 
 /**
- * vfio_file_is_group - True if the file is usable with VFIO aPIS
+ * vfio_file_is_group - True if the file is a vfio group file
  * @file: VFIO group file
  */
 bool vfio_file_is_group(struct file *file)
 {
-	return file->f_op == &vfio_group_fops;
+	return vfio_group_from_file(file);
 }
 EXPORT_SYMBOL_GPL(vfio_file_is_group);
 
@@ -842,23 +851,10 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
 
-/**
- * vfio_file_has_dev - True if the VFIO file is a handle for device
- * @file: VFIO file to check
- * @device: Device that must be part of the file
- *
- * Returns true if given file has permission to manipulate the given device.
- */
-bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
+bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device)
 {
-	struct vfio_group *group = file->private_data;
-
-	if (!vfio_file_is_group(file))
-		return false;
-
 	return group == device->group;
 }
-EXPORT_SYMBOL_GPL(vfio_file_has_dev);
 
 static char *vfio_devnode(const struct device *dev, umode_t *mode)
 {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b68fcba67a4b..2a510b71edcb 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1308,8 +1308,8 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 			break;
 		}
 
-		/* Ensure the FD is a vfio group FD.*/
-		if (!vfio_file_is_group(file)) {
+		/* Ensure the FD is a vfio FD. vfio group or vfio device */
+		if (!vfio_file_is_valid(file)) {
 			fput(file);
 			ret = -EINVAL;
 			break;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 7b19c621e0e6..c0aeea24fbd6 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -84,6 +84,8 @@ void vfio_device_group_unregister(struct vfio_device *device);
 int vfio_device_group_use_iommu(struct vfio_device *device);
 void vfio_device_group_unuse_iommu(struct vfio_device *device);
 void vfio_device_group_close(struct vfio_device *device);
+struct vfio_group *vfio_group_from_file(struct file *file);
+bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device);
 bool vfio_device_has_container(struct vfio_device *device);
 int __init vfio_group_init(void);
 void vfio_group_cleanup(void);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 89497c933490..fe7446805afd 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1154,6 +1154,35 @@ const struct file_operations vfio_device_fops = {
 	.mmap		= vfio_device_fops_mmap,
 };
 
+/**
+ * vfio_file_is_valid - True if the file is valid vfio file
+ * @file: VFIO group file or VFIO device file
+ */
+bool vfio_file_is_valid(struct file *file)
+{
+	return vfio_group_from_file(file);
+}
+EXPORT_SYMBOL_GPL(vfio_file_is_valid);
+
+/**
+ * vfio_file_has_dev - True if the VFIO file is a handle for device
+ * @file: VFIO file to check
+ * @device: Device that must be part of the file
+ *
+ * Returns true if given file has permission to manipulate the given device.
+ */
+bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
+{
+	struct vfio_group *group;
+
+	group = vfio_group_from_file(file);
+	if (!group)
+		return false;
+
+	return vfio_group_has_dev(group, device);
+}
+EXPORT_SYMBOL_GPL(vfio_file_has_dev);
+
 /*
  * Sub-module support
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 97a1174b922f..f8fb9ab25188 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -258,6 +258,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file);
 bool vfio_file_is_group(struct file *file);
+bool vfio_file_is_valid(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
-- 
2.34.1



* [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (5 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 20:31   ` Alex Williamson
  2023-04-05  8:07   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in " Yi Liu
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

Extend both vfio_file_is_valid() and vfio_file_has_dev() to accept a
device file in the vfio PCI hot reset path.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/vfio_main.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index fe7446805afd..ebbb6b91a498 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1154,13 +1154,23 @@ const struct file_operations vfio_device_fops = {
 	.mmap		= vfio_device_fops_mmap,
 };
 
+static struct vfio_device *vfio_device_from_file(struct file *file)
+{
+	struct vfio_device *device = file->private_data;
+
+	if (file->f_op != &vfio_device_fops)
+		return NULL;
+	return device;
+}
+
 /**
  * vfio_file_is_valid - True if the file is valid vfio file
  * @file: VFIO group file or VFIO device file
  */
 bool vfio_file_is_valid(struct file *file)
 {
-	return vfio_group_from_file(file);
+	return vfio_group_from_file(file) ||
+	       vfio_device_from_file(file);
 }
 EXPORT_SYMBOL_GPL(vfio_file_is_valid);
 
@@ -1174,12 +1184,17 @@ EXPORT_SYMBOL_GPL(vfio_file_is_valid);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
 {
 	struct vfio_group *group;
+	struct vfio_device *vdev;
 
 	group = vfio_group_from_file(file);
-	if (!group)
-		return false;
+	if (group)
+		return vfio_group_has_dev(group, device);
+
+	vdev = vfio_device_from_file(file);
+	if (vdev)
+		return vdev == device;
 
-	return vfio_group_has_dev(group, device);
+	return false;
 }
 EXPORT_SYMBOL_GPL(vfio_file_has_dev);
 
-- 
2.34.1
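
The dispatch introduced by this patch can be modelled outside the kernel.
The sketch below is only an illustration: every model_* name is a
hypothetical stand-in for the kernel type or helper of similar name, and
it assumes that vfio_group_from_file() identifies a group file by
comparing f_op, in the same way vfio_device_from_file() does for a
device file.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the fields used here. */
struct model_fops { const char *name; };
struct model_group { int id; };
struct model_device { struct model_group *group; };
struct model_file {
	const struct model_fops *f_op;	/* identifies the kind of file */
	void *private_data;		/* group or device, depending on f_op */
};

static const struct model_fops model_group_fops = { "group" };
static const struct model_fops model_device_fops = { "device" };

/* Return the group behind a file, or NULL if it is not a group file. */
static struct model_group *model_group_from_file(struct model_file *file)
{
	if (file->f_op != &model_group_fops)
		return NULL;
	return file->private_data;
}

/* Return the device behind a file, or NULL if it is not a device file. */
static struct model_device *model_device_from_file(struct model_file *file)
{
	if (file->f_op != &model_device_fops)
		return NULL;
	return file->private_data;
}

/* A group file covers every device in the group; a device file only itself. */
static bool model_file_has_dev(struct model_file *file,
			       struct model_device *device)
{
	struct model_group *group = model_group_from_file(file);
	struct model_device *vdev;

	if (group)
		return group == device->group;

	vdev = model_device_from_file(file);
	if (vdev)
		return vdev == device;

	return false;
}
```

In this model a group file proves ownership of all devices in the group,
a device file proves ownership of exactly one device, and any other file
proves nothing.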



* [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in hot reset path
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (6 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 21:23   ` Alex Williamson
  2023-04-05  9:32   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 09/12] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

No functional change is intended.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 52 ++++++++++++++++----------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 2a510b71edcb..da6325008872 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -177,10 +177,10 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 	}
 }
 
-struct vfio_pci_group_info;
+struct vfio_pci_file_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups,
+				      struct vfio_pci_file_info *info,
 				      struct iommufd_ctx *iommufd_ctx);
 
 /*
@@ -800,7 +800,7 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-struct vfio_pci_group_info {
+struct vfio_pci_file_info {
 	int count;
 	struct file **files;
 };
@@ -1257,14 +1257,14 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 }
 
 static int
-vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
-				    struct vfio_pci_hot_reset *hdr,
-				    bool slot,
-				    struct vfio_pci_hot_reset __user *arg)
+vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
+				   struct vfio_pci_hot_reset *hdr,
+				   bool slot,
+				   struct vfio_pci_hot_reset __user *arg)
 {
-	int32_t *group_fds;
+	int32_t *fds;
 	struct file **files;
-	struct vfio_pci_group_info info;
+	struct vfio_pci_file_info info;
 	int file_idx, count = 0, ret = 0;
 
 	/*
@@ -1281,17 +1281,17 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 	if (hdr->count > count)
 		return -EINVAL;
 
-	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
+	fds = kcalloc(hdr->count, sizeof(*fds), GFP_KERNEL);
 	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
-	if (!group_fds || !files) {
-		kfree(group_fds);
+	if (!fds || !files) {
+		kfree(fds);
 		kfree(files);
 		return -ENOMEM;
 	}
 
-	if (copy_from_user(group_fds, arg->group_fds,
-			   hdr->count * sizeof(*group_fds))) {
-		kfree(group_fds);
+	if (copy_from_user(fds, arg->group_fds,
+			   hdr->count * sizeof(*fds))) {
+		kfree(fds);
 		kfree(files);
 		return -EFAULT;
 	}
@@ -1301,7 +1301,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 	 * the reset
 	 */
 	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
-		struct file *file = fget(group_fds[file_idx]);
+		struct file *file = fget(fds[file_idx]);
 
 		if (!file) {
 			ret = -EBADF;
@@ -1318,9 +1318,9 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
 		files[file_idx] = file;
 	}
 
-	kfree(group_fds);
+	kfree(fds);
 
-	/* release reference to groups on error */
+	/* release reference to fds on error */
 	if (ret)
 		goto hot_reset_release;
 
@@ -1358,7 +1358,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return -ENODEV;
 
 	if (hdr.count)
-		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
+		return vfio_pci_ioctl_pci_hot_reset_files(vdev, &hdr, slot, arg);
 
 	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
 
@@ -2329,16 +2329,16 @@ const struct pci_error_handlers vfio_pci_core_err_handlers = {
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
-static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
-			       struct vfio_pci_group_info *groups)
+static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
+			      struct vfio_pci_file_info *info)
 {
 	unsigned int i;
 
-	if (!groups)
+	if (!info)
 		return false;
 
-	for (i = 0; i < groups->count; i++)
-		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
+	for (i = 0; i < info->count; i++)
+		if (vfio_file_has_dev(info->files[i], &vdev->vdev))
 			return true;
 	return false;
 }
@@ -2429,7 +2429,7 @@ static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups,
+				      struct vfio_pci_file_info *info,
 				      struct iommufd_ctx *iommufd_ctx)
 {
 	struct vfio_pci_core_device *cur_mem;
@@ -2478,7 +2478,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * the calling device is in a singleton dev_set.
 		 */
 		if (cur_vma->vdev.open_count &&
-		    !vfio_dev_in_groups(cur_vma, groups) &&
+		    !vfio_dev_in_files(cur_vma, info) &&
 		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
 		    (dev_set->device_count > 1)) {
 			ret = -EINVAL;
-- 
2.34.1



* [Intel-gfx] [PATCH v3 09/12] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (7 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in " Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-05  9:36   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device Yi Liu
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

The user can now also provide an array of device fds as a third method
of verifying reset ownership. This is not useful while device fds are
acquired via group fds, but it becomes necessary with device cdev, which
lets the user acquire device fds directly without going through a group.
In that case this method can be used as a last resort when the preferred
iommufd verification does not work, e.g. in noiommu usages.

Clarify this in the uAPI.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Yanting Jiang <yanting.jiang@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 9 +++++----
 include/uapi/linux/vfio.h        | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index da6325008872..19f5b075d70a 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1289,7 +1289,7 @@ vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
 		return -ENOMEM;
 	}
 
-	if (copy_from_user(fds, arg->group_fds,
+	if (copy_from_user(fds, arg->fds,
 			   hdr->count * sizeof(*fds))) {
 		kfree(fds);
 		kfree(files);
@@ -1297,8 +1297,8 @@ vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
 	}
 
 	/*
-	 * Get the group file for each fd to ensure the group held across
-	 * the reset
+	 * Get the file for each fd to ensure the group/device file
+	 * is held across the reset
 	 */
 	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
 		struct file *file = fget(fds[file_idx]);
@@ -2469,7 +2469,8 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * cannot race being opened by another user simultaneously.
 		 *
 		 * Otherwise all opened devices in the dev_set must be
-		 * contained by the set of groups provided by the user.
+		 * contained by the set of groups/devices provided by
+		 * the user.
 		 *
 		 * If user provides a zero-length array, then all the
 		 * opened devices must be bound to a same iommufd_ctx.
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 17aa5d09db41..25432ef213ee 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -681,6 +681,7 @@ struct vfio_pci_hot_reset_info {
  *
  * The ownership can be proved by:
  *   - An array of group fds
+ *   - An array of device fds
  *   - A zero-length array
  *
  * In the last case all affected devices which are opened by this user
@@ -694,7 +695,7 @@ struct vfio_pci_hot_reset {
 	__u32	argsz;
 	__u32	flags;
 	__u32	count;
-	__s32	group_fds[];
+	__s32	fds[];
 };
 
 #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
-- 
2.34.1
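
From userspace, the argument for the fd-array methods could be assembled
roughly as follows. This is a sketch under stated assumptions: struct
hot_reset_arg is a local mirror of struct vfio_pci_hot_reset as laid out
by this patch (installed headers may still name the array group_fds), and
hot_reset_arg_alloc() is a hypothetical helper, not part of any library.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of struct vfio_pci_hot_reset as laid out by this patch. */
struct hot_reset_arg {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	int32_t  fds[];		/* group fds or device fds */
};

/*
 * Hypothetical helper: allocate and fill the ioctl argument.
 * count == 0 yields the zero-length array used by the iommufd method.
 * The caller frees the result with free().
 */
static struct hot_reset_arg *hot_reset_arg_alloc(const int32_t *fds,
						 uint32_t count)
{
	size_t size = sizeof(struct hot_reset_arg) +
		      (size_t)count * sizeof(int32_t);
	struct hot_reset_arg *arg = calloc(1, size);

	if (!arg)
		return NULL;

	arg->argsz = (uint32_t)size;
	arg->flags = 0;
	arg->count = count;
	if (count)
		memcpy(arg->fds, fds, (size_t)count * sizeof(int32_t));
	return arg;
}
```

The result would then be passed as the third argument of
ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, arg), where device_fd is the
fd of the device being reset.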



* [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (8 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 09/12] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-05 11:48   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi Yi Liu
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

Some users, e.g. vfio-pci, need to check whether a vfio_device is opened
as a cdev. Add a flag to vfio_device for this; it will be set in the cdev
path when the device is opened. The flag is not used yet; it is
preparation for vfio device cdev support.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/linux/vfio.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index f8fb9ab25188..d9a0770e5fc1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -62,6 +62,7 @@ struct vfio_device {
 	struct iommufd_device *iommufd_device;
 	bool iommufd_attached;
 #endif
+	bool cdev_opened;
 };
 
 /**
@@ -151,6 +152,12 @@ vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
 #endif
 
+static inline bool vfio_device_cdev_opened(struct vfio_device *device)
+{
+	lockdep_assert_held(&device->dev_set->lock);
+	return device->cdev_opened;
+}
+
 /**
  * @migration_set_state: Optional callback to change the migration state for
  *         devices that support migration. It's mandatory for
-- 
2.34.1



* [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (9 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-04 21:00   ` Alex Williamson
  2023-04-05 11:46   ` Eric Auger
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO Yi Liu
  2023-04-01 14:47 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for Introduce new methods for verifying ownership in vfio PCI hot reset (rev4) Patchwork
  12 siblings, 2 replies; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

There are IOMMUFD users that want to check whether an ID generated by
IOMMUFD is valid. For example, vfio-pci optionally returns an invalid
dev_id to the user in the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so
the user needs to check whether the ID is valid.

IOMMUFD_INVALID_ID is defined as 0 since the IDs generated by IOMMUFD
start from 0.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 include/uapi/linux/iommufd.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 98ebba80cfa1..aeae73a93833 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -9,6 +9,9 @@
 
 #define IOMMUFD_TYPE (';')
 
+/* IDs allocated by IOMMUFD starts from 0 */
+#define IOMMUFD_INVALID_ID 0
+
 /**
  * DOC: General ioctl format
  *
-- 
2.34.1



* [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (10 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi Yi Liu
@ 2023-04-01 14:44 ` Yi Liu
  2023-04-03  9:25   ` Liu, Yi L
                     ` (2 more replies)
  2023-04-01 14:47 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for Introduce new methods for verifying ownership in vfio PCI hot reset (rev4) Patchwork
  12 siblings, 3 replies; 145+ messages in thread
From: Yi Liu @ 2023-04-01 14:44 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, yi.l.liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

This allows users that accept device fds passed from management stacks
to figure out which of the devices they have opened are affected by the
hot reset. This is needed because such users have no BDF (bus, devfn)
knowledge about the devices they have opened, and hence cannot use the
information reported by the existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
to identify the affected devices.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
 include/uapi/linux/vfio.h        | 24 ++++++++++++-
 2 files changed, 74 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 19f5b075d70a..a5a7e148dce1 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -30,6 +30,7 @@
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
+#include <uapi/linux/iommufd.h>
 
 #include "vfio_pci_priv.h"
 
@@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	return 0;
 }
 
+static struct vfio_device *
+vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
+			       struct pci_dev *pdev)
+{
+	struct vfio_device *cur;
+
+	lockdep_assert_held(&dev_set->lock);
+
+	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
+		if (cur->dev == &pdev->dev)
+			return cur;
+	return NULL;
+}
+
 static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 {
 	(*(int *)data)++;
@@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
 struct vfio_pci_fill_info {
 	int max;
 	int cur;
+	bool require_devid;
+	struct iommufd_ctx *iommufd;
+	struct vfio_device_set *dev_set;
 	struct vfio_pci_dependent_device *devices;
 };
 
 static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 {
 	struct vfio_pci_fill_info *fill = data;
+	struct vfio_device_set *dev_set = fill->dev_set;
 	struct iommu_group *iommu_group;
+	struct vfio_device *vdev;
+
+	lockdep_assert_held(&dev_set->lock);
 
 	if (fill->cur == fill->max)
 		return -EAGAIN; /* Something changed, try again */
@@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 	if (!iommu_group)
 		return -EPERM; /* Cannot reset non-isolated devices */
 
-	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	if (fill->require_devid) {
+		/*
+		 * Report dev_id of the devices that are opened as cdev
+		 * and have the same iommufd with the fill->iommufd.
+		 * Otherwise, just fill IOMMUFD_INVALID_ID.
+		 */
+		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
+		if (vdev && vfio_device_cdev_opened(vdev) &&
+		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
+			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
+		else
+			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
+	} else {
+		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
+	}
 	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
 	fill->devices[fill->cur].bus = pdev->bus->number;
 	fill->devices[fill->cur].devfn = pdev->devfn;
@@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 		return -ENOMEM;
 
 	fill.devices = devices;
+	fill.dev_set = vdev->vdev.dev_set;
 
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	if (vfio_device_cdev_opened(&vdev->vdev)) {
+		fill.require_devid = true;
+		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	}
 	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
 					    &fill, slot);
+	mutex_unlock(&vdev->vdev.dev_set->lock);
 
 	/*
 	 * If a device was removed between counting and filling, we may come up
 	 * short of fill.max.  If a device was added, we'll have a return of
 	 * -EAGAIN above.
 	 */
-	if (!ret)
+	if (!ret) {
 		hdr.count = fill.cur;
+		if (fill.require_devid)
+			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
+	}
 
 reset_info_exit:
 	if (copy_to_user(arg, &hdr, minsz))
@@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
 static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
 {
 	struct vfio_device_set *dev_set = data;
-	struct vfio_device *cur;
 
-	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
-		if (cur->dev == &pdev->dev)
-			return 0;
-	return -EBUSY;
+	lockdep_assert_held(&dev_set->lock);
+
+	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
 }
 
 /*
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 25432ef213ee..5a34364e3b94 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -650,11 +650,32 @@ enum {
  * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
  *					      struct vfio_pci_hot_reset_info)
  *
+ * This command is used to query the affected devices in the hot reset for
+ * a given device.  User could use the information reported by this command
+ * to figure out the affected devices among the devices it has opened.
+ * This command always reports the segment, bus and devfn information for
+ * each affected device, and selectively report the group_id or the dev_id
+ * per the way how the device being queried is opened.
+ *	- If the device is opened via the traditional group/container manner,
+ *	  this command reports the group_id for each affected device.
+ *
+ *	- If the device is opened as a cdev, this command needs to report
+ *	  dev_id for each affected device and set the
+ *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
+ *	  devices that are not opened as cdev or bound to different iommufds
+ *	  with the device that is queried, report an invalid dev_id to avoid
+ *	  potential dev_id conflict as dev_id is local to iommufd.  For such
+ *	  affected devices, user shall fall back to use the segment, bus and
+ *	  devfn info to map it to opened device.
+ *
  * Return: 0 on success, -errno on failure:
  *	-enospc = insufficient buffer, -enodev = unsupported for device.
  */
 struct vfio_pci_dependent_device {
-	__u32	group_id;
+	union {
+		__u32   group_id;
+		__u32	dev_id;
+	};
 	__u16	segment;
 	__u8	bus;
 	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
@@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
 struct vfio_pci_hot_reset_info {
 	__u32	argsz;
 	__u32	flags;
+#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
 	__u32	count;
 	struct vfio_pci_dependent_device	devices[];
 };
-- 
2.34.1
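
A user of the extended info ioctl might match the reported entries
against its own devices along the following lines. This is a sketch only:
the MODEL_* constants mirror VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID and
IOMMUFD_INVALID_ID from this series, while struct dep_dev, struct
opened_dev and both functions are hypothetical names for illustration.
When the flag is not set, the id field holds a group_id and the sketch
simply falls back to BDF matching.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FLAG_IOMMUFD_DEV_ID	(1u << 0) /* VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID */
#define MODEL_INVALID_ID		0u	  /* IOMMUFD_INVALID_ID in this series */

/* Local mirror of struct vfio_pci_dependent_device. */
struct dep_dev {
	uint32_t id;		/* group_id, or dev_id if the flag is set */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

/* Hypothetical record of what the user knows about a device it opened. */
struct opened_dev {
	uint32_t dev_id;	/* dev_id obtained when binding to iommufd */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
	bool     bdf_known;	/* false if the fd came from a management stack */
};

/* Match by dev_id when one is reported, otherwise fall back to BDF. */
static bool dep_matches(uint32_t flags, const struct dep_dev *dep,
			const struct opened_dev *od)
{
	if ((flags & MODEL_FLAG_IOMMUFD_DEV_ID) && dep->id != MODEL_INVALID_ID)
		return dep->id == od->dev_id;

	return od->bdf_known && dep->segment == od->segment &&
	       dep->bus == od->bus && dep->devfn == od->devfn;
}

/* Count affected devices that match none of the user's opened devices. */
static size_t count_unmatched(uint32_t flags,
			      const struct dep_dev *deps, size_t ndeps,
			      const struct opened_dev *opened, size_t nopened)
{
	size_t i, j, unmatched = 0;

	for (i = 0; i < ndeps; i++) {
		bool found = false;

		for (j = 0; j < nopened && !found; j++)
			found = dep_matches(flags, &deps[i], &opened[j]);
		if (!found)
			unmatched++;
	}
	return unmatched;
}
```

A non-zero result from count_unmatched() means that some affected device
could not be mapped to a device the user has opened.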



* [Intel-gfx] ✗ Fi.CI.BUILD: failure for Introduce new methods for verifying ownership in vfio PCI hot reset (rev4)
  2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
                   ` (11 preceding siblings ...)
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO Yi Liu
@ 2023-04-01 14:47 ` Patchwork
  12 siblings, 0 replies; 145+ messages in thread
From: Patchwork @ 2023-04-01 14:47 UTC (permalink / raw)
  To: Yi Liu; +Cc: intel-gfx

== Series Details ==

Series: Introduce new methods for verifying ownership in vfio PCI hot reset (rev4)
URL   : https://patchwork.freedesktop.org/series/115264/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/115264/revisions/4/mbox/ not applied
Applying: vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset()
Applying: vfio/pci: Only check ownership of opened devices in hot reset
Applying: vfio/pci: Move the existing hot reset logic to be a helper
Applying: vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
Applying: vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
Applying: vfio: Refine vfio file kAPIs for vfio PCI hot reset
Applying: vfio: Accpet device file from vfio PCI hot reset path
Applying: vfio/pci: Renaming for accepting device fd in hot reset path
Applying: vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
Applying: vfio: Mark cdev usage in vfio_device
error: sha1 information is lacking or useless (include/linux/vfio.h).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0010 vfio: Mark cdev usage in vfio_device
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Build failed, no error log produced




* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO Yi Liu
@ 2023-04-03  9:25   ` Liu, Yi L
  2023-04-03 15:01     ` Alex Williamson
  2023-04-04 22:20   ` Alex Williamson
  2023-04-05 12:19   ` Eric Auger
  2 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-03  9:25 UTC (permalink / raw)
  To: alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang, Jiang,
	Yanting

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Saturday, April 1, 2023 10:44 PM

> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */

Hi Alex,

Is disabling iommu a sane way to test vfio noiommu mode? If no, just skip
the below contents. 😊 If yes, then may need to check if below is expected.

I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
iommufd==-1 can succeed, but failed to get hot reset info due to the above
group check. Reason is that this happens to have some affected devices, and
these devices have no valid iommu_group (because they are not bound to vfio-pci
hence nobody allocates noiommu group for them). So when hot reset info loops
such devices, it failed with -EPERM. Is this expected?

Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-03  9:25   ` Liu, Yi L
@ 2023-04-03 15:01     ` Alex Williamson
  2023-04-03 15:22       ` Liu, Yi L
  2023-04-07 10:09       ` Liu, Yi L
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-03 15:01 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, 3 Apr 2023 09:25:06 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Saturday, April 1, 2023 10:44 PM  
> 
> > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  	if (!iommu_group)
> >  		return -EPERM; /* Cannot reset non-isolated devices */  
> 
> Hi Alex,
> 
> Is disabling iommu a sane way to test vfio noiommu mode?

Yes

> I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> iommufd==-1 can succeed, but failed to get hot reset info due to the above
> group check. Reason is that this happens to have some affected devices, and
> these devices have no valid iommu_group (because they are not bound to vfio-pci
> hence nobody allocates noiommu group for them). So when hot reset info loops
> such devices, it failed with -EPERM. Is this expected?

Hmm, I didn't recall that we put in such a limitation, but given the
minimally intrusive approach to no-iommu and the fact that we never
defined an invalid group ID to return to the user, it makes sense that
we just blocked the ioctl for no-iommu use.  I guess we can do the same
for no-iommu cdev.
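
For illustration only, the failure mode under discussion can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel code: `struct model_dev`, `model_fill_devs()` and the `-ENOSPC` return for a short buffer are hypothetical names and choices. It only mirrors the quoted `if (!iommu_group) return -EPERM;` check, where a single affected device without a group makes the whole info walk fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one PCI device affected by the reset. */
struct model_dev {
	int has_group;	/* 0 models "no iommu_group was ever allocated" */
	int group_id;	/* only meaningful when has_group is non-zero */
};

/*
 * Walk every affected device and record its group id in out[].
 * Returns the number of ids written, -EPERM as soon as one device has
 * no group, or -ENOSPC (a model choice) if out[] is too small.
 */
int model_fill_devs(const struct model_dev *devs, size_t n,
		    int *out, size_t max)
{
	size_t i;

	assert(devs != NULL && out != NULL);

	for (i = 0; i < n; i++) {
		if (i == max)
			return -ENOSPC;
		if (!devs[i].has_group)
			return -EPERM;	/* no group id to report to the user */
		out[i] = devs[i].group_id;
	}
	return (int)n;
}
```

In this model, one device that is not bound to vfio-pci (and so has no noiommu group) is enough to turn the whole query into -EPERM, which matches the behaviour reported above.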

BTW, what does this series apply on?  I'm assuming[1], but I don't see
a branch from Jason yet.  Thanks,

Alex

[1]https://lore.kernel.org/all/20230327093351.44505-1-yi.l.liu@intel.com/


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-03 15:01     ` Alex Williamson
@ 2023-04-03 15:22       ` Liu, Yi L
  2023-04-03 15:32         ` Alex Williamson
  2023-04-07 10:09       ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-03 15:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 3, 2023 11:02 PM
> 
> On Mon, 3 Apr 2023 09:25:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Saturday, April 1, 2023 10:44 PM
> >
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> *data)
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > Hi Alex,
> >
> > Is disabling iommu a sane way to test vfio noiommu mode?
> 
> Yes
> 
> > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > group check. Reason is that this happens to have some affected devices, and
> > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > hence nobody allocates noiommu group for them). So when hot reset info loops
> > such devices, it failed with -EPERM. Is this expected?
> 
> Hmm, I didn't recall that we put in such a limitation, but given the
> minimally intrusive approach to no-iommu and the fact that we never
> defined an invalid group ID to return to the user, it makes sense that
> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> for no-iommu cdev.

sure.

> 
> BTW, what does this series apply on?  I'm assuming[1], but I don't see
> a branch from Jason yet.  Thanks,

Yes, this series applies on top of [1]. I put [1], this series and the cdev series
in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.

Jason has taken [1] in the branch below. It is based on rc1, so I hesitated
to apply this series and the cdev series on top of it. Maybe I should have done
so to make life easier. 😊

https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next

> Alex
> 
> [1]https://lore.kernel.org/all/20230327093351.44505-1-yi.l.liu@intel.com/

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-03 15:22       ` Liu, Yi L
@ 2023-04-03 15:32         ` Alex Williamson
  2023-04-03 16:12           ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-03 15:32 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, 3 Apr 2023 15:22:03 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 3, 2023 11:02 PM
> > 
> > On Mon, 3 Apr 2023 09:25:06 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Saturday, April 1, 2023 10:44 PM  
> > >  
> > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > *data)  
> > > >  	if (!iommu_group)
> > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > >
> > > Hi Alex,
> > >
> > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > 
> > Yes
> >   
> > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > group check. Reason is that this happens to have some affected devices, and
> > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > such devices, it failed with -EPERM. Is this expected?  
> > 
> > Hmm, I didn't recall that we put in such a limitation, but given the
> > minimally intrusive approach to no-iommu and the fact that we never
> > defined an invalid group ID to return to the user, it makes sense that
> > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > for no-iommu cdev.  
> 
> sure.
> 
> > 
> > BTW, what does this series apply on?  I'm assuming[1], but I don't see
> > a branch from Jason yet.  Thanks,  
> 
> yes, this series is applied on [1]. I put the [1], this series and cdev series
> in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.
> 
> Jason has taken [1] in the below branch. It is based on rc1. So I hesitated
> to apply this series and cdev series on top of it. Maybe I should have done
> it to make life easier. 😊
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next

Seems like it must be in the vfio_mdev_ops branch which has not been
pushed aside from the merge back to for-next.  Jason?  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-03 15:32         ` Alex Williamson
@ 2023-04-03 16:12           ` Jason Gunthorpe
  0 siblings, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-03 16:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, Apr 03, 2023 at 09:32:18AM -0600, Alex Williamson wrote:
> > yes, this series is applied on [1]. I put the [1], this series and cdev series
> > in https://github.com/yiliu1765/iommufd/commits/vfio_device_cdev_v9.
> > 
> > Jason has taken [1] in the below branch. It is based on rc1. So I hesitated
> > to apply this series and cdev series on top of it. Maybe I should have done
> > it to make life easier. 😊
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
> 
> Seems like it must be in the vfio_mdev_ops branch which has not been
> pushed aside from the merge back to for-next.  Jason?  Thanks,

Yeah, I didn't think we'd need it until we got to the cdev series, let
me do the steps..

Jason

* Re: [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
@ 2023-04-04 13:59   ` Eric Auger
  2023-04-04 14:24     ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-04 13:59 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi,

On 4/1/23 16:44, Yi Liu wrote:
> This prepares to add another method for hot reset. The major hot reset logic
> are moved to vfio_pci_ioctl_pci_hot_reset_groups().
>
> No functional change is intended.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 56 +++++++++++++++++++-------------
>  1 file changed, 33 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 5d745c9abf05..3696b8e58445 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1255,29 +1255,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  	return ret;
>  }
>  
> -static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
> -					struct vfio_pci_hot_reset __user *arg)
> +static int
> +vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
> +				    struct vfio_pci_hot_reset *hdr,
nit: why don't you simply pass the user group count as decoded earlier?
hdr sounds like a dup of arg.
> +				    bool slot,
> +				    struct vfio_pci_hot_reset __user *arg)
>  {
> -	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> -	struct vfio_pci_hot_reset hdr;
>  	int32_t *group_fds;
>  	struct file **files;
>  	struct vfio_pci_group_info info;
> -	bool slot = false;
>  	int file_idx, count = 0, ret = 0;
>  
> -	if (copy_from_user(&hdr, arg, minsz))
> -		return -EFAULT;
> -
> -	if (hdr.argsz < minsz || hdr.flags)
> -		return -EINVAL;
> -
> -	/* Can we do a slot or bus reset or neither? */
> -	if (!pci_probe_reset_slot(vdev->pdev->slot))
> -		slot = true;
> -	else if (pci_probe_reset_bus(vdev->pdev->bus))
> -		return -ENODEV;
> -
>  	/*
>  	 * We can't let userspace give us an arbitrarily large buffer to copy,
>  	 * so verify how many we think there could be.  Note groups can have
> @@ -1289,11 +1277,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  		return ret;
>  
>  	/* Somewhere between 1 and count is OK */
> -	if (!hdr.count || hdr.count > count)
> +	if (!hdr->count || hdr->count > count)
>  		return -EINVAL;
>  
> -	group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
> -	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
> +	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> +	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
>  	if (!group_fds || !files) {
>  		kfree(group_fds);
>  		kfree(files);
> @@ -1301,7 +1289,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	}
>  
>  	if (copy_from_user(group_fds, arg->group_fds,
> -			   hdr.count * sizeof(*group_fds))) {
> +			   hdr->count * sizeof(*group_fds))) {
>  		kfree(group_fds);
>  		kfree(files);
>  		return -EFAULT;
> @@ -1311,7 +1299,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	 * Get the group file for each fd to ensure the group held across
>  	 * the reset
>  	 */
> -	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
> +	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
>  		struct file *file = fget(group_fds[file_idx]);
>  
>  		if (!file) {
> @@ -1335,7 +1323,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	if (ret)
>  		goto hot_reset_release;
>  
> -	info.count = hdr.count;
> +	info.count = hdr->count;
>  	info.files = files;
>  
>  	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> @@ -1348,6 +1336,28 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	return ret;
>  }
>  
> +static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
> +					struct vfio_pci_hot_reset __user *arg)
> +{
> +	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> +	struct vfio_pci_hot_reset hdr;
> +	bool slot = false;
> +
> +	if (copy_from_user(&hdr, arg, minsz))
> +		return -EFAULT;
> +
> +	if (hdr.argsz < minsz || hdr.flags)
> +		return -EINVAL;
> +
> +	/* Can we do a slot or bus reset or neither? */
> +	if (!pci_probe_reset_slot(vdev->pdev->slot))
> +		slot = true;
> +	else if (pci_probe_reset_bus(vdev->pdev->bus))
> +		return -ENODEV;
> +
> +	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +}
> +
>  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
>  				    struct vfio_device_ioeventfd __user *arg)
>  {
Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric


* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset Yi Liu
@ 2023-04-04 13:59   ` Eric Auger
  2023-04-04 14:37     ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-04 13:59 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi,

On 4/1/23 16:44, Yi Liu wrote:
> If the affected device is not opened by any user, it's safe to reset it
> given it's not in use.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 14 +++++++++++---
>  include/uapi/linux/vfio.h        |  8 ++++++++
>  2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 65bbef562268..5d745c9abf05 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -2429,10 +2429,18 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
> -		 * Test whether all the affected devices are contained by the
> -		 * set of groups provided by the user.
> +		 * Test whether all the affected devices can be reset by the
> +		 * user.
> +		 *
> +		 * Resetting an unused device (not opened) is safe, because
> +		 * dev_set->lock is held in hot reset path so this device
> +		 * cannot race being opened by another user simultaneously.
> +		 *
> +		 * Otherwise all opened devices in the dev_set must be
> +		 * contained by the set of groups provided by the user.
>  		 */
> -		if (!vfio_dev_in_groups(cur_vma, groups)) {
> +		if (cur_vma->vdev.open_count &&
> +		    !vfio_dev_in_groups(cur_vma, groups)) {
>  			ret = -EINVAL;
>  			goto err_undo;
>  		}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0552e8dcf0cb..f96e5689cffc 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
>   *				    struct vfio_pci_hot_reset)
>   *
> + * Userspace requests hot reset for the devices it uses.  Due to the
> + * underlying topology, multiple devices can be affected in the reset
by the reset
> + * while some might be opened by another user.  To avoid interference
s/interference/hot reset failure?
> + * the calling user must ensure all affected devices, if opened, are
> + * owned by itself.
> + *
> + * The ownership is proved by an array of group fds.
> + *
>   * Return: 0 on success, -errno on failure.
>   */
>  struct vfio_pci_hot_reset {
Thanks

Eric


* Re: [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset()
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
@ 2023-04-04 13:59   ` Eric Auger
  2023-04-04 14:37     ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-04 13:59 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi,

On 4/1/23 16:44, Yi Liu wrote:
> this suits more on what the code does.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a5ab416cf476..65bbef562268 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1308,9 +1308,8 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	}
>  
>  	/*
> -	 * For each group_fd, get the group through the vfio external user
> -	 * interface and store the group and iommu ID.  This ensures the group
> -	 * is held across the reset.
> +	 * Get the group file for each fd to ensure the group held across
to ensure the group is held

Besides

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric


> +	 * the reset
>  	 */
>  	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
>  		struct file *file = fget(group_fds[file_idx]);


* Re: [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper
  2023-04-04 13:59   ` Eric Auger
@ 2023-04-04 14:24     ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-04 14:24 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Tuesday, April 4, 2023 10:00 PM
> 
> Hi Yi,
> 
> On 4/1/23 16:44, Yi Liu wrote:
> > This prepares to add another method for hot reset. The major hot reset logic
> > are moved to vfio_pci_ioctl_pci_hot_reset_groups().
> >
> > No functional change is intended.
> >
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 56 +++++++++++++++++++-------------
> >  1 file changed, 33 insertions(+), 23 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 5d745c9abf05..3696b8e58445 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1255,29 +1255,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
> >  	return ret;
> >  }
> >
> > -static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
> > -					struct vfio_pci_hot_reset __user *arg)
> > +static int
> > +vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
> > +				    struct vfio_pci_hot_reset *hdr,
> nit why don't you simply pass the user group count as decoded earlier.
> hdr sounds like a dup of arg.

indeed. only hdr->count is needed.

> > +				    bool slot,
> > +				    struct vfio_pci_hot_reset __user *arg)
> >  {
> > -	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> > -	struct vfio_pci_hot_reset hdr;
> >  	int32_t *group_fds;
> >  	struct file **files;
> >  	struct vfio_pci_group_info info;
> > -	bool slot = false;
> >  	int file_idx, count = 0, ret = 0;
> >
> > -	if (copy_from_user(&hdr, arg, minsz))
> > -		return -EFAULT;
> > -
> > -	if (hdr.argsz < minsz || hdr.flags)
> > -		return -EINVAL;
> > -
> > -	/* Can we do a slot or bus reset or neither? */
> > -	if (!pci_probe_reset_slot(vdev->pdev->slot))
> > -		slot = true;
> > -	else if (pci_probe_reset_bus(vdev->pdev->bus))
> > -		return -ENODEV;
> > -
> >  	/*
> >  	 * We can't let userspace give us an arbitrarily large buffer to copy,
> >  	 * so verify how many we think there could be.  Note groups can have
> > @@ -1289,11 +1277,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  		return ret;
> >
> >  	/* Somewhere between 1 and count is OK */
> > -	if (!hdr.count || hdr.count > count)
> > +	if (!hdr->count || hdr->count > count)
> >  		return -EINVAL;
> >
> > -	group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
> > -	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
> > +	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> > +	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
> >  	if (!group_fds || !files) {
> >  		kfree(group_fds);
> >  		kfree(files);
> > @@ -1301,7 +1289,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	}
> >
> >  	if (copy_from_user(group_fds, arg->group_fds,
> > -			   hdr.count * sizeof(*group_fds))) {
> > +			   hdr->count * sizeof(*group_fds))) {
> >  		kfree(group_fds);
> >  		kfree(files);
> >  		return -EFAULT;
> > @@ -1311,7 +1299,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	 * Get the group file for each fd to ensure the group held across
> >  	 * the reset
> >  	 */
> > -	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
> > +	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
> >  		struct file *file = fget(group_fds[file_idx]);
> >
> >  		if (!file) {
> > @@ -1335,7 +1323,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	if (ret)
> >  		goto hot_reset_release;
> >
> > -	info.count = hdr.count;
> > +	info.count = hdr->count;
> >  	info.files = files;
> >
> >  	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> > @@ -1348,6 +1336,28 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	return ret;
> >  }
> >
> > +static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
> > +					struct vfio_pci_hot_reset __user *arg)
> > +{
> > +	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> > +	struct vfio_pci_hot_reset hdr;
> > +	bool slot = false;
> > +
> > +	if (copy_from_user(&hdr, arg, minsz))
> > +		return -EFAULT;
> > +
> > +	if (hdr.argsz < minsz || hdr.flags)
> > +		return -EINVAL;
> > +
> > +	/* Can we do a slot or bus reset or neither? */
> > +	if (!pci_probe_reset_slot(vdev->pdev->slot))
> > +		slot = true;
> > +	else if (pci_probe_reset_bus(vdev->pdev->bus))
> > +		return -ENODEV;
> > +
> > +	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> > +}
> > +
> >  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> >  				    struct vfio_device_ioeventfd __user *arg)
> >  {
> Besides
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks,
Yi Liu

* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-04 13:59   ` Eric Auger
@ 2023-04-04 14:37     ` Liu, Yi L
  2023-04-04 15:18       ` Eric Auger
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-04 14:37 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Tuesday, April 4, 2023 10:00 PM
> 
> Hi YI,
> 
> On 4/1/23 16:44, Yi Liu wrote:
> > If the affected device is not opened by any user, it's safe to reset it
> > given it's not in use.
> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 14 +++++++++++---
> >  include/uapi/linux/vfio.h        |  8 ++++++++
> >  2 files changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 65bbef562268..5d745c9abf05 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -2429,10 +2429,18 @@ static int vfio_pci_dev_set_hot_reset(struct
> vfio_device_set *dev_set,
> >
> >  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
> >  		/*
> > -		 * Test whether all the affected devices are contained by the
> > -		 * set of groups provided by the user.
> > +		 * Test whether all the affected devices can be reset by the
> > +		 * user.
> > +		 *
> > +		 * Resetting an unused device (not opened) is safe, because
> > +		 * dev_set->lock is held in hot reset path so this device
> > +		 * cannot race being opened by another user simultaneously.
> > +		 *
> > +		 * Otherwise all opened devices in the dev_set must be
> > +		 * contained by the set of groups provided by the user.
> >  		 */
> > -		if (!vfio_dev_in_groups(cur_vma, groups)) {
> > +		if (cur_vma->vdev.open_count &&
> > +		    !vfio_dev_in_groups(cur_vma, groups)) {
> >  			ret = -EINVAL;
> >  			goto err_undo;
> >  		}
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 0552e8dcf0cb..f96e5689cffc 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
> >   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
> >   *				    struct vfio_pci_hot_reset)
> >   *
> > + * Userspace requests hot reset for the devices it uses.  Due to the
> > + * underlying topology, multiple devices can be affected in the reset
> by the reset
> > + * while some might be opened by another user.  To avoid interference
> s/interference/hot reset failure?

I don't think the user can really avoid hot reset failure, since new devices
may be plugged into the affected slot. Even if the user has opened all the
groups/devices reported by VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, the hot reset
can still fail if a new device is plugged in and is either not bound to vfio
or is opened by another user during the window between _INFO and HOT_RESET.

maybe the whole statement should be as below:

To avoid interference, the hot reset can only be conducted when all
the affected devices are either opened by the calling user or not
opened yet at the moment of the hot reset attempt.
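
The rule stated above can be sketched as a small userspace model. It is not the kernel implementation: `struct model_vdev`, `model_dev_in_groups()` and `model_hot_reset_allowed()` are hypothetical names, and group ownership is reduced to a plain array of group ids. The loop mirrors the `open_count && !vfio_dev_in_groups()` test from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one device in the affected dev_set. */
struct model_vdev {
	unsigned int open_count;	/* 0: nobody has the device open */
	int group_id;			/* group the device belongs to */
};

/* Return 1 if dev's group is among the groups the caller proved it owns. */
int model_dev_in_groups(const struct model_vdev *dev,
			const int *groups, size_t ngroups)
{
	size_t i;

	for (i = 0; i < ngroups; i++)
		if (groups[i] == dev->group_id)
			return 1;
	return 0;
}

/*
 * Allow the reset only if every affected device is either unopened or
 * covered by the caller's groups; otherwise fail with -EINVAL.
 */
int model_hot_reset_allowed(const struct model_vdev *set, size_t ndevs,
			    const int *groups, size_t ngroups)
{
	size_t i;

	assert(set != NULL);

	for (i = 0; i < ndevs; i++) {
		if (set[i].open_count &&
		    !model_dev_in_groups(&set[i], groups, ngroups))
			return -EINVAL;
	}
	return 0;
}
```

The model assumes the whole walk happens under one lock, as the patch comment says of dev_set->lock, so an unopened device cannot become opened halfway through the check.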

> > + * the calling user must ensure all affected devices, if opened, are
> > + * owned by itself.
> > + *
> > + * The ownership is proved by an array of group fds.
> > + *
> >   * Return: 0 on success, -errno on failure.
> >   */
> >  struct vfio_pci_hot_reset {

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset()
  2023-04-04 13:59   ` Eric Auger
@ 2023-04-04 14:37     ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-04 14:37 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Tuesday, April 4, 2023 10:00 PM
> 
> Hi Yi,
> 
> On 4/1/23 16:44, Yi Liu wrote:
> > this suits more on what the code does.
> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a5ab416cf476..65bbef562268 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1308,9 +1308,8 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	}
> >
> >  	/*
> > -	 * For each group_fd, get the group through the vfio external user
> > -	 * interface and store the group and iommu ID.  This ensures the group
> > -	 * is held across the reset.
> > +	 * Get the group file for each fd to ensure the group held across
> to ensure the group is held

got it.

> Besides
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Eric
> 
> 
> > +	 * the reset
> >  	 */
> >  	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
> >  		struct file *file = fget(group_fds[file_idx]);

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-04 14:37     ` Liu, Yi L
@ 2023-04-04 15:18       ` Eric Auger
  2023-04-04 15:29         ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-04 15:18 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

Hi Yi,

On 4/4/23 16:37, Liu, Yi L wrote:
> Hi Eric,
>
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Tuesday, April 4, 2023 10:00 PM
>>
>> Hi YI,
>>
>> On 4/1/23 16:44, Yi Liu wrote:
>>> If the affected device is not opened by any user, it's safe to reset it
>>> given it's not in use.
>>>
>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> ---
>>>  drivers/vfio/pci/vfio_pci_core.c | 14 +++++++++++---
>>>  include/uapi/linux/vfio.h        |  8 ++++++++
>>>  2 files changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>> index 65bbef562268..5d745c9abf05 100644
>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>> @@ -2429,10 +2429,18 @@ static int vfio_pci_dev_set_hot_reset(struct
>> vfio_device_set *dev_set,
>>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>>>  		/*
>>> -		 * Test whether all the affected devices are contained by the
>>> -		 * set of groups provided by the user.
>>> +		 * Test whether all the affected devices can be reset by the
>>> +		 * user.
>>> +		 *
>>> +		 * Resetting an unused device (not opened) is safe, because
>>> +		 * dev_set->lock is held in hot reset path so this device
>>> +		 * cannot race being opened by another user simultaneously.
>>> +		 *
>>> +		 * Otherwise all opened devices in the dev_set must be
>>> +		 * contained by the set of groups provided by the user.
>>>  		 */
>>> -		if (!vfio_dev_in_groups(cur_vma, groups)) {
>>> +		if (cur_vma->vdev.open_count &&
>>> +		    !vfio_dev_in_groups(cur_vma, groups)) {
>>>  			ret = -EINVAL;
>>>  			goto err_undo;
>>>  		}
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index 0552e8dcf0cb..f96e5689cffc 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
>>>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
>>>   *				    struct vfio_pci_hot_reset)
>>>   *
>>> + * Userspace requests hot reset for the devices it uses.  Due to the
>>> + * underlying topology, multiple devices can be affected in the reset
>> by the reset
>>> + * while some might be opened by another user.  To avoid interference
>> s/interference/hot reset failure?
> I don’t think user can really avoid hot reset failure since there may
> be new devices plugged into the affected slot. Even user has opened
I don't know the legacy behavior wrt that issue, but this sounds like a
serious one: does it mean the reset of an assigned device could impact
another device belonging to another group not owned by the user?
> all the groups/devices reported by VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
> the hot reset can fail if new device is plugged in and has not been
> bound to vfio or opened by another user during the window of
> _INFO and HOT_RESET.
With respect to the latter, isn't the dev_set lock held during the hot
reset, and isn't that sufficient to prevent any new open from occurring?
>
> maybe the whole statement should be as below:
>
> To avoid interference, the hot reset can only be conducted when all
> the affected devices are either opened by the calling user or not
> opened yet at the moment of the hot reset attempt.

OK

Eric
>
>>> + * the calling user must ensure all affected devices, if opened, are
>>> + * owned by itself.
>>> + *
>>> + * The ownership is proved by an array of group fds.
>>> + *
>>>   * Return: 0 on success, -errno on failure.
>>>   */
>>>  struct vfio_pci_hot_reset {
> Regards,
> Yi Liu



* Re: [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
@ 2023-04-04 15:28   ` Eric Auger
  2023-04-04 21:48     ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-04 15:28 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi,

On 4/1/23 16:44, Yi Liu wrote:
> This is needed by the vfio-pci driver to report affected devices in the
> hot reset for a given device.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/iommu/iommufd/device.c | 12 ++++++++++++
>  drivers/vfio/iommufd.c         | 14 ++++++++++++++
>  include/linux/iommufd.h        |  3 +++
>  include/linux/vfio.h           | 13 +++++++++++++
>  4 files changed, 42 insertions(+)
>
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 25115d401d8f..04a57aa1ae2c 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -131,6 +131,18 @@ void iommufd_device_unbind(struct iommufd_device *idev)
>  }
>  EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
>  
> +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
> +{
> +	return idev->ictx;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
> +
> +u32 iommufd_device_to_id(struct iommufd_device *idev)
> +{
> +	return idev->obj.id;
> +}
> +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, IOMMUFD);
> +
>  static int iommufd_device_setup_msi(struct iommufd_device *idev,
>  				    struct iommufd_hw_pagetable *hwpt,
>  				    phys_addr_t sw_msi_start)
> diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> index 88b00c501015..809f2dd73b9e 100644
> --- a/drivers/vfio/iommufd.c
> +++ b/drivers/vfio/iommufd.c
> @@ -66,6 +66,20 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
>  		vdev->ops->unbind_iommufd(vdev);
>  }
>  
> +struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev)
> +{
> +	if (!vdev->iommufd_device)
> +		return NULL;
> +	return iommufd_device_to_ictx(vdev->iommufd_device);
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ictx);
> +
> +void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
> +{
> +	if (vdev->iommufd_device)
> +		*id = iommufd_device_to_id(vdev->iommufd_device);
Since there is no return value, it may be worth adding at least a WARN_ON
in case of !vdev->iommufd_device.
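
To make the suggestion concrete, here is a minimal userspace sketch, not
kernel code. The two struct layouts, the warn_hits counter, the warn_on()
helper and the WARN_ON() macro below are stand-ins invented for
illustration (the real WARN_ON() is provided by the kernel), and the id is
stored directly in the model struct rather than in idev->obj.id. It only
shows the shape of the check: warn and bail out when no iommufd_device is
bound, instead of silently leaving *id untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel types; only the fields used here are modelled. */
struct iommufd_device { uint32_t id; };
struct vfio_device { struct iommufd_device *iommufd_device; };

/* Counts how many times the warning fired, so callers can observe it. */
static unsigned int warn_hits;

/* Hypothetical userspace replacement for the kernel's WARN_ON(). */
static bool warn_on(bool cond, const char *expr)
{
	if (cond) {
		warn_hits++;
		fprintf(stderr, "WARNING: %s\n", expr);
	}
	return cond;
}
#define WARN_ON(cond) warn_on((cond), #cond)

/* Same shape as the helper in the patch, plus the suggested warning. */
static void vfio_iommufd_physical_devid(struct vfio_device *vdev, uint32_t *id)
{
	assert(vdev && id);
	if (WARN_ON(!vdev->iommufd_device))
		return;	/* *id is left untouched, but the misuse is now visible */
	*id = vdev->iommufd_device->id;
}
```

With this shape a caller that passes an unbound device still gets no id
back, but the problem shows up in the log rather than going unnoticed.
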
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);
>  /*
>   * The physical standard ops mean that the iommufd_device is bound to the
>   * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
> diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> index 1129a36a74c4..ac96df406833 100644
> --- a/include/linux/iommufd.h
> +++ b/include/linux/iommufd.h
> @@ -24,6 +24,9 @@ void iommufd_device_unbind(struct iommufd_device *idev);
>  int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
>  void iommufd_device_detach(struct iommufd_device *idev);
>  
> +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
> +u32 iommufd_device_to_id(struct iommufd_device *idev);
> +
>  struct iommufd_access_ops {
>  	u8 needs_pin_pages : 1;
>  	void (*unmap)(void *data, unsigned long iova, unsigned long length);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 3188d8a374bd..97a1174b922f 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -113,6 +113,8 @@ struct vfio_device_ops {
>  };
>  
>  #if IS_ENABLED(CONFIG_IOMMUFD)
> +struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev);
> +void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id);
>  int vfio_iommufd_physical_bind(struct vfio_device *vdev,
>  			       struct iommufd_ctx *ictx, u32 *out_device_id);
>  void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
> @@ -122,6 +124,17 @@ int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
>  void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
>  int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
>  #else
> +static inline struct iommufd_ctx *
> +vfio_iommufd_physical_ictx(struct vfio_device *vdev)
> +{
> +	return NULL;
> +}
> +
> +static inline void
> +vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
> +{
> +}
> +
>  #define vfio_iommufd_physical_bind                                      \
>  	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
>  		  u32 *out_device_id)) NULL)
Besides that:

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric



* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-04 15:18       ` Eric Auger
@ 2023-04-04 15:29         ` Liu, Yi L
  2023-04-04 15:59           ` Eric Auger
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-04 15:29 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Tuesday, April 4, 2023 11:19 PM
> 
> Hi Yi,
> 
> On 4/4/23 16:37, Liu, Yi L wrote:
> > Hi Eric,
> >
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Sent: Tuesday, April 4, 2023 10:00 PM
> >>
> >> Hi YI,
> >>
> >> On 4/1/23 16:44, Yi Liu wrote:
> >>> If the affected device is not opened by any user, it's safe to reset it
> >>> given it's not in use.
> >>>
> >>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> >>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> >>> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> >>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >>> ---
> >>>  drivers/vfio/pci/vfio_pci_core.c | 14 +++++++++++---
> >>>  include/uapi/linux/vfio.h        |  8 ++++++++
> >>>  2 files changed, 19 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >>> index 65bbef562268..5d745c9abf05 100644
> >>> --- a/drivers/vfio/pci/vfio_pci_core.c
> >>> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >>> @@ -2429,10 +2429,18 @@ static int vfio_pci_dev_set_hot_reset(struct
> >> vfio_device_set *dev_set,
> >>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
> >>>  		/*
> >>> -		 * Test whether all the affected devices are contained by the
> >>> -		 * set of groups provided by the user.
> >>> +		 * Test whether all the affected devices can be reset by the
> >>> +		 * user.
> >>> +		 *
> >>> +		 * Resetting an unused device (not opened) is safe, because
> >>> +		 * dev_set->lock is held in hot reset path so this device
> >>> +		 * cannot race being opened by another user simultaneously.
> >>> +		 *
> >>> +		 * Otherwise all opened devices in the dev_set must be
> >>> +		 * contained by the set of groups provided by the user.
> >>>  		 */
> >>> -		if (!vfio_dev_in_groups(cur_vma, groups)) {
> >>> +		if (cur_vma->vdev.open_count &&
> >>> +		    !vfio_dev_in_groups(cur_vma, groups)) {
> >>>  			ret = -EINVAL;
> >>>  			goto err_undo;
> >>>  		}
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index 0552e8dcf0cb..f96e5689cffc 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h
> >>> @@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
> >>>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
> >>>   *				    struct vfio_pci_hot_reset)
> >>>   *
> >>> + * Userspace requests hot reset for the devices it uses.  Due to the
> >>> + * underlying topology, multiple devices can be affected in the reset
> >> by the reset
> >>> + * while some might be opened by another user.  To avoid interference
> >> s/interference/hot reset failure?
> > I don’t think user can really avoid hot reset failure since there may
> > be new devices plugged into the affected slot. Even user has opened
> I don't know the legacy wrt that issue but this sounds a serious issue,
> meaning the reset of an assigned device could impact another device
> belonging to another group not not owned by the user?

But the hot reset shall fail, as the group is not owned by the user.

> > all the groups/devices reported by VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
> > the hot reset can fail if new device is plugged in and has not been
> > bound to vfio or opened by another user during the window of
> > _INFO and HOT_RESET.
> with respect to the latter isn't the dev_set lock held during the hot
> reset and sufficient to prevent any new opening to occur?

Yes. A new open needs to acquire the dev_set lock, so once hot reset has
acquired the dev_set lock, no new open can occur.
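
A rough userspace model of that serialization, for illustration only. The
struct names and helpers (dev_set_model, dev_model, dev_set_trylock(),
dev_set_unlock(), dev_try_open()) are made up for this sketch, and a C11
atomic_flag stands in for the kernel mutex. The real open path sleeps on
dev_set->lock; the model returns "busy" instead so that it can be
exercised from a single thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for struct vfio_device_set; only the lock is modelled. */
struct dev_set_model {
	atomic_flag lock;	/* plays the role of dev_set->lock */
};

/* Toy stand-in for a device; open_count mirrors vdev.open_count. */
struct dev_model {
	struct dev_set_model *set;
	unsigned int open_count;
};

/* Try to take the lock; returns true if it was free and is now ours. */
static bool dev_set_trylock(struct dev_set_model *set)
{
	return !atomic_flag_test_and_set(&set->lock);
}

static void dev_set_unlock(struct dev_set_model *set)
{
	atomic_flag_clear(&set->lock);
}

/*
 * Open path: the open count only changes with the lock held. While a hot
 * reset holds the lock, an unopened device therefore stays unopened.
 */
static bool dev_try_open(struct dev_model *dev)
{
	assert(dev && dev->set);
	if (!dev_set_trylock(dev->set))
		return false;	/* a hot reset (or another open) holds the lock */
	dev->open_count++;
	dev_set_unlock(dev->set);
	return true;
}
```

If the reset path takes the lock first, dev_try_open() fails and
open_count stays 0 until the lock is dropped, which is why checking
open_count under the lock is enough.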

Regards,
Yi Liu

> >
> > maybe the whole statement should be as below:
> >
> > To avoid interference, the hot reset can only be conducted when all
> > the affected devices are either opened by the calling user or not
> > opened yet at the moment of the hot reset attempt.
> 
> OK
> 
> Eric
> >
> >>> + * the calling user must ensure all affected devices, if opened, are
> >>> + * owned by itself.
> >>> + *
> >>> + * The ownership is proved by an array of group fds.
> >>> + *
> >>>   * Return: 0 on success, -errno on failure.
> >>>   */
> >>>  struct vfio_pci_hot_reset {
> > Regards,
> > Yi Liu



* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-04 15:29         ` Liu, Yi L
@ 2023-04-04 15:59           ` Eric Auger
  2023-04-05 11:41             ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-04 15:59 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting



On 4/4/23 17:29, Liu, Yi L wrote:
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Tuesday, April 4, 2023 11:19 PM
>>
>> Hi Yi,
>>
>> On 4/4/23 16:37, Liu, Yi L wrote:
>>> Hi Eric,
>>>
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Sent: Tuesday, April 4, 2023 10:00 PM
>>>>
>>>> Hi YI,
>>>>
>>>> On 4/1/23 16:44, Yi Liu wrote:
>>>>> If the affected device is not opened by any user, it's safe to reset it
>>>>> given it's not in use.
>>>>>
>>>>> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
>>>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>> ---
>>>>>  drivers/vfio/pci/vfio_pci_core.c | 14 +++++++++++---
>>>>>  include/uapi/linux/vfio.h        |  8 ++++++++
>>>>>  2 files changed, 19 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>>> index 65bbef562268..5d745c9abf05 100644
>>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>>> @@ -2429,10 +2429,18 @@ static int vfio_pci_dev_set_hot_reset(struct
>>>> vfio_device_set *dev_set,
>>>>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>>>>>  		/*
>>>>> -		 * Test whether all the affected devices are contained by the
>>>>> -		 * set of groups provided by the user.
>>>>> +		 * Test whether all the affected devices can be reset by the
>>>>> +		 * user.
>>>>> +		 *
>>>>> +		 * Resetting an unused device (not opened) is safe, because
>>>>> +		 * dev_set->lock is held in hot reset path so this device
>>>>> +		 * cannot race being opened by another user simultaneously.
>>>>> +		 *
>>>>> +		 * Otherwise all opened devices in the dev_set must be
>>>>> +		 * contained by the set of groups provided by the user.
>>>>>  		 */
>>>>> -		if (!vfio_dev_in_groups(cur_vma, groups)) {
>>>>> +		if (cur_vma->vdev.open_count &&
>>>>> +		    !vfio_dev_in_groups(cur_vma, groups)) {
>>>>>  			ret = -EINVAL;
>>>>>  			goto err_undo;
>>>>>  		}
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index 0552e8dcf0cb..f96e5689cffc 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
>>>>>   * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
>>>>>   *				    struct vfio_pci_hot_reset)
>>>>>   *
>>>>> + * Userspace requests hot reset for the devices it uses.  Due to the
>>>>> + * underlying topology, multiple devices can be affected in the reset
>>>> by the reset
>>>>> + * while some might be opened by another user.  To avoid interference
>>>> s/interference/hot reset failure?
>>> I don’t think user can really avoid hot reset failure since there may
>>> be new devices plugged into the affected slot. Even user has opened
>> I don't know the legacy wrt that issue but this sounds a serious issue,
>> meaning the reset of an assigned device could impact another device
>> belonging to another group not not owned by the user?
> but the hot reset shall fail as the group is not owned by the user.

Sure it shall, but I fail to understand whether the reset fails or the
device plug is somehow delayed until the reset completes.
>
>>> all the groups/devices reported by VFIO_DEVICE_GET_PCI_HOT_RESET_INFO,
>>> the hot reset can fail if new device is plugged in and has not been
>>> bound to vfio or opened by another user during the window of
>>> _INFO and HOT_RESET.
>> with respect to the latter isn't the dev_set lock held during the hot
>> reset and sufficient to prevent any new opening to occur?
> yes. new open needs to acquire the dev_set lock. So when hot reset
> acquires the dev_set lock, then no new open can occur. 
>
> Regards,
> Yi Liu
>
>>> maybe the whole statement should be as below:
>>>
>>> To avoid interference, the hot reset can only be conducted when all
>>> the affected devices are either opened by the calling user or not
>>> opened yet at the moment of the hot reset attempt.
>> OK
>>
>> Eric
>>>>> + * the calling user must ensure all affected devices, if opened, are
>>>>> + * owned by itself.
>>>>> + *
>>>>> + * The ownership is proved by an array of group fds.
>>>>> + *
>>>>>   * Return: 0 on success, -errno on failure.
>>>>>   */
>>>>>  struct vfio_pci_hot_reset {
>>> Regards,
>>> Yi Liu



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
@ 2023-04-04 16:54   ` Eric Auger
  2023-04-04 20:18   ` Alex Williamson
  1 sibling, 0 replies; 145+ messages in thread
From: Eric Auger @ 2023-04-04 16:54 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi,

On 4/1/23 16:44, Yi Liu wrote:
> as an alternative method for ownership check when iommufd is used. In
I don't understand the 1st sentence.
> this case all opened devices in the affected dev_set are verified to
> be bound to a same valid iommufd value to allow reset. It's simpler
> and faster as user does not need to pass a set of fds and kernel no
kernel does not need to search
> need to search the device within the given fds.
>
> a device in noiommu mode doesn't have a valid iommufd, so this method
> should not be used in a dev_set which contains multiple devices and one
> of them is in noiommu. The only allowed noiommu scenario is that the
> calling device is noiommu and it's in a singleton dev_set.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++++++++-----
>  include/uapi/linux/vfio.h        |  9 ++++++-
>  2 files changed, 44 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3696b8e58445..b68fcba67a4b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  struct vfio_pci_group_info;
>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups);
> +				      struct vfio_pci_group_info *groups,
> +				      struct iommufd_ctx *iommufd_ctx);
>  
>  /*
>   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> @@ -1277,7 +1278,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  		return ret;
>  
>  	/* Somewhere between 1 and count is OK */
> -	if (!hdr->count || hdr->count > count)
> +	if (hdr->count > count)
Then I would simply remove the above comment, since the !count check is
now done by the caller.
>  		return -EINVAL;
>  
>  	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> @@ -1326,7 +1327,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	info.count = hdr->count;
>  	info.files = files;
>  
> -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
>  
>  hot_reset_release:
>  	for (file_idx--; file_idx >= 0; file_idx--)
> @@ -1341,6 +1342,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  {
>  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
>  	struct vfio_pci_hot_reset hdr;
> +	struct iommufd_ctx *iommufd;
>  	bool slot = false;
>  
>  	if (copy_from_user(&hdr, arg, minsz))
> @@ -1355,7 +1357,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	else if (pci_probe_reset_bus(vdev->pdev->bus))
>  		return -ENODEV;
>  
> -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +	if (hdr.count)
> +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +
> +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +
> +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
>  }
>  
>  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> @@ -2327,6 +2334,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
>  {
>  	unsigned int i;
>  
> +	if (!groups)
> +		return false;
> +
>  	for (i = 0; i < groups->count; i++)
>  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
>  			return true;
> @@ -2402,13 +2412,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
>  	return ret;
>  }
>  
> +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> +				    struct iommufd_ctx *iommufd_ctx)
> +{
> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +
> +	if (!iommufd)
> +		return false;
> +
> +	return iommufd == iommufd_ctx;
> +}
> +
>  /*
>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>   * therefore we need to zap and hold the vma_lock for each device, and only then
>   * get each memory_lock.
>   */
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups)
> +				      struct vfio_pci_group_info *groups,
> +				      struct iommufd_ctx *iommufd_ctx)
>  {
>  	struct vfio_pci_core_device *cur_mem;
>  	struct vfio_pci_core_device *cur_vma;
> @@ -2448,9 +2470,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		 *
>  		 * Otherwise all opened devices in the dev_set must be
>  		 * contained by the set of groups provided by the user.
> +		 *
> +		 * If user provides a zero-length array, then all the
> +		 * opened devices must be bound to a same iommufd_ctx.
> +		 *
> +		 * If all above checks are failed, reset is allowed only if
> +		 * the calling device is in a singleton dev_set.
>  		 */
>  		if (cur_vma->vdev.open_count &&
> -		    !vfio_dev_in_groups(cur_vma, groups)) {
> +		    !vfio_dev_in_groups(cur_vma, groups) &&
> +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
> +		    (dev_set->device_count > 1)) {
>  			ret = -EINVAL;
>  			goto err_undo;
>  		}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index f96e5689cffc..17aa5d09db41 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -679,7 +679,14 @@ struct vfio_pci_hot_reset_info {
>   * the calling user must ensure all affected devices, if opened, are
>   * owned by itself.
>   *
> - * The ownership is proved by an array of group fds.
> + * The ownership can be proved by:
> + *   - An array of group fds
> + *   - A zero-length array

I would suggest something along these lines:
- If a non-empty group fd array is passed, the devices affected by the
  reset must belong to those opened VFIO groups.
- If a zero-length array is passed, the other devices affected by the
  reset, if any, must be bound to the same iommufd as this VFIO device.
Either of the two methods is applied to check the feasibility of the reset.
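
From the userspace side the two methods could look roughly as below. This
is only a sketch under assumptions: struct hot_reset_req is a local mirror
of the argsz/flags/count/fd-array layout of struct vfio_pci_hot_reset, and
hot_reset_req_build() is a hypothetical helper, not an existing API. Real
code should include <linux/vfio.h> and use the uapi struct directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Local mirror of the layout of struct vfio_pci_hot_reset
 * (argsz, flags, count, then the fd array), for illustration only.
 */
struct hot_reset_req {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	int32_t group_fds[];
};

/*
 * Build a request. count == 0 selects the zero-length-array method
 * (ownership checked through the bound iommufd); count > 0 passes group fds.
 * Returns a heap buffer the caller frees, or NULL on allocation failure.
 */
static struct hot_reset_req *hot_reset_req_build(const int32_t *fds,
						  uint32_t count)
{
	size_t sz = sizeof(struct hot_reset_req) +
		    (size_t)count * sizeof(int32_t);
	struct hot_reset_req *req;

	assert(count == 0 || fds != NULL);
	req = calloc(1, sz);	/* zeroed, so flags stays 0 */
	if (!req)
		return NULL;
	req->argsz = (uint32_t)sz;
	req->count = count;
	if (count)
		memcpy(req->group_fds, fds, (size_t)count * sizeof(int32_t));
	return req;
}
```

The resulting buffer would then be handed to
ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, req); with count == 0 no fd
follows the header at all.
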
> + *
> + * In the last case all affected devices which are opened by this user
> + * must have been bound to a same iommufd. If the calling device is in
> + * noiommu mode (no valid iommufd) then it can be reset only if the reset
> + * doesn't affect other devices.
and keep that too
>   *
>   * Return: 0 on success, -errno on failure.
>   */
Thanks

Eric



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
  2023-04-04 16:54   ` Eric Auger
@ 2023-04-04 20:18   ` Alex Williamson
  2023-04-05  7:55     ` Liu, Yi L
  2023-04-05  8:02     ` Eric Auger
  1 sibling, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-04 20:18 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, yanting.jiang, joro, nicolinc, jgg,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Sat,  1 Apr 2023 07:44:22 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> as an alternative method for ownership check when iommufd is used. In
> this case all opened devices in the affected dev_set are verified to
> be bound to a same valid iommufd value to allow reset. It's simpler
> and faster as user does not need to pass a set of fds and kernel no
> need to search the device within the given fds.
> 
> a device in noiommu mode doesn't have a valid iommufd, so this method
> should not be used in a dev_set which contains multiple devices and one
> of them is in noiommu. The only allowed noiommu scenario is that the
> calling device is noiommu and it's in a singleton dev_set.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++++++++-----
>  include/uapi/linux/vfio.h        |  9 ++++++-
>  2 files changed, 44 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3696b8e58445..b68fcba67a4b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  struct vfio_pci_group_info;
>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups);
> +				      struct vfio_pci_group_info *groups,
> +				      struct iommufd_ctx *iommufd_ctx);
>  
>  /*
>   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> @@ -1277,7 +1278,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  		return ret;
>  
>  	/* Somewhere between 1 and count is OK */
> -	if (!hdr->count || hdr->count > count)
> +	if (hdr->count > count)
>  		return -EINVAL;
>  
>  	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> @@ -1326,7 +1327,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	info.count = hdr->count;
>  	info.files = files;
>  
> -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
>  
>  hot_reset_release:
>  	for (file_idx--; file_idx >= 0; file_idx--)
> @@ -1341,6 +1342,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  {
>  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
>  	struct vfio_pci_hot_reset hdr;
> +	struct iommufd_ctx *iommufd;
>  	bool slot = false;
>  
>  	if (copy_from_user(&hdr, arg, minsz))
> @@ -1355,7 +1357,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  	else if (pci_probe_reset_bus(vdev->pdev->bus))
>  		return -ENODEV;
>  
> -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +	if (hdr.count)
> +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +
> +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +
> +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
>  }
>  
>  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> @@ -2327,6 +2334,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
>  {
>  	unsigned int i;
>  
> +	if (!groups)
> +		return false;
> +
>  	for (i = 0; i < groups->count; i++)
>  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
>  			return true;
> @@ -2402,13 +2412,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
>  	return ret;
>  }
>  
> +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> +				    struct iommufd_ctx *iommufd_ctx)
> +{
> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +
> +	if (!iommufd)
> +		return false;
> +
> +	return iommufd == iommufd_ctx;
> +}
> +
>  /*
>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>   * therefore we need to zap and hold the vma_lock for each device, and only then
>   * get each memory_lock.
>   */
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups)
> +				      struct vfio_pci_group_info *groups,
> +				      struct iommufd_ctx *iommufd_ctx)
>  {
>  	struct vfio_pci_core_device *cur_mem;
>  	struct vfio_pci_core_device *cur_vma;
> @@ -2448,9 +2470,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		 *
>  		 * Otherwise all opened devices in the dev_set must be
>  		 * contained by the set of groups provided by the user.
> +		 *
> +		 * If user provides a zero-length array, then all the
> +		 * opened devices must be bound to a same iommufd_ctx.
> +		 *
> +		 * If all above checks are failed, reset is allowed only if
> +		 * the calling device is in a singleton dev_set.
>  		 */
>  		if (cur_vma->vdev.open_count &&
> -		    !vfio_dev_in_groups(cur_vma, groups)) {
> +		    !vfio_dev_in_groups(cur_vma, groups) &&
> +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
> +		    (dev_set->device_count > 1)) {

This last condition looks buggy to me, we need all conditions to be
true to generate an error here, which means that for a singleton
dev_set, it doesn't matter what group fds are passed, if any, or whether
the iommufd context matches.  I think in fact this means that the empty
array path is equally available for group use cases with a singleton
dev_set, but we don't enable it for multiple device dev_sets like we do
iommufd.
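
To spell the logic out, here is a small standalone model of the quoted
condition. The struct check_in type and v3_check_rejects() are
hypothetical names used only for this sketch; the four inputs correspond
to cur_vma->vdev.open_count, the results of vfio_dev_in_groups() and
vfio_dev_in_iommufd_ctx(), and dev_set->device_count.

```c
#include <assert.h>
#include <stdbool.h>

/* Inputs of the per-device check, reduced to plain booleans/counters. */
struct check_in {
	unsigned int open_count;	/* cur_vma->vdev.open_count */
	bool in_groups;			/* vfio_dev_in_groups() result */
	bool in_iommufd_ctx;		/* vfio_dev_in_iommufd_ctx() result */
	unsigned int device_count;	/* dev_set->device_count */
};

/* Returns true when the v3 condition would reject the reset (-EINVAL). */
static bool v3_check_rejects(const struct check_in *in)
{
	assert(in);
	return in->open_count &&
	       !in->in_groups &&
	       !in->in_iommufd_ctx &&
	       in->device_count > 1;
}
```

Because the terms are ANDed, device_count == 1 makes the whole expression
false regardless of the other three inputs, so an opened device in a
singleton dev_set passes even when neither the group nor the iommufd
check matches.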

You pointed out a previous issue with hot-reset info and no-iommu where
if other affected devices are not bound to vfio-pci the info ioctl
returns error.  That's handled in the hot-reset ioctl by the fact that
all affected devices must be in the dev_set and therefore bound to
vfio-pci drivers.  So it seems to me that, aside from the spurious error
from the info ioctl (we can't report an iommu group when none exists,
and never thought to invent an invalid group for debugging), hot-reset
otherwise works with no-iommu just like it does for iommu backed
devices.  We don't currently require singleton no-iommu dev_sets afaict.

I'll also note that if the dev_set is singleton, this suggests that
pci_reset_function() can make use of bus reset, so a hot-reset is
accessible via VFIO_DEVICE_RESET if the appropriate reset method is
selected.

Therefore, I think as written, the singleton dev_set hot-reset is
enabled for iommufd and (unintentionally?) for the group path, while
also negating a requirement for a group fd or that a provided group fd
actually matches the device in this latter case.  The null-array
approach is not however extended to groups for more general use.
Additionally, limiting no-iommu hot-reset to singleton dev_sets
provides only a marginal functional difference vs VFIO_DEVICE_RESET.
Thanks,

Alex

>  			ret = -EINVAL;
>  			goto err_undo;
>  		}
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index f96e5689cffc..17aa5d09db41 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -679,7 +679,14 @@ struct vfio_pci_hot_reset_info {
>   * the calling user must ensure all affected devices, if opened, are
>   * owned by itself.
>   *
> - * The ownership is proved by an array of group fds.
> + * The ownership can be proved by:
> + *   - An array of group fds
> + *   - A zero-length array
> + *
> + * In the last case all affected devices which are opened by this user
> + * must have been bound to a same iommufd. If the calling device is in
> + * noiommu mode (no valid iommufd) then it can be reset only if the reset
> + * doesn't affect other devices.
>   *
>   * Return: 0 on success, -errno on failure.
>   */



* Re: [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path Yi Liu
@ 2023-04-04 20:31   ` Alex Williamson
  2023-04-05  8:07   ` Eric Auger
  1 sibling, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-04 20:31 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, yanting.jiang, joro, nicolinc, jgg,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Sat,  1 Apr 2023 07:44:24 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> This extends both vfio_file_is_valid() and vfio_file_has_dev() to accept
> device file from the vfio PCI hot reset.
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_main.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index fe7446805afd..ebbb6b91a498 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1154,13 +1154,23 @@ const struct file_operations vfio_device_fops = {
>  	.mmap		= vfio_device_fops_mmap,
>  };
>  
> +static struct vfio_device *vfio_device_from_file(struct file *file)
> +{
> +	struct vfio_device *device = file->private_data;
> +
> +	if (file->f_op != &vfio_device_fops)
> +		return NULL;
> +	return device;
> +}
> +
>  /**
>   * vfio_file_is_valid - True if the file is valid vfio file
>   * @file: VFIO group file or VFIO device file
>   */
>  bool vfio_file_is_valid(struct file *file)
>  {
> -	return vfio_group_from_file(file);
> +	return vfio_group_from_file(file) ||
> +	       vfio_device_from_file(file);
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_is_valid);
>  
> @@ -1174,12 +1184,17 @@ EXPORT_SYMBOL_GPL(vfio_file_is_valid);
>  bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
>  {
>  	struct vfio_group *group;
> +	struct vfio_device *vdev;
>  
>  	group = vfio_group_from_file(file);
> -	if (!group)
> -		return false;
> +	if (group)
> +		return vfio_group_has_dev(group, device);
> +
> +	vdev = vfio_device_from_file(file);
> +	if (vdev)
> +		return vdev == device;
>  
> -	return vfio_group_has_dev(group, device);
> +	return false;

Nit, unless we expect to be testing against NULL devices, this could
just be:

	return device == vfio_device_from_file(file);

Thanks,
Alex
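
A minimal userspace sketch of the two forms of the device-fd check, to
show where they agree and where they differ.  The struct layouts and
the is_device_file flag are hypothetical simplifications (the flag
models the f_op == &vfio_device_fops test); only the control flow is
taken from the patch and the suggestion above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
struct vfio_device { int unused; };
struct file {
	bool is_device_file; /* models f_op == &vfio_device_fops */
	void *private_data;
};

static struct vfio_device *vfio_device_from_file(struct file *file)
{
	if (!file->is_device_file)
		return NULL;
	return file->private_data;
}

/* Device-fd half of vfio_file_has_dev() as posted. */
static bool has_dev_posted(struct file *file, struct vfio_device *device)
{
	struct vfio_device *vdev = vfio_device_from_file(file);

	if (vdev)
		return vdev == device;
	return false;
}

/* The same check with the suggested one-line form. */
static bool has_dev_simplified(struct file *file, struct vfio_device *device)
{
	return device == vfio_device_from_file(file);
}
```

The two agree for every non-NULL device.  They differ only when a NULL
device is compared against a file that is not a vfio device file,
which is the case the nit assumes never happens.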



* Re: [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi Yi Liu
@ 2023-04-04 21:00   ` Alex Williamson
  2023-04-05  9:31     ` Liu, Yi L
  2023-04-05 11:46   ` Eric Auger
  1 sibling, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-04 21:00 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, yanting.jiang, joro, nicolinc, jgg,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Sat,  1 Apr 2023 07:44:28 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> as there are IOMMUFD users that want to check whether an ID generated
> by IOMMUFD is valid or not. e.g. vfio-pci optionally returns an invalid
> dev_id to user in the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. User
> needs to check if the ID is valid or not.
> 
> IOMMUFD_INVALID_ID is defined as 0 since the IDs generated by IOMMUFD
> starts from 0.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/uapi/linux/iommufd.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 98ebba80cfa1..aeae73a93833 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -9,6 +9,9 @@
>  
>  #define IOMMUFD_TYPE (';')
>  
> +/* IDs allocated by IOMMUFD starts from 0 */
> +#define IOMMUFD_INVALID_ID 0
> +
>  /**
>   * DOC: General ioctl format
>   *

If allocation "starts from 0" then 0 is a valid id, no?  Does allocation
start from 1, ie. skip 0?  Thanks,

Alex
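
For illustration, a toy allocator for which 0 would be a safe
"invalid" value.  This is a sketch under the assumption that
allocation skips 0; id_alloc, id_alloc_init() and id_alloc_get() are
hypothetical names and say nothing about how IOMMUFD itself allocates
object IDs.

```c
#include <stdint.h>

/* Hypothetical sentinel; only sound if 0 is never handed out. */
#define INVALID_ID 0u

struct id_alloc {
	uint32_t next; /* next ID to hand out, 0 once exhausted */
};

static void id_alloc_init(struct id_alloc *a)
{
	a->next = 1; /* skip 0 so it can serve as the sentinel */
}

static uint32_t id_alloc_get(struct id_alloc *a)
{
	if (a->next == 0) /* counter wrapped: no IDs left */
		return INVALID_ID;
	return a->next++;
}
```

If the first ID handed out were 0 instead, a caller could not tell a
valid first object from the sentinel, which is the concern raised
above.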



* Re: [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in hot reset path
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in " Yi Liu
@ 2023-04-04 21:23   ` Alex Williamson
  2023-04-05  9:32   ` Eric Auger
  1 sibling, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-04 21:23 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, yanting.jiang, joro, nicolinc, jgg,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Sat,  1 Apr 2023 07:44:25 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> No functional change is intended.
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 52 ++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 2a510b71edcb..da6325008872 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -177,10 +177,10 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  	}
>  }
>  
> -struct vfio_pci_group_info;
> +struct vfio_pci_file_info;
>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups,
> +				      struct vfio_pci_file_info *info,
>  				      struct iommufd_ctx *iommufd_ctx);
>  
>  /*
> @@ -800,7 +800,7 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	return 0;
>  }
>  
> -struct vfio_pci_group_info {
> +struct vfio_pci_file_info {
>  	int count;
>  	struct file **files;
>  };
> @@ -1257,14 +1257,14 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  }
>  
>  static int
> -vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
> -				    struct vfio_pci_hot_reset *hdr,
> -				    bool slot,
> -				    struct vfio_pci_hot_reset __user *arg)
> +vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
> +				   struct vfio_pci_hot_reset *hdr,
> +				   bool slot,
> +				   struct vfio_pci_hot_reset __user *arg)
>  {
> -	int32_t *group_fds;
> +	int32_t *fds;
>  	struct file **files;
> -	struct vfio_pci_group_info info;
> +	struct vfio_pci_file_info info;
>  	int file_idx, count = 0, ret = 0;
>  
>  	/*
> @@ -1281,17 +1281,17 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	if (hdr->count > count)
>  		return -EINVAL;
>  
> -	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> +	fds = kcalloc(hdr->count, sizeof(*fds), GFP_KERNEL);
>  	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
> -	if (!group_fds || !files) {
> -		kfree(group_fds);
> +	if (!fds || !files) {
> +		kfree(fds);
>  		kfree(files);
>  		return -ENOMEM;
>  	}
>  
> -	if (copy_from_user(group_fds, arg->group_fds,
> -			   hdr->count * sizeof(*group_fds))) {
> -		kfree(group_fds);
> +	if (copy_from_user(fds, arg->group_fds,
> +			   hdr->count * sizeof(*fds))) {
> +		kfree(fds);
>  		kfree(files);
>  		return -EFAULT;
>  	}
> @@ -1301,7 +1301,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	 * the reset
>  	 */
>  	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
> -		struct file *file = fget(group_fds[file_idx]);
> +		struct file *file = fget(fds[file_idx]);
>  
>  		if (!file) {
>  			ret = -EBADF;
> @@ -1318,9 +1318,9 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  		files[file_idx] = file;
>  	}
>  
> -	kfree(group_fds);
> +	kfree(fds);
>  
> -	/* release reference to groups on error */
> +	/* release reference to fds on error */
>  	if (ret)
>  		goto hot_reset_release;
>  
> @@ -1358,7 +1358,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  		return -ENODEV;
>  
>  	if (hdr.count)
> -		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +		return vfio_pci_ioctl_pci_hot_reset_files(vdev, &hdr, slot, arg);
>  
>  	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
>  
> @@ -2329,16 +2329,16 @@ const struct pci_error_handlers vfio_pci_core_err_handlers = {
>  };
>  EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
>  
> -static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> -			       struct vfio_pci_group_info *groups)
> +static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
> +			      struct vfio_pci_file_info *info)
>  {
>  	unsigned int i;
>  
> -	if (!groups)
> +	if (!info)
>  		return false;
>  
> -	for (i = 0; i < groups->count; i++)
> -		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> +	for (i = 0; i < info->count; i++)
> +		if (vfio_file_has_dev(info->files[i], &vdev->vdev))
>  			return true;
>  	return false;
>  }
> @@ -2429,7 +2429,7 @@ static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
>   * get each memory_lock.
>   */
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups,
> +				      struct vfio_pci_file_info *info,
>  				      struct iommufd_ctx *iommufd_ctx)
>  {
>  	struct vfio_pci_core_device *cur_mem;
> @@ -2478,7 +2478,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		 * the calling device is in a singleton dev_set.
>  		 */
>  		if (cur_vma->vdev.open_count &&
> -		    !vfio_dev_in_groups(cur_vma, groups) &&
> +		    !vfio_dev_in_files(cur_vma, info) &&
>  		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
>  		    (dev_set->device_count > 1)) {
>  			ret = -EINVAL;

At this point, vfio_dev_in_files() supports both group and cdev fds and
these can be used for both regular IOMMU protected and no-IOMMU
devices AFAICT.  We only add this 1-off dev_set device count test and
its subtle side-effects in order to support the null-array mode, which
IMO really has yet to be shown as a requirement.

IIRC, we were wanting to add that mode as part of the cdev interface so
that the existence of cdevs implies this support, but now we're already
making use of vfio_pci_hot_reset_info.flags to indicate group-id vs
dev-id in the output, so does anything prevent us from setting another
bit there if/when this feature proves itself useful and error free, to
indicate it's an available mode for the hot-reset ioctl?

With that I think we could drop patches 4 & 5 with a plan for
introducing them later without trying to strong arm the feature in
without a proven and available use case now.  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-04 15:28   ` Eric Auger
@ 2023-04-04 21:48     ` Alex Williamson
  2023-04-21  7:11       ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-04 21:48 UTC (permalink / raw)
  To: Eric Auger
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, Yi Liu, kvm, lulu, yanting.jiang, joro, nicolinc,
	jgg, yan.y.zhao, intel-gfx, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, 4 Apr 2023 17:28:40 +0200
Eric Auger <eric.auger@redhat.com> wrote:

> Hi,
> 
> On 4/1/23 16:44, Yi Liu wrote:
> > This is needed by the vfio-pci driver to report affected devices in the
> > hot reset for a given device.
> >
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/iommu/iommufd/device.c | 12 ++++++++++++
> >  drivers/vfio/iommufd.c         | 14 ++++++++++++++
> >  include/linux/iommufd.h        |  3 +++
> >  include/linux/vfio.h           | 13 +++++++++++++
> >  4 files changed, 42 insertions(+)
> >
> > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > index 25115d401d8f..04a57aa1ae2c 100644
> > --- a/drivers/iommu/iommufd/device.c
> > +++ b/drivers/iommu/iommufd/device.c
> > @@ -131,6 +131,18 @@ void iommufd_device_unbind(struct iommufd_device *idev)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
> >  
> > +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
> > +{
> > +	return idev->ictx;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
> > +
> > +u32 iommufd_device_to_id(struct iommufd_device *idev)
> > +{
> > +	return idev->obj.id;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, IOMMUFD);
> > +
> >  static int iommufd_device_setup_msi(struct iommufd_device *idev,
> >  				    struct iommufd_hw_pagetable *hwpt,
> >  				    phys_addr_t sw_msi_start)
> > diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> > index 88b00c501015..809f2dd73b9e 100644
> > --- a/drivers/vfio/iommufd.c
> > +++ b/drivers/vfio/iommufd.c
> > @@ -66,6 +66,20 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
> >  		vdev->ops->unbind_iommufd(vdev);
> >  }
> >  
> > +struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev)
> > +{
> > +	if (!vdev->iommufd_device)
> > +		return NULL;
> > +	return iommufd_device_to_ictx(vdev->iommufd_device);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ictx);
> > +
> > +void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
> > +{
> > +	if (vdev->iommufd_device)
> > +		*id = iommufd_device_to_id(vdev->iommufd_device);  
> since there is no return value, it may be worth adding at least a WARN_ON
> in case of !vdev->iommufd_device

Yeah, this is bizarre and makes the one caller of this interface very
awkward.  We later go on to define IOMMUFD_INVALID_ID, so this should
simply return that in the case of no iommufd_device and skip this
unnecessary pointer passing.  Thanks,

Alex
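
A rough sketch of the shape suggested above, with the ID returned
directly instead of written through a pointer.  The two structs are
hypothetical, simplified stand-ins so the snippet compiles on its own;
in the patch the ID would come from iommufd_device_to_id(), and
IOMMUFD_INVALID_ID is the value defined in patch 11.

```c
#include <stddef.h>
#include <stdint.h>

#define IOMMUFD_INVALID_ID 0 /* value proposed in patch 11 of this series */

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct iommufd_device { uint32_t id; };
struct vfio_device { struct iommufd_device *iommufd_device; };

/* Return the dev_id, or IOMMUFD_INVALID_ID when no iommufd_device is bound. */
static uint32_t vfio_iommufd_physical_devid(const struct vfio_device *vdev)
{
	if (!vdev->iommufd_device)
		return IOMMUFD_INVALID_ID;
	return vdev->iommufd_device->id;
}
```

The caller can then assign the result straight into dev_id without
pre-initializing it or branching on whether the helper wrote anything.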

> > +}
> > +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);
> >  /*
> >   * The physical standard ops mean that the iommufd_device is bound to the
> >   * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
> > diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> > index 1129a36a74c4..ac96df406833 100644
> > --- a/include/linux/iommufd.h
> > +++ b/include/linux/iommufd.h
> > @@ -24,6 +24,9 @@ void iommufd_device_unbind(struct iommufd_device *idev);
> >  int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
> >  void iommufd_device_detach(struct iommufd_device *idev);
> >  
> > +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
> > +u32 iommufd_device_to_id(struct iommufd_device *idev);
> > +
> >  struct iommufd_access_ops {
> >  	u8 needs_pin_pages : 1;
> >  	void (*unmap)(void *data, unsigned long iova, unsigned long length);
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index 3188d8a374bd..97a1174b922f 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -113,6 +113,8 @@ struct vfio_device_ops {
> >  };
> >  
> >  #if IS_ENABLED(CONFIG_IOMMUFD)
> > +struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev);
> > +void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id);
> >  int vfio_iommufd_physical_bind(struct vfio_device *vdev,
> >  			       struct iommufd_ctx *ictx, u32 *out_device_id);
> >  void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
> > @@ -122,6 +124,17 @@ int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
> >  void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
> >  int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
> >  #else
> > +static inline struct iommufd_ctx *
> > +vfio_iommufd_physical_ictx(struct vfio_device *vdev)
> > +{
> > +	return NULL;
> > +}
> > +
> > +static inline void
> > +vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
> > +{
> > +}
> > +
> >  #define vfio_iommufd_physical_bind                                      \
> >  	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
> >  		  u32 *out_device_id)) NULL)  
> besides
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Eric
> 



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO Yi Liu
  2023-04-03  9:25   ` Liu, Yi L
@ 2023-04-04 22:20   ` Alex Williamson
  2023-04-05 12:19   ` Eric Auger
  2 siblings, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-04 22:20 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, yanting.jiang, joro, nicolinc, jgg,
	yan.y.zhao, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Sat,  1 Apr 2023 07:44:29 -0700
Yi Liu <yi.l.liu@intel.com> wrote:

> for the users that accept device fds passed from management stacks to be
> able to figure out the hot reset affected devices among the devices
> opened by the user. This is needed as such users do not have BDF (bus,
> devfn) knowledge about the devices they have opened, hence are unable to
> use the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> to figure out the affected devices.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>  2 files changed, 74 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 19f5b075d70a..a5a7e148dce1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -30,6 +30,7 @@
>  #if IS_ENABLED(CONFIG_EEH)
>  #include <asm/eeh.h>
>  #endif
> +#include <uapi/linux/iommufd.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	return 0;
>  }
>  
> +static struct vfio_device *
> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> +			       struct pci_dev *pdev)
> +{
> +	struct vfio_device *cur;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		if (cur->dev == &pdev->dev)
> +			return cur;
> +	return NULL;
> +}
> +
>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  {
>  	(*(int *)data)++;
> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  struct vfio_pci_fill_info {
>  	int max;
>  	int cur;
> +	bool require_devid;
> +	struct iommufd_ctx *iommufd;
> +	struct vfio_device_set *dev_set;
>  	struct vfio_pci_dependent_device *devices;

Poor structure packing, move the bool to the end.

Nit, maybe just name it @devid.

>  };
>  
>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_pci_fill_info *fill = data;
> +	struct vfio_device_set *dev_set = fill->dev_set;
>  	struct iommu_group *iommu_group;
> +	struct vfio_device *vdev;
> +
> +	lockdep_assert_held(&dev_set->lock);
>  
>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
>  
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->require_devid) {

Nit, @vdev could be scoped here.

> +		/*
> +		 * Report dev_id of the devices that are opened as cdev
> +		 * and have the same iommufd with the fill->iommufd.
> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);

I wish I had a better solution to this, but I don't.

> +		if (vdev && vfio_device_cdev_opened(vdev) &&
> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);

Long line, maybe a pointer to &fill->devices[fill->cur] would help.

> +		else
> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;
> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  		return -ENOMEM;
>  
>  	fill.devices = devices;
> +	fill.dev_set = vdev->vdev.dev_set;
>  
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> +		fill.require_devid = true;
> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	}

We can do this unconditionally:

	fill.devid = vfio_device_cdev_opened(&vdev->vdev);
	fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);

Thanks,
Alex

>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>  					    &fill, slot);
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>  
>  	/*
>  	 * If a device was removed between counting and filling, we may come up
>  	 * short of fill.max.  If a device was added, we'll have a return of
>  	 * -EAGAIN above.
>  	 */
> -	if (!ret)
> +	if (!ret) {
>  		hdr.count = fill.cur;
> +		if (fill.require_devid)
> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> +	}
>  
>  reset_info_exit:
>  	if (copy_to_user(arg, &hdr, minsz))
> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_device_set *dev_set = data;
> -	struct vfio_device *cur;
>  
> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> -		if (cur->dev == &pdev->dev)
> -			return 0;
> -	return -EBUSY;
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>  }
>  
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..5a34364e3b94 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,32 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device.  User could use the information reported by this command
> + * to figure out the affected devices among the devices it has opened.
> + * This command always reports the segment, bus and devfn information for
> + * each affected device, and selectively report the group_id or the dev_id
> + * per the way how the device being queried is opened.
> + *	- If the device is opened via the traditional group/container manner,
> + *	  this command reports the group_id for each affected device.
> + *
> + *	- If the device is opened as a cdev, this command needs to report
> + *	  dev_id for each affected device and set the
> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> + *	  devices that are not opened as cdev or bound to different iommufds
> + *	  with the device that is queried, report an invalid dev_id to avoid
> + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> + *	  affected devices, user shall fall back to use the segment, bus and
> + *	  devfn info to map it to opened device.
> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32   group_id;
> +		__u32	dev_id;
> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-04 20:18   ` Alex Williamson
@ 2023-04-05  7:55     ` Liu, Yi L
  2023-04-05  8:01       ` Liu, Yi L
  2023-04-05  8:02     ` Eric Auger
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05  7:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 5, 2023 4:19 AM
> 
> On Sat,  1 Apr 2023 07:44:22 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > as an alternative method for ownership check when iommufd is used. In
> > this case all opened devices in the affected dev_set are verified to
> > be bound to a same valid iommufd value to allow reset. It's simpler
> > > and faster as the user does not need to pass a set of fds and the kernel
> > > does not need to search for the device within the given fds.
> >
> > a device in noiommu mode doesn't have a valid iommufd, so this method
> > should not be used in a dev_set which contains multiple devices and one
> > of them is in noiommu. The only allowed noiommu scenario is that the
> > calling device is noiommu and it's in a singleton dev_set.
> >
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++++++++-----
> >  include/uapi/linux/vfio.h        |  9 ++++++-
> >  2 files changed, 44 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 3696b8e58445..b68fcba67a4b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct
> vfio_pci_core_device *vdev)
> >  struct vfio_pci_group_info;
> >  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> >  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> > -				      struct vfio_pci_group_info *groups);
> > +				      struct vfio_pci_group_info *groups,
> > +				      struct iommufd_ctx *iommufd_ctx);
> >
> >  /*
> >   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> > @@ -1277,7 +1278,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct
> vfio_pci_core_device *vdev,
> >  		return ret;
> >
> >  	/* Somewhere between 1 and count is OK */
> > -	if (!hdr->count || hdr->count > count)
> > +	if (hdr->count > count)
> >  		return -EINVAL;
> >
> >  	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> > @@ -1326,7 +1327,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct
> vfio_pci_core_device *vdev,
> >  	info.count = hdr->count;
> >  	info.files = files;
> >
> > -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> > +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
> >
> >  hot_reset_release:
> >  	for (file_idx--; file_idx >= 0; file_idx--)
> > @@ -1341,6 +1342,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  {
> >  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> >  	struct vfio_pci_hot_reset hdr;
> > +	struct iommufd_ctx *iommufd;
> >  	bool slot = false;
> >
> >  	if (copy_from_user(&hdr, arg, minsz))
> > @@ -1355,7 +1357,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >  	else if (pci_probe_reset_bus(vdev->pdev->bus))
> >  		return -ENODEV;
> >
> > -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> > +	if (hdr.count)
> > +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> > +
> > +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +
> > +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
> >  }
> >
> >  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> > @@ -2327,6 +2334,9 @@ static bool vfio_dev_in_groups(struct
> vfio_pci_core_device *vdev,
> >  {
> >  	unsigned int i;
> >
> > +	if (!groups)
> > +		return false;
> > +
> >  	for (i = 0; i < groups->count; i++)
> >  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> >  			return true;
> > @@ -2402,13 +2412,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct
> vfio_device_set *dev_set)
> >  	return ret;
> >  }
> >
> > +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> > +				    struct iommufd_ctx *iommufd_ctx)
> > +{
> > +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +
> > +	if (!iommufd)
> > +		return false;
> > +
> > +	return iommufd == iommufd_ctx;
> > +}
> > +
> >  /*
> >   * We need to get memory_lock for each device, but devices can share mmap_lock,
> >   * therefore we need to zap and hold the vma_lock for each device, and only then
> >   * get each memory_lock.
> >   */
> >  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> > -				      struct vfio_pci_group_info *groups)
> > +				      struct vfio_pci_group_info *groups,
> > +				      struct iommufd_ctx *iommufd_ctx)
> >  {
> >  	struct vfio_pci_core_device *cur_mem;
> >  	struct vfio_pci_core_device *cur_vma;
> > @@ -2448,9 +2470,17 @@ static int vfio_pci_dev_set_hot_reset(struct
> vfio_device_set *dev_set,
> >  		 *
> >  		 * Otherwise all opened devices in the dev_set must be
> >  		 * contained by the set of groups provided by the user.
> > +		 *
> > +		 * If user provides a zero-length array, then all the
> > +		 * opened devices must be bound to a same iommufd_ctx.
> > +		 *
> > +		 * If all above checks are failed, reset is allowed only if
> > +		 * the calling device is in a singleton dev_set.
> >  		 */
> >  		if (cur_vma->vdev.open_count &&
> > -		    !vfio_dev_in_groups(cur_vma, groups)) {
> > +		    !vfio_dev_in_groups(cur_vma, groups) &&
> > +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
> > +		    (dev_set->device_count > 1)) {
> 
> This last condition looks buggy to me, we need all conditions to be
> true to generate an error here, which means that for a singleton
> dev_set, it doesn't matter what group fds are passed, if any, or whether
> the iommufd context matches.  I think in fact this means that the empty
> array path is equally available for group use cases with a singleton
> dev_set, but we don't enable it for multiple device dev_sets like we do
> iommufd.

You are right. The last condition allows the empty-fd array path to
work for the group use case if the dev_set happens to be a singleton.
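
To make the effect concrete, here is a minimal user-space model of that
condition (a sketch, not kernel code). `struct model_dev` and
`v3_rejects_reset()` are hypothetical names, and the two booleans stand in
for the results of vfio_dev_in_groups() and vfio_dev_in_iommufd_ctx():

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the check looks at; not a kernel type. */
struct model_dev {
	unsigned int open_count;  /* non-zero when some user has the device open */
	bool in_groups;           /* modelled result of the group fd lookup */
	bool in_iommufd_ctx;      /* modelled result of the iommufd comparison */
};

/*
 * Mirrors the shape of the v3 condition: returns true when the reset
 * would be rejected with -EINVAL. All four terms must hold for an error.
 */
static bool v3_rejects_reset(const struct model_dev *dev,
			     unsigned int device_count)
{
	return dev->open_count &&
	       !dev->in_groups &&
	       !dev->in_iommufd_ctx &&
	       device_count > 1;
}
```

With device_count == 1 the function never reports an error, whatever the
group or iommufd lookups returned, which is the bypass described above.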

> 
> You pointed out a previous issue with hot-reset info and no-iommu where
> if other affected devices are not bound to vfio-pci the info ioctl
> returns error.  That's handled in the hot-reset ioctl by the fact that
> all affected devices must be in the dev_set and therefore bound to
> vfio-pci drivers. 

Yes, the hot-reset ioctl requires all affected devices to be in the dev_set.
So if some devices are not bound to vfio yet, the hot-reset ioctl just
fails. If all affected devices are in the dev_set, they will have a
fake group allocated by vfio, so the info ioctl won't fail.

> So it seems to me that aside from the spurious error
> because we can't report an iommu group when none exists, and didn't
> spot it to invent an invalid group for debugging, hot-reset otherwise
> works with no-iommu just like it does for iommu backed devices.  We
> don't currently require singleton no-iommu dev_sets afaict.

Yes. The requirement for hot-reset is the same for no-iommu and
iommufd backed devices.

> I'll also note that if the dev_set is singleton, this suggests that
> pci_reset_function() can make use of bus reset, so a hot-reset is
> accessible via VFIO_DEVICE_RESET if the appropriate reset method is
> selected.

Yes. So does that mean it is not necessary to support singleton dev_sets
in the hot-reset ioctl? If the user uses hot-reset, it should be because
VFIO_DEVICE_RESET cannot be used, right?

> 
> Therefore, I think as written, the singleton dev_set hot-reset is
> enabled for iommufd and (unintentionally?) for the group path, while
> also negating a requirement for a group fd or that a provided group fd
> actually matches the device in this latter case.  The null-array
> approach is not however extended to groups for more general use.
> Additionally, limiting no-iommu hot-reset to singleton dev_sets
> provides only a marginal functional difference vs VFIO_DEVICE_RESET.

I think the singleton dev_set hot-reset is for iommufd (or more accurately
for the noiommu case in the cdev path).

> Thanks,
> 
> Alex
> 
> >  			ret = -EINVAL;
> >  			goto err_undo;
> >  		}
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index f96e5689cffc..17aa5d09db41 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -679,7 +679,14 @@ struct vfio_pci_hot_reset_info {
> >   * the calling user must ensure all affected devices, if opened, are
> >   * owned by itself.
> >   *
> > - * The ownership is proved by an array of group fds.
> > + * The ownership can be proved by:
> > + *   - An array of group fds
> > + *   - A zero-length array
> > + *
> > + * In the last case all affected devices which are opened by this user
> > + * must have been bound to a same iommufd. If the calling device is in
> > + * noiommu mode (no valid iommufd) then it can be reset only if the reset
> > + * doesn't affect other devices.
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-05  7:55     ` Liu, Yi L
@ 2023-04-05  8:01       ` Liu, Yi L
  2023-04-05 15:36         ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05  8:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, April 5, 2023 3:55 PM
 
> >
> > Therefore, I think as written, the singleton dev_set hot-reset is
> > enabled for iommufd and (unintentionally?) for the group path, while
> > also negating a requirement for a group fd or that a provided group fd
> > actually matches the device in this latter case.  The null-array
> > approach is not however extended to groups for more general use.
> > Additionally, limiting no-iommu hot-reset to singleton dev_sets
> > provides only a marginal functional difference vs VFIO_DEVICE_RESET.
> 
> I think the singleton dev_set hot-reset is for iommufd (or more accurately
> for the noiommu case in the cdev path).

But actually, singleton dev_set hot-reset can work for the group path as well.
Based on this, I'm also wondering whether we really want singleton dev_set
hot-reset only for the cdev noiommu case. Or should we allow it generally, or
just not support it, since it is equivalent to VFIO_DEVICE_RESET?

If we don't support singleton dev_set hot-reset, noiommu devices in the cdev
path will fail the hot-reset if an empty-fd array is provided. But we may just
document that the empty-fd array does not work for noiommu, and that the user
should use the device fd array instead.
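
For reference, the two request shapes discussed here could be built from
userspace roughly as follows. This is a sketch under assumptions:
`struct hot_reset_req` is a local mirror of the argsz/flags/count/fd-array
layout of struct vfio_pci_hot_reset, and `hot_reset_req_alloc()` is a
hypothetical helper. Real code would include <linux/vfio.h> and pass the
buffer to the VFIO_DEVICE_PCI_HOT_RESET ioctl.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the uAPI layout, for illustration only. */
struct hot_reset_req {
	uint32_t argsz;   /* total size of the request in bytes */
	uint32_t flags;
	uint32_t count;   /* number of fds that follow; 0 selects the empty-array path */
	int32_t fds[];    /* group fds today, device fds as proposed in this series */
};

/*
 * Allocate and fill a request. With count == 0 only the header is
 * allocated and fds is not read. Returns NULL on allocation failure;
 * the caller frees the result with free().
 */
static struct hot_reset_req *hot_reset_req_alloc(const int32_t *fds,
						 uint32_t count)
{
	size_t sz = sizeof(struct hot_reset_req) +
		    (size_t)count * sizeof(int32_t);
	struct hot_reset_req *req = calloc(1, sz);

	if (!req)
		return NULL;

	req->argsz = (uint32_t)sz;
	req->count = count;
	if (count)
		memcpy(req->fds, fds, (size_t)count * sizeof(int32_t));
	return req;
}
```

Calling hot_reset_req_alloc(NULL, 0) gives the zero-length form; passing an
array of device fds gives the form suggested above for noiommu.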

Regards,
Yi Liu



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-04 20:18   ` Alex Williamson
  2023-04-05  7:55     ` Liu, Yi L
@ 2023-04-05  8:02     ` Eric Auger
  2023-04-05  8:09       ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-05  8:02 UTC (permalink / raw)
  To: Alex Williamson, Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, yanting.jiang, joro, nicolinc, jgg,
	yan.y.zhao, intel-gfx, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy



On 4/4/23 22:18, Alex Williamson wrote:
> On Sat,  1 Apr 2023 07:44:22 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
>
>> as an alternative method for ownership check when iommufd is used. In
>> this case all opened devices in the affected dev_set are verified to
>> be bound to a same valid iommufd value to allow reset. It's simpler
>> and faster as user does not need to pass a set of fds and kernel no
>> need to search the device within the given fds.
>>
>> a device in noiommu mode doesn't have a valid iommufd, so this method
>> should not be used in a dev_set which contains multiple devices and one
>> of them is in noiommu. The only allowed noiommu scenario is that the
>> calling device is noiommu and it's in a singleton dev_set.
>>
>> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++++++++-----
>>  include/uapi/linux/vfio.h        |  9 ++++++-
>>  2 files changed, 44 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 3696b8e58445..b68fcba67a4b 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>>  struct vfio_pci_group_info;
>>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> -				      struct vfio_pci_group_info *groups);
>> +				      struct vfio_pci_group_info *groups,
>> +				      struct iommufd_ctx *iommufd_ctx);
>>  
>>  /*
>>   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
>> @@ -1277,7 +1278,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>>  		return ret;
>>  
>>  	/* Somewhere between 1 and count is OK */
>> -	if (!hdr->count || hdr->count > count)
>> +	if (hdr->count > count)
>>  		return -EINVAL;
>>  
>>  	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
>> @@ -1326,7 +1327,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>>  	info.count = hdr->count;
>>  	info.files = files;
>>  
>> -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
>> +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
>>  
>>  hot_reset_release:
>>  	for (file_idx--; file_idx >= 0; file_idx--)
>> @@ -1341,6 +1342,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>>  {
>>  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
>>  	struct vfio_pci_hot_reset hdr;
>> +	struct iommufd_ctx *iommufd;
>>  	bool slot = false;
>>  
>>  	if (copy_from_user(&hdr, arg, minsz))
>> @@ -1355,7 +1357,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>>  	else if (pci_probe_reset_bus(vdev->pdev->bus))
>>  		return -ENODEV;
>>  
>> -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
>> +	if (hdr.count)
>> +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
>> +
>> +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
>> +
>> +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
>>  }
>>  
>>  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
>> @@ -2327,6 +2334,9 @@ static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
>>  {
>>  	unsigned int i;
>>  
>> +	if (!groups)
>> +		return false;
>> +
>>  	for (i = 0; i < groups->count; i++)
>>  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
>>  			return true;
>> @@ -2402,13 +2412,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
>>  	return ret;
>>  }
>>  
>> +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
>> +				    struct iommufd_ctx *iommufd_ctx)
>> +{
>> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
>> +
>> +	if (!iommufd)
>> +		return false;
>> +
>> +	return iommufd == iommufd_ctx;
>> +}
>> +
>>  /*
>>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>>   * therefore we need to zap and hold the vma_lock for each device, and only then
>>   * get each memory_lock.
>>   */
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>> -				      struct vfio_pci_group_info *groups)
>> +				      struct vfio_pci_group_info *groups,
>> +				      struct iommufd_ctx *iommufd_ctx)
>>  {
>>  	struct vfio_pci_core_device *cur_mem;
>>  	struct vfio_pci_core_device *cur_vma;
>> @@ -2448,9 +2470,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  		 *
>>  		 * Otherwise all opened devices in the dev_set must be
>>  		 * contained by the set of groups provided by the user.
>> +		 *
>> +		 * If user provides a zero-length array, then all the
>> +		 * opened devices must be bound to a same iommufd_ctx.
>> +		 *
>> +		 * If all above checks are failed, reset is allowed only if
>> +		 * the calling device is in a singleton dev_set.
>>  		 */
>>  		if (cur_vma->vdev.open_count &&
>> -		    !vfio_dev_in_groups(cur_vma, groups)) {
>> +		    !vfio_dev_in_groups(cur_vma, groups) &&
>> +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
>> +		    (dev_set->device_count > 1)) {
> This last condition looks buggy to me, we need all conditions to be
> true to generate an error here, which means that for a singleton
> dev_set, it doesn't matter what group fds are passed, if any, or whether
> the iommufd context matches.  I think in fact this means that the empty
> array path is equally available for group use cases with a singleton
> dev_set, but we don't enable it for multiple device dev_sets like we do
> iommufd.
>
> You pointed out a previous issue with hot-reset info and no-iommu where
> if other affected devices are not bound to vfio-pci the info ioctl
> returns error.  That's handled in the hot-reset ioctl by the fact that
> all affected devices must be in the dev_set and therefore bound to
> vfio-pci drivers.  So it seems to me that aside from the spurious error
> because we can't report an iommu group when none exists, and didn't
> spot it to invent an invalid group for debugging, hot-reset otherwise
> works with no-iommu just like it does for iommu backed devices.  We
> don't currently require singleton no-iommu dev_sets afaict.
>
> I'll also note that if the dev_set is singleton, this suggests that
> pci_reset_function() can make use of bus reset, so a hot-reset is
> accessible via VFIO_DEVICE_RESET if the appropriate reset method is
> selected.
>
> Therefore, I think as written, the singleton dev_set hot-reset is
> enabled for iommufd and (unintentionally?) for the group path, while
> also negating a requirement for a group fd or that a provided group fd
> actually matches the device in this latter case.  The null-array
> approach is not however extended to groups for more general use.
> Additionally, limiting no-iommu hot-reset to singleton dev_sets
> provides only a marginal functional difference vs VFIO_DEVICE_RESET.
> Thanks,
>
> Alex
What about introducing a helper?
static bool is_reset_ok(pdev, groups, ctx) {
    if (!pdev->vdev.open_count)
        return true;
    if (groups && vfio_dev_in_groups(pdev, groups))
        return true;
    if (ctx && vfio_dev_in_iommufd_ctx(pdev, ctx))
        return true;
    return false;
}

Assuming the above logic is correct, I think this would make the code
more readable.
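
A compilable user-space model of such a helper, useful only for checking
its truth table. All types here (`struct model_dev`, `struct model_groups`,
`struct model_ctx`) and the two lookup functions are hypothetical stand-ins
for illustration, not the kernel structures or the real vfio helpers:

```c
#include <stdbool.h>
#include <stddef.h>

struct model_ctx { int unused; };      /* stand-in for struct iommufd_ctx */

struct model_dev {
	unsigned int open_count;       /* non-zero when the device is opened */
	int group_id;                  /* group the device belongs to */
	const struct model_ctx *ctx;   /* NULL when not bound to an iommufd */
};

struct model_groups {
	size_t count;
	const int *ids;                /* groups the user proved ownership of */
};

static bool dev_in_groups(const struct model_dev *dev,
			  const struct model_groups *groups)
{
	for (size_t i = 0; i < groups->count; i++)
		if (groups->ids[i] == dev->group_id)
			return true;
	return false;
}

static bool dev_in_ctx(const struct model_dev *dev,
		       const struct model_ctx *ctx)
{
	return dev->ctx && dev->ctx == ctx;
}

/* Same decision order as the proposed helper. */
static bool is_reset_ok(const struct model_dev *dev,
			const struct model_groups *groups,
			const struct model_ctx *ctx)
{
	if (!dev->open_count)
		return true;           /* unopened devices are always safe */
	if (groups && dev_in_groups(dev, groups))
		return true;           /* ownership proved by fd array */
	if (ctx && dev_in_ctx(dev, ctx))
		return true;           /* ownership proved by shared iommufd */
	return false;
}
```

In this model an opened device passes only through one of the two
ownership proofs; there is no dev_set size term that could bypass them.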

Thanks

Eric
>>  			ret = -EINVAL;
>>  			goto err_undo;
>>  		}
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index f96e5689cffc..17aa5d09db41 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -679,7 +679,14 @@ struct vfio_pci_hot_reset_info {
>>   * the calling user must ensure all affected devices, if opened, are
>>   * owned by itself.
>>   *
>> - * The ownership is proved by an array of group fds.
>> + * The ownership can be proved by:
>> + *   - An array of group fds
>> + *   - A zero-length array
>> + *
>> + * In the last case all affected devices which are opened by this user
>> + * must have been bound to a same iommufd. If the calling device is in
>> + * noiommu mode (no valid iommufd) then it can be reset only if the reset
>> + * doesn't affect other devices.
>>   *
>>   * Return: 0 on success, -errno on failure.
>>   */



* Re: [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path Yi Liu
  2023-04-04 20:31   ` Alex Williamson
@ 2023-04-05  8:07   ` Eric Auger
  2023-04-05  8:10     ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-05  8:07 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi,

On 4/1/23 16:44, Yi Liu wrote:
> This extends both vfio_file_is_valid() and vfio_file_has_dev() to accept
> device file from the vfio PCI hot reset.
typo in the title s/Accpet/Accept
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/vfio_main.c | 23 +++++++++++++++++++----
>  1 file changed, 19 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index fe7446805afd..ebbb6b91a498 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1154,13 +1154,23 @@ const struct file_operations vfio_device_fops = {
>  	.mmap		= vfio_device_fops_mmap,
>  };
>  
> +static struct vfio_device *vfio_device_from_file(struct file *file)
> +{
> +	struct vfio_device *device = file->private_data;
> +
> +	if (file->f_op != &vfio_device_fops)
> +		return NULL;
> +	return device;
> +}
> +
>  /**
>   * vfio_file_is_valid - True if the file is valid vfio file
>   * @file: VFIO group file or VFIO device file
>   */
>  bool vfio_file_is_valid(struct file *file)
>  {
> -	return vfio_group_from_file(file);
> +	return vfio_group_from_file(file) ||
> +	       vfio_device_from_file(file);
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_is_valid);
>  
> @@ -1174,12 +1184,17 @@ EXPORT_SYMBOL_GPL(vfio_file_is_valid);
>  bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
>  {
>  	struct vfio_group *group;
> +	struct vfio_device *vdev;
>  
>  	group = vfio_group_from_file(file);
> -	if (!group)
> -		return false;
> +	if (group)
> +		return vfio_group_has_dev(group, device);
> +
> +	vdev = vfio_device_from_file(file);
> +	if (vdev)
> +		return vdev == device;
>  
> -	return vfio_group_has_dev(group, device);
> +	return false;
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_has_dev);
>  
With Alex' suggestion
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-05  8:02     ` Eric Auger
@ 2023-04-05  8:09       ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05  8:09 UTC (permalink / raw)
  To: eric.auger, Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 4:02 PM 
> 
> On 4/4/23 22:18, Alex Williamson wrote:
> > On Sat,  1 Apr 2023 07:44:22 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >
> >> as an alternative method for ownership check when iommufd is used. In
> >> this case all opened devices in the affected dev_set are verified to
> >> be bound to a same valid iommufd value to allow reset. It's simpler
> >> and faster as user does not need to pass a set of fds and kernel no
> >> need to search the device within the given fds.
> >>
> >> a device in noiommu mode doesn't have a valid iommufd, so this method
> >> should not be used in a dev_set which contains multiple devices and one
> >> of them is in noiommu. The only allowed noiommu scenario is that the
> >> calling device is noiommu and it's in a singleton dev_set.
> >>
> >> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> >> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++++++++-----
> >>  include/uapi/linux/vfio.h        |  9 ++++++-
> >>  2 files changed, 44 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index 3696b8e58445..b68fcba67a4b 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct
> vfio_pci_core_device *vdev)
> >>  struct vfio_pci_group_info;
> >>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> >>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >> -				      struct vfio_pci_group_info *groups);
> >> +				      struct vfio_pci_group_info *groups,
> >> +				      struct iommufd_ctx *iommufd_ctx);
> >>
> >>  /*
> >>   * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
> >> @@ -1277,7 +1278,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct
> vfio_pci_core_device *vdev,
> >>  		return ret;
> >>
> >>  	/* Somewhere between 1 and count is OK */
> >> -	if (!hdr->count || hdr->count > count)
> >> +	if (hdr->count > count)
> >>  		return -EINVAL;
> >>
> >>  	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> >> @@ -1326,7 +1327,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct
> vfio_pci_core_device *vdev,
> >>  	info.count = hdr->count;
> >>  	info.files = files;
> >>
> >> -	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
> >> +	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
> >>
> >>  hot_reset_release:
> >>  	for (file_idx--; file_idx >= 0; file_idx--)
> >> @@ -1341,6 +1342,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >>  {
> >>  	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
> >>  	struct vfio_pci_hot_reset hdr;
> >> +	struct iommufd_ctx *iommufd;
> >>  	bool slot = false;
> >>
> >>  	if (copy_from_user(&hdr, arg, minsz))
> >> @@ -1355,7 +1357,12 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
> >>  	else if (pci_probe_reset_bus(vdev->pdev->bus))
> >>  		return -ENODEV;
> >>
> >> -	return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> >> +	if (hdr.count)
> >> +		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> >> +
> >> +	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> >> +
> >> +	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
> >>  }
> >>
> >>  static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
> >> @@ -2327,6 +2334,9 @@ static bool vfio_dev_in_groups(struct
> vfio_pci_core_device *vdev,
> >>  {
> >>  	unsigned int i;
> >>
> >> +	if (!groups)
> >> +		return false;
> >> +
> >>  	for (i = 0; i < groups->count; i++)
> >>  		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> >>  			return true;
> >> @@ -2402,13 +2412,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct
> vfio_device_set *dev_set)
> >>  	return ret;
> >>  }
> >>
> >> +static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
> >> +				    struct iommufd_ctx *iommufd_ctx)
> >> +{
> >> +	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> >> +
> >> +	if (!iommufd)
> >> +		return false;
> >> +
> >> +	return iommufd == iommufd_ctx;
> >> +}
> >> +
> >>  /*
> >>   * We need to get memory_lock for each device, but devices can share mmap_lock,
> >>   * therefore we need to zap and hold the vma_lock for each device, and only then
> >>   * get each memory_lock.
> >>   */
> >>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >> -				      struct vfio_pci_group_info *groups)
> >> +				      struct vfio_pci_group_info *groups,
> >> +				      struct iommufd_ctx *iommufd_ctx)
> >>  {
> >>  	struct vfio_pci_core_device *cur_mem;
> >>  	struct vfio_pci_core_device *cur_vma;
> >> @@ -2448,9 +2470,17 @@ static int vfio_pci_dev_set_hot_reset(struct
> vfio_device_set *dev_set,
> >>  		 *
> >>  		 * Otherwise all opened devices in the dev_set must be
> >>  		 * contained by the set of groups provided by the user.
> >> +		 *
> >> +		 * If user provides a zero-length array, then all the
> >> +		 * opened devices must be bound to a same iommufd_ctx.
> >> +		 *
> >> +		 * If all above checks are failed, reset is allowed only if
> >> +		 * the calling device is in a singleton dev_set.
> >>  		 */
> >>  		if (cur_vma->vdev.open_count &&
> >> -		    !vfio_dev_in_groups(cur_vma, groups)) {
> >> +		    !vfio_dev_in_groups(cur_vma, groups) &&
> >> +		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
> >> +		    (dev_set->device_count > 1)) {
> > This last condition looks buggy to me, we need all conditions to be
> > true to generate an error here, which means that for a singleton
> > dev_set, it doesn't matter what group fds are passed, if any, or whether
> > the iommufd context matches.  I think in fact this means that the empty
> > array path is equally available for group use cases with a singleton
> > dev_set, but we don't enable it for multiple device dev_sets like we do
> > iommufd.
> >
> > You pointed out a previous issue with hot-reset info and no-iommu where
> > if other affected devices are not bound to vfio-pci the info ioctl
> > returns error.  That's handled in the hot-reset ioctl by the fact that
> > all affected devices must be in the dev_set and therefore bound to
> > vfio-pci drivers.  So it seems to me that aside from the spurious error
> > because we can't report an iommu group when none exists, and didn't
> > spot it to invent an invalid group for debugging, hot-reset otherwise
> > works with no-iommu just like it does for iommu backed devices.  We
> > don't currently require singleton no-iommu dev_sets afaict.
> >
> > I'll also note that if the dev_set is singleton, this suggests that
> > pci_reset_function() can make use of bus reset, so a hot-reset is
> > accessible via VFIO_DEVICE_RESET if the appropriate reset method is
> > selected.
> >
> > Therefore, I think as written, the singleton dev_set hot-reset is
> > enabled for iommufd and (unintentionally?) for the group path, while
> > also negating a requirement for a group fd or that a provided group fd
> > actually matches the device in this latter case.  The null-array
> > approach is not however extended to groups for more general use.
> > Additionally, limiting no-iommu hot-reset to singleton dev_sets
> > provides only a marginal functional difference vs VFIO_DEVICE_RESET.
> > Thanks,
> >
> > Alex
> What about introducing a helper?
> static bool is_reset_ok(pdev, groups, ctx) {
>     if (!pdev->vdev.open_count)
>         return true;
>     if (groups && vfio_dev_in_groups(pdev, groups))
>         return true;
>     if (ctx && vfio_dev_in_iommufd_ctx(pdev, ctx))
>         return true;
>     return false;
> }
> 
> Assuming the above logic is correct, I think this would make the code
> more readable.

This logic may fail noiommu devices in the cdev path, as the cdev path
binds those devices with iommufd==-1, so the ctx would be NULL. That is
why we agreed to allow the reset if the dev_set is a singleton; details
can be found in the paragraph quoted below. As I replied in another
email, maybe this singleton support can be dropped, since a singleton
dev_set may just be reset with VFIO_DEVICE_RESET. Alex may correct me
if userspace is not so intelligent.

"However the iommufd method has difficulty working with noiommu devices
since those devices don't have a valid iommufd, unless the noiommu device
is in a singleton dev_set hence no ownership check is required. [3]

[3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/"
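
The noiommu failure can be shown with a small user-space model (a sketch,
not kernel code). `struct model_dev`, `ctx_check_ok()` and
`ctx_check_ok_singleton()` are hypothetical names. The second function is
only one possible way to scope a singleton fallback to the no-context
case; it is an assumption for illustration and not what the series does:

```c
#include <stdbool.h>
#include <stddef.h>

struct model_ctx { int unused; };      /* stand-in for struct iommufd_ctx */

struct model_dev {
	unsigned int open_count;
	const struct model_ctx *ctx;   /* NULL for a noiommu device */
};

/* iommufd-only ownership check, as in the zero-length array path. */
static bool ctx_check_ok(const struct model_dev *dev,
			 const struct model_ctx *ctx)
{
	if (!dev->open_count)
		return true;
	return ctx && dev->ctx == ctx;
}

/*
 * Hypothetical variant: fall back to the singleton rule only when the
 * caller has no iommufd context at all (the noiommu cdev case).
 */
static bool ctx_check_ok_singleton(const struct model_dev *dev,
				   const struct model_ctx *ctx,
				   unsigned int device_count)
{
	if (ctx_check_ok(dev, ctx))
		return true;
	return !ctx && device_count == 1;
}
```

An opened noiommu device with a NULL context fails the plain check even
when it is alone in its dev_set, while the variant admits only that case
and still rejects a mismatched context.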

Regards,
Yi Liu

> Thanks
> 
> Eric
> >>  			ret = -EINVAL;
> >>  			goto err_undo;
> >>  		}
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index f96e5689cffc..17aa5d09db41 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -679,7 +679,14 @@ struct vfio_pci_hot_reset_info {
> >>   * the calling user must ensure all affected devices, if opened, are
> >>   * owned by itself.
> >>   *
> >> - * The ownership is proved by an array of group fds.
> >> + * The ownership can be proved by:
> >> + *   - An array of group fds
> >> + *   - A zero-length array
> >> + *
> >> + * In the last case all affected devices which are opened by this user
> >> + * must have been bound to a same iommufd. If the calling device is in
> >> + * noiommu mode (no valid iommufd) then it can be reset only if the reset
> >> + * doesn't affect other devices.
> >>   *
> >>   * Return: 0 on success, -errno on failure.
> >>   */



* Re: [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path
  2023-04-05  8:07   ` Eric Auger
@ 2023-04-05  8:10     ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05  8:10 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 4:08 PM
> 
> Hi Yi,
> 
> On 4/1/23 16:44, Yi Liu wrote:
> > This extends both vfio_file_is_valid() and vfio_file_has_dev() to accept
> > device file from the vfio PCI hot reset.
> typo in the title s/Accpet/Accept

Thanks, will correct it.

> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/vfio_main.c | 23 +++++++++++++++++++----
> >  1 file changed, 19 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > index fe7446805afd..ebbb6b91a498 100644
> > --- a/drivers/vfio/vfio_main.c
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -1154,13 +1154,23 @@ const struct file_operations vfio_device_fops = {
> >  	.mmap		= vfio_device_fops_mmap,
> >  };
> >
> > +static struct vfio_device *vfio_device_from_file(struct file *file)
> > +{
> > +	struct vfio_device *device = file->private_data;
> > +
> > +	if (file->f_op != &vfio_device_fops)
> > +		return NULL;
> > +	return device;
> > +}
> > +
> >  /**
> >   * vfio_file_is_valid - True if the file is valid vfio file
> >   * @file: VFIO group file or VFIO device file
> >   */
> >  bool vfio_file_is_valid(struct file *file)
> >  {
> > -	return vfio_group_from_file(file);
> > +	return vfio_group_from_file(file) ||
> > +	       vfio_device_from_file(file);
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_file_is_valid);
> >
> > @@ -1174,12 +1184,17 @@ EXPORT_SYMBOL_GPL(vfio_file_is_valid);
> >  bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> >  {
> >  	struct vfio_group *group;
> > +	struct vfio_device *vdev;
> >
> >  	group = vfio_group_from_file(file);
> > -	if (!group)
> > -		return false;
> > +	if (group)
> > +		return vfio_group_has_dev(group, device);
> > +
> > +	vdev = vfio_device_from_file(file);
> > +	if (vdev)
> > +		return vdev == device;
> >
> > -	return vfio_group_has_dev(group, device);
> > +	return false;
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_file_has_dev);
> >
> With Alex' suggestion
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Eric



* Re: [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset Yi Liu
@ 2023-04-05  8:27   ` Eric Auger
  2023-04-05  9:23     ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-05  8:27 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi,
On 4/1/23 16:44, Yi Liu wrote:
> This prepares vfio core to accept vfio device file from the vfio PCI
> hot reset path. vfio_file_is_group() is still kept for KVM usage.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/group.c             | 32 ++++++++++++++------------------
>  drivers/vfio/pci/vfio_pci_core.c |  4 ++--
>  drivers/vfio/vfio.h              |  2 ++
>  drivers/vfio/vfio_main.c         | 29 +++++++++++++++++++++++++++++
>  include/linux/vfio.h             |  1 +
>  5 files changed, 48 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
> index 27d5ba7cf9dc..d0c95d033605 100644
> --- a/drivers/vfio/group.c
> +++ b/drivers/vfio/group.c
> @@ -745,6 +745,15 @@ bool vfio_device_has_container(struct vfio_device *device)
>  	return device->group->container;
>  }
>  
> +struct vfio_group *vfio_group_from_file(struct file *file)
> +{
> +	struct vfio_group *group = file->private_data;
> +
> +	if (file->f_op != &vfio_group_fops)
> +		return NULL;
> +	return group;
> +}
> +
>  /**
>   * vfio_file_iommu_group - Return the struct iommu_group for the vfio group file
>   * @file: VFIO group file
> @@ -755,13 +764,13 @@ bool vfio_device_has_container(struct vfio_device *device)
>   */
>  struct iommu_group *vfio_file_iommu_group(struct file *file)
>  {
> -	struct vfio_group *group = file->private_data;
> +	struct vfio_group *group = vfio_group_from_file(file);
>  	struct iommu_group *iommu_group = NULL;
>  
>  	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
>  		return NULL;
>  
> -	if (!vfio_file_is_group(file))
> +	if (!group)
>  		return NULL;
>  
>  	mutex_lock(&group->group_lock);
> @@ -775,12 +784,12 @@ struct iommu_group *vfio_file_iommu_group(struct file *file)
>  EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>  
>  /**
> - * vfio_file_is_group - True if the file is usable with VFIO aPIS
> + * vfio_file_is_group - True if the file is a vfio group file
>   * @file: VFIO group file
>   */
>  bool vfio_file_is_group(struct file *file)
>  {
> -	return file->f_op == &vfio_group_fops;
> +	return vfio_group_from_file(file);
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_is_group);
>  
> @@ -842,23 +851,10 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
>  
> -/**
> - * vfio_file_has_dev - True if the VFIO file is a handle for device
> - * @file: VFIO file to check
> - * @device: Device that must be part of the file
> - *
> - * Returns true if given file has permission to manipulate the given device.
> - */
> -bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> +bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device)
>  {
> -	struct vfio_group *group = file->private_data;
> -
> -	if (!vfio_file_is_group(file))
> -		return false;
> -
>  	return group == device->group;
>  }
> -EXPORT_SYMBOL_GPL(vfio_file_has_dev);
>  
>  static char *vfio_devnode(const struct device *dev, umode_t *mode)
>  {
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index b68fcba67a4b..2a510b71edcb 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1308,8 +1308,8 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  			break;
>  		}
>  
> -		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_is_group(file)) {
> +		/* Ensure the FD is a vfio FD. vfio group or vfio device */
It is a bit strange to update the comment here, and in the other places
in this patch, while vfio_file_is_valid() still only checks for a group file.
By the way, I would simply remove the comment, since it does not add much.
> +		if (!vfio_file_is_valid(file)) {
>  			fput(file);
>  			ret = -EINVAL;
>  			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 7b19c621e0e6..c0aeea24fbd6 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -84,6 +84,8 @@ void vfio_device_group_unregister(struct vfio_device *device);
>  int vfio_device_group_use_iommu(struct vfio_device *device);
>  void vfio_device_group_unuse_iommu(struct vfio_device *device);
>  void vfio_device_group_close(struct vfio_device *device);
> +struct vfio_group *vfio_group_from_file(struct file *file);
> +bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device);
>  bool vfio_device_has_container(struct vfio_device *device);
>  int __init vfio_group_init(void);
>  void vfio_group_cleanup(void);
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 89497c933490..fe7446805afd 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -1154,6 +1154,35 @@ const struct file_operations vfio_device_fops = {
>  	.mmap		= vfio_device_fops_mmap,
>  };
>  
> +/**
> + * vfio_file_is_valid - True if the file is valid vfio file
> + * @file: VFIO group file or VFIO device file
I wonder whether this should be squashed with the next patch, tbh.
> + */
> +bool vfio_file_is_valid(struct file *file)
> +{
> +	return vfio_group_from_file(file);
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_is_valid);
> +
> +/**
> + * vfio_file_has_dev - True if the VFIO file is a handle for device
> + * @file: VFIO file to check
> + * @device: Device that must be part of the file
> + *
> + * Returns true if given file has permission to manipulate the given device.
> + */
> +bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> +{
> +	struct vfio_group *group;
> +
> +	group = vfio_group_from_file(file);
> +	if (!group)
> +		return false;
> +
> +	return vfio_group_has_dev(group, device);
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_has_dev);
> +
>  /*
>   * Sub-module support
>   */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 97a1174b922f..f8fb9ab25188 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -258,6 +258,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>   */
>  struct iommu_group *vfio_file_iommu_group(struct file *file);
>  bool vfio_file_is_group(struct file *file);
> +bool vfio_file_is_valid(struct file *file);
>  bool vfio_file_enforced_coherent(struct file *file);
>  void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
Eric
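
The helper introduced in this patch, vfio_group_from_file(), identifies a file by comparing its f_op pointer and only then trusts private_data. A minimal userspace sketch of that pattern follows. All mock_* types and names are hypothetical stand-ins for the kernel structures, not the vfio implementation itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; everything here is a mock. */
struct mock_fops {
	const char *name;
};

struct mock_file {
	const struct mock_fops *f_op;
	void *private_data;
};

struct mock_group {
	int id;
};

struct mock_device {
	struct mock_group *group;
};

static const struct mock_fops mock_group_fops = { "group" };
static const struct mock_fops mock_other_fops = { "other" };

/* Return the group only if the file really is a group file. */
static struct mock_group *mock_group_from_file(struct mock_file *file)
{
	if (file->f_op != &mock_group_fops)
		return NULL;
	return file->private_data;
}

/* Type check expressed through the lookup helper. */
static bool mock_file_is_group(struct mock_file *file)
{
	return mock_group_from_file(file) != NULL;
}

/* True if the file is a group file and the device belongs to that group. */
static bool mock_file_has_dev(struct mock_file *file,
			      struct mock_device *device)
{
	struct mock_group *group = mock_group_from_file(file);

	if (!group)
		return false;
	return group == device->group;
}
```

Routing every check through one lookup helper means private_data is never dereferenced for a file of the wrong type, which is the property the refactoring in the patch appears to be after.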



* Re: [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset
  2023-04-05  8:27   ` Eric Auger
@ 2023-04-05  9:23     ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05  9:23 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 4:28 PM
> 
> Hi Yi,
> On 4/1/23 16:44, Yi Liu wrote:
> > This prepares vfio core to accept vfio device file from the vfio PCI
> > hot reset path. vfio_file_is_group() is still kept for KVM usage.
> >
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/group.c             | 32 ++++++++++++++------------------
> >  drivers/vfio/pci/vfio_pci_core.c |  4 ++--
> >  drivers/vfio/vfio.h              |  2 ++
> >  drivers/vfio/vfio_main.c         | 29 +++++++++++++++++++++++++++++
> >  include/linux/vfio.h             |  1 +
> >  5 files changed, 48 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
> > index 27d5ba7cf9dc..d0c95d033605 100644
> > --- a/drivers/vfio/group.c
> > +++ b/drivers/vfio/group.c
> > @@ -745,6 +745,15 @@ bool vfio_device_has_container(struct vfio_device *device)
> >  	return device->group->container;
> >  }
> >
> > +struct vfio_group *vfio_group_from_file(struct file *file)
> > +{
> > +	struct vfio_group *group = file->private_data;
> > +
> > +	if (file->f_op != &vfio_group_fops)
> > +		return NULL;
> > +	return group;
> > +}
> > +
> >  /**
> >   * vfio_file_iommu_group - Return the struct iommu_group for the vfio group file
> >   * @file: VFIO group file
> > @@ -755,13 +764,13 @@ bool vfio_device_has_container(struct vfio_device *device)
> >   */
> >  struct iommu_group *vfio_file_iommu_group(struct file *file)
> >  {
> > -	struct vfio_group *group = file->private_data;
> > +	struct vfio_group *group = vfio_group_from_file(file);
> >  	struct iommu_group *iommu_group = NULL;
> >
> >  	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> >  		return NULL;
> >
> > -	if (!vfio_file_is_group(file))
> > +	if (!group)
> >  		return NULL;
> >
> >  	mutex_lock(&group->group_lock);
> > @@ -775,12 +784,12 @@ struct iommu_group *vfio_file_iommu_group(struct file *file)
> >  EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
> >
> >  /**
> > - * vfio_file_is_group - True if the file is usable with VFIO aPIS
> > + * vfio_file_is_group - True if the file is a vfio group file
> >   * @file: VFIO group file
> >   */
> >  bool vfio_file_is_group(struct file *file)
> >  {
> > -	return file->f_op == &vfio_group_fops;
> > +	return vfio_group_from_file(file);
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_file_is_group);
> >
> > @@ -842,23 +851,10 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
> >
> > -/**
> > - * vfio_file_has_dev - True if the VFIO file is a handle for device
> > - * @file: VFIO file to check
> > - * @device: Device that must be part of the file
> > - *
> > - * Returns true if given file has permission to manipulate the given device.
> > - */
> > -bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> > +bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device)
> >  {
> > -	struct vfio_group *group = file->private_data;
> > -
> > -	if (!vfio_file_is_group(file))
> > -		return false;
> > -
> >  	return group == device->group;
> >  }
> > -EXPORT_SYMBOL_GPL(vfio_file_has_dev);
> >
> >  static char *vfio_devnode(const struct device *dev, umode_t *mode)
> >  {
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index b68fcba67a4b..2a510b71edcb 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1308,8 +1308,8 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
> >  			break;
> >  		}
> >
> > -		/* Ensure the FD is a vfio group FD.*/
> > -		if (!vfio_file_is_group(file)) {
> > +		/* Ensure the FD is a vfio FD. vfio group or vfio device */
> It is a bit strange to update the comment here, and in the other places
> in this patch, while vfio_file_is_valid() still only checks for a group file.
> By the way, I would simply remove the comment, since it does not add much.

OK. Yes, at this point it is still only a group file. I may just delete this comment.

> > +		if (!vfio_file_is_valid(file)) {
> >  			fput(file);
> >  			ret = -EINVAL;
> >  			break;
> > diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> > index 7b19c621e0e6..c0aeea24fbd6 100644
> > --- a/drivers/vfio/vfio.h
> > +++ b/drivers/vfio/vfio.h
> > @@ -84,6 +84,8 @@ void vfio_device_group_unregister(struct vfio_device *device);
> >  int vfio_device_group_use_iommu(struct vfio_device *device);
> >  void vfio_device_group_unuse_iommu(struct vfio_device *device);
> >  void vfio_device_group_close(struct vfio_device *device);
> > +struct vfio_group *vfio_group_from_file(struct file *file);
> > +bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device);
> >  bool vfio_device_has_container(struct vfio_device *device);
> >  int __init vfio_group_init(void);
> >  void vfio_group_cleanup(void);
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > index 89497c933490..fe7446805afd 100644
> > --- a/drivers/vfio/vfio_main.c
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -1154,6 +1154,35 @@ const struct file_operations vfio_device_fops = {
> >  	.mmap		= vfio_device_fops_mmap,
> >  };
> >
> > +/**
> > + * vfio_file_is_valid - True if the file is valid vfio file
> > + * @file: VFIO group file or VFIO device file
> I wonder whether this should be squashed with the next patch, tbh.

Yes. At this point only group files are accepted; there is no device file yet.

Thanks,
Yi Liu

> > + */
> > +bool vfio_file_is_valid(struct file *file)
> > +{
> > +	return vfio_group_from_file(file);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_file_is_valid);
> > +
> > +/**
> > + * vfio_file_has_dev - True if the VFIO file is a handle for device
> > + * @file: VFIO file to check
> > + * @device: Device that must be part of the file
> > + *
> > + * Returns true if given file has permission to manipulate the given device.
> > + */
> > +bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> > +{
> > +	struct vfio_group *group;
> > +
> > +	group = vfio_group_from_file(file);
> > +	if (!group)
> > +		return false;
> > +
> > +	return vfio_group_has_dev(group, device);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_file_has_dev);
> > +
> >  /*
> >   * Sub-module support
> >   */
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index 97a1174b922f..f8fb9ab25188 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -258,6 +258,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
> >   */
> >  struct iommu_group *vfio_file_iommu_group(struct file *file);
> >  bool vfio_file_is_group(struct file *file);
> > +bool vfio_file_is_valid(struct file *file);
> >  bool vfio_file_enforced_coherent(struct file *file);
> >  void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
> >  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> Eric



* Re: [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi
  2023-04-04 21:00   ` Alex Williamson
@ 2023-04-05  9:31     ` Liu, Yi L
  2023-04-05 15:13       ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05  9:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 5, 2023 5:01 AM
> 
> On Sat,  1 Apr 2023 07:44:28 -0700
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > as there are IOMMUFD users that want to know check if an ID generated
> > by IOMMUFD is valid or not. e.g. vfio-pci optionaly returns invalid
> > dev_id to user in the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. User
> > needs to check if the ID is valid or not.
> >
> > IOMMUFD_INVALID_ID is defined as 0 since the IDs generated by IOMMUFD
> > starts from 0.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  include/uapi/linux/iommufd.h | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> > index 98ebba80cfa1..aeae73a93833 100644
> > --- a/include/uapi/linux/iommufd.h
> > +++ b/include/uapi/linux/iommufd.h
> > @@ -9,6 +9,9 @@
> >
> >  #define IOMMUFD_TYPE (';')
> >
> > +/* IDs allocated by IOMMUFD starts from 0 */
> > +#define IOMMUFD_INVALID_ID 0
> > +
> >  /**
> >   * DOC: General ioctl format
> >   *
> 
> If allocation "starts from 0" then 0 is a valid id, no?  Does allocation
> start from 1, ie. skip 0?  Thanks,

Yes, allocation starts from 1; that is why we can use 0 as the invalid ID.

Regards,
Yi Liu
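
The point above (allocation starts from 1, so 0 is free to serve as a sentinel) can be sketched in isolation. This is a toy counter, not the iommufd allocator; MOCK_INVALID_ID and the mock_id_* names are hypothetical and only mirror the proposed IOMMUFD_INVALID_ID definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the proposed IOMMUFD_INVALID_ID; the name is a local stand-in. */
#define MOCK_INVALID_ID 0u

/* Toy allocator state: the next ID to hand out. */
struct mock_id_alloc {
	uint32_t next;
};

/* Start at 1 so that 0 is never handed out as a real ID. */
static void mock_id_alloc_init(struct mock_id_alloc *a)
{
	a->next = 1;
}

/* Return a fresh ID, or MOCK_INVALID_ID once the counter has wrapped. */
static uint32_t mock_id_alloc_get(struct mock_id_alloc *a)
{
	if (a->next == MOCK_INVALID_ID)
		return MOCK_INVALID_ID;	/* ID space exhausted */
	return a->next++;
}

/* The check a consumer of reported IDs would perform. */
static bool mock_id_is_valid(uint32_t id)
{
	return id != MOCK_INVALID_ID;
}
```

With this convention a zero-initialized ID field reads as "no ID" without any extra flag, which is the property the proposed define relies on.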


* Re: [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in hot reset path
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in " Yi Liu
  2023-04-04 21:23   ` Alex Williamson
@ 2023-04-05  9:32   ` Eric Auger
  1 sibling, 0 replies; 145+ messages in thread
From: Eric Auger @ 2023-04-05  9:32 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang



On 4/1/23 16:44, Yi Liu wrote:
> No functional change is intended.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 52 ++++++++++++++++----------------
>  1 file changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 2a510b71edcb..da6325008872 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -177,10 +177,10 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  	}
>  }
>  
> -struct vfio_pci_group_info;
> +struct vfio_pci_file_info;
>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups,
> +				      struct vfio_pci_file_info *info,
>  				      struct iommufd_ctx *iommufd_ctx);
>  
>  /*
> @@ -800,7 +800,7 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	return 0;
>  }
>  
> -struct vfio_pci_group_info {
> +struct vfio_pci_file_info {
>  	int count;
>  	struct file **files;
>  };
> @@ -1257,14 +1257,14 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  }
>  
>  static int
> -vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
> -				    struct vfio_pci_hot_reset *hdr,
> -				    bool slot,
> -				    struct vfio_pci_hot_reset __user *arg)
> +vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
> +				   struct vfio_pci_hot_reset *hdr,
> +				   bool slot,
> +				   struct vfio_pci_hot_reset __user *arg)
>  {
> -	int32_t *group_fds;
> +	int32_t *fds;
>  	struct file **files;
> -	struct vfio_pci_group_info info;
> +	struct vfio_pci_file_info info;
>  	int file_idx, count = 0, ret = 0;
>  
>  	/*
> @@ -1281,17 +1281,17 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	if (hdr->count > count)
>  		return -EINVAL;
>  
> -	group_fds = kcalloc(hdr->count, sizeof(*group_fds), GFP_KERNEL);
> +	fds = kcalloc(hdr->count, sizeof(*fds), GFP_KERNEL);
>  	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
> -	if (!group_fds || !files) {
> -		kfree(group_fds);
> +	if (!fds || !files) {
> +		kfree(fds);
>  		kfree(files);
>  		return -ENOMEM;
>  	}
>  
> -	if (copy_from_user(group_fds, arg->group_fds,
> -			   hdr->count * sizeof(*group_fds))) {
> -		kfree(group_fds);
> +	if (copy_from_user(fds, arg->group_fds,
> +			   hdr->count * sizeof(*fds))) {
> +		kfree(fds);
>  		kfree(files);
>  		return -EFAULT;
>  	}
> @@ -1301,7 +1301,7 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  	 * the reset
>  	 */
>  	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
> -		struct file *file = fget(group_fds[file_idx]);
> +		struct file *file = fget(fds[file_idx]);
>  
>  		if (!file) {
>  			ret = -EBADF;
> @@ -1318,9 +1318,9 @@ vfio_pci_ioctl_pci_hot_reset_groups(struct vfio_pci_core_device *vdev,
>  		files[file_idx] = file;
>  	}
>  
> -	kfree(group_fds);
> +	kfree(fds);
>  
> -	/* release reference to groups on error */
> +	/* release reference to fds on error */
>  	if (ret)
>  		goto hot_reset_release;
>  
> @@ -1358,7 +1358,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  		return -ENODEV;
>  
>  	if (hdr.count)
> -		return vfio_pci_ioctl_pci_hot_reset_groups(vdev, &hdr, slot, arg);
> +		return vfio_pci_ioctl_pci_hot_reset_files(vdev, &hdr, slot, arg);
>  
>  	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
>  
> @@ -2329,16 +2329,16 @@ const struct pci_error_handlers vfio_pci_core_err_handlers = {
>  };
>  EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
>  
> -static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
> -			       struct vfio_pci_group_info *groups)
> +static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
> +			      struct vfio_pci_file_info *info)
>  {
>  	unsigned int i;
>  
> -	if (!groups)
> +	if (!info)
>  		return false;
>  
> -	for (i = 0; i < groups->count; i++)
> -		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
> +	for (i = 0; i < info->count; i++)
> +		if (vfio_file_has_dev(info->files[i], &vdev->vdev))
>  			return true;
>  	return false;
>  }
> @@ -2429,7 +2429,7 @@ static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
>   * get each memory_lock.
>   */
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups,
> +				      struct vfio_pci_file_info *info,
>  				      struct iommufd_ctx *iommufd_ctx)
>  {
>  	struct vfio_pci_core_device *cur_mem;
> @@ -2478,7 +2478,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		 * the calling device is in a singleton dev_set.
>  		 */
>  		if (cur_vma->vdev.open_count &&
> -		    !vfio_dev_in_groups(cur_vma, groups) &&
> +		    !vfio_dev_in_files(cur_vma, info) &&
>  		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx) &&
>  		    (dev_set->device_count > 1)) {
>  			ret = -EINVAL;



* Re: [Intel-gfx] [PATCH v3 09/12] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 09/12] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
@ 2023-04-05  9:36   ` Eric Auger
  0 siblings, 0 replies; 145+ messages in thread
From: Eric Auger @ 2023-04-05  9:36 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang



On 4/1/23 16:44, Yi Liu wrote:
> Now user can also provide an array of device fds as a 3rd method to verify
> the reset ownership. It's not useful at this point when the device fds are
> acquired via group fds. But it's necessary when moving to device cdev which
> allows the user to directly acquire device fds by skipping group. In that
> case this method can be used as a last resort when the preferred iommufd
> verification doesn't work, e.g. in noiommu usages.
>
> Clarify it in uAPI.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 9 +++++----
>  include/uapi/linux/vfio.h        | 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index da6325008872..19f5b075d70a 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1289,7 +1289,7 @@ vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
>  		return -ENOMEM;
>  	}
>  
> -	if (copy_from_user(fds, arg->group_fds,
> +	if (copy_from_user(fds, arg->fds,
>  			   hdr->count * sizeof(*fds))) {
>  		kfree(fds);
>  		kfree(files);
> @@ -1297,8 +1297,8 @@ vfio_pci_ioctl_pci_hot_reset_files(struct vfio_pci_core_device *vdev,
>  	}
>  
>  	/*
> -	 * Get the group file for each fd to ensure the group held across
> -	 * the reset
> +	 * Get the file for each fd to ensure the group/device file
> +	 * is held across the reset
>  	 */
>  	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
>  		struct file *file = fget(fds[file_idx]);
> @@ -2469,7 +2469,8 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		 * cannot race being opened by another user simultaneously.
>  		 *
>  		 * Otherwise all opened devices in the dev_set must be
> -		 * contained by the set of groups provided by the user.
> +		 * contained by the set of groups/devices provided by
> +		 * the user.
>  		 *
>  		 * If user provides a zero-length array, then all the
>  		 * opened devices must be bound to a same iommufd_ctx.
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 17aa5d09db41..25432ef213ee 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -681,6 +681,7 @@ struct vfio_pci_hot_reset_info {
>   *
>   * The ownership can be proved by:
>   *   - An array of group fds
> + *   - An array of device fds
>   *   - A zero-length array
>   *
>   * In the last case all affected devices which are opened by this user
> @@ -694,7 +695,7 @@ struct vfio_pci_hot_reset {
>  	__u32	argsz;
>  	__u32	flags;
>  	__u32	count;
> -	__s32	group_fds[];
> +	__s32	fds[];
>  };
>  
>  #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
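
As a rough illustration of the layout discussed in this patch, the sketch below builds a request with a variable-length fd array in userspace. struct mock_hot_reset is a local mirror of the fields shown in the diff, not the real uAPI header, and mock_hot_reset_alloc() is a hypothetical helper. Setting argsz to the header size plus the array size is an assumption about the caller convention; the patch itself does not state it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the layout shown in the patch; not the real uAPI header. */
struct mock_hot_reset {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	int32_t fds[];
};

/*
 * Allocate a request carrying 'count' fds (group fds or device fds).
 * A count of zero yields the zero-length array form, in which case
 * 'fds' may be NULL. The caller frees the result with free().
 */
static struct mock_hot_reset *mock_hot_reset_alloc(const int32_t *fds,
						   uint32_t count)
{
	size_t sz = sizeof(struct mock_hot_reset) +
		    (size_t)count * sizeof(int32_t);
	struct mock_hot_reset *req = calloc(1, sz);

	if (!req)
		return NULL;

	req->argsz = (uint32_t)sz;	/* assumed: header plus fd array */
	req->count = count;
	if (count)
		memcpy(req->fds, fds, (size_t)count * sizeof(int32_t));
	return req;
}
```

The same helper covers all three ownership proofs listed in the uAPI comment: the kernel side tells group fds from device fds by inspecting each file, and a zero count selects the iommufd path.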



* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-04 15:59           ` Eric Auger
@ 2023-04-05 11:41             ` Jason Gunthorpe
  2023-04-05 15:14               ` Eric Auger
  0 siblings, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 11:41 UTC (permalink / raw)
  To: Eric Auger
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, Apr 04, 2023 at 05:59:01PM +0200, Eric Auger wrote:

> > but the hot reset shall fail as the group is not owned by the user.
> 
> sure it shall but I fail to understand if the reset fails or the device
> plug is somehow delayed until the reset completes.

It is just racy today: vfio_pci_dev_set_resettable() doesn't hold any
locks across the pci_walk_bus() check, so nothing prevents a device from
being hot plugged while it is working on the reset.

Jason


* Re: [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi Yi Liu
  2023-04-04 21:00   ` Alex Williamson
@ 2023-04-05 11:46   ` Eric Auger
  1 sibling, 0 replies; 145+ messages in thread
From: Eric Auger @ 2023-04-05 11:46 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang

Hi Yi

On 4/1/23 16:44, Yi Liu wrote:
> as there are IOMMUFD users that want to know check if an ID generated
s/want to know check/need to check/
Which type of ID?
> by IOMMUFD is valid or not. e.g. vfio-pci optionaly returns invalid
s/optionaly/optionally/
> dev_id to user in the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. User
> needs to check if the ID is valid or not.
So it is the dev_id, then.
>
> IOMMUFD_INVALID_ID is defined as 0 since the IDs generated by IOMMUFD
> starts from 0.
s/from 0/from 1/, same below
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/uapi/linux/iommufd.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 98ebba80cfa1..aeae73a93833 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -9,6 +9,9 @@
>  
>  #define IOMMUFD_TYPE (';')
>  
> +/* IDs allocated by IOMMUFD starts from 0 */
ditto
> +#define IOMMUFD_INVALID_ID 0
> +
>  /**
>   * DOC: General ioctl format
>   *
Eric



* Re: [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device Yi Liu
@ 2023-04-05 11:48   ` Eric Auger
  2023-04-21  7:06     ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-05 11:48 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang



On 4/1/23 16:44, Yi Liu wrote:
> There are users that need to check if vfio_device is opened as cdev.
> e.g. vfio-pci. This adds a flag in vfio_device, it will be set in the
> cdev path when device is opened. This is not used at this moment, but
> a preparation for vfio device cdev support.

Would it be better to squash this patch with the patch that sets cdev_opened, then?

Thanks

Eric
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  include/linux/vfio.h | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index f8fb9ab25188..d9a0770e5fc1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -62,6 +62,7 @@ struct vfio_device {
>  	struct iommufd_device *iommufd_device;
>  	bool iommufd_attached;
>  #endif
> +	bool cdev_opened;
>  };
>  
>  /**
> @@ -151,6 +152,12 @@ vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
>  	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
>  #endif
>  
> +static inline bool vfio_device_cdev_opened(struct vfio_device *device)
> +{
> +	lockdep_assert_held(&device->dev_set->lock);
> +	return device->cdev_opened;
> +}
> +
>  /**
>   * @migration_set_state: Optional callback to change the migration state for
>   *         devices that support migration. It's mandatory for



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO Yi Liu
  2023-04-03  9:25   ` Liu, Yi L
  2023-04-04 22:20   ` Alex Williamson
@ 2023-04-05 12:19   ` Eric Auger
  2023-04-05 14:04     ` Liu, Yi L
  2 siblings, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-05 12:19 UTC (permalink / raw)
  To: Yi Liu, alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	xudong.hao, peterx, yan.y.zhao, terrence.xu, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, yanting.jiang


Hi Yi,
On 4/1/23 16:44, Yi Liu wrote:
> for the users that accept device fds passed from management stacks to be
> able to figure out the host reset affected devices among the devices
> opened by the user. This is needed as such users do not have BDF (bus,
> devfn) knowledge about the devices it has opened, hence unable to use
> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> to figure out the affected devices.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>  2 files changed, 74 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 19f5b075d70a..a5a7e148dce1 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -30,6 +30,7 @@
>  #if IS_ENABLED(CONFIG_EEH)
>  #include <asm/eeh.h>
>  #endif
> +#include <uapi/linux/iommufd.h>
>  
>  #include "vfio_pci_priv.h"
>  
> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	return 0;
>  }
>  
> +static struct vfio_device *
> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> +			       struct pci_dev *pdev)
> +{
> +	struct vfio_device *cur;
> +
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		if (cur->dev == &pdev->dev)
> +			return cur;
> +	return NULL;
> +}
> +
>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  {
>  	(*(int *)data)++;
> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>  struct vfio_pci_fill_info {
>  	int max;
>  	int cur;
> +	bool require_devid;
> +	struct iommufd_ctx *iommufd;
> +	struct vfio_device_set *dev_set;
>  	struct vfio_pci_dependent_device *devices;
>  };
>  
>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_pci_fill_info *fill = data;
> +	struct vfio_device_set *dev_set = fill->dev_set;
>  	struct iommu_group *iommu_group;
> +	struct vfio_device *vdev;
> +
> +	lockdep_assert_held(&dev_set->lock);
>  
>  	if (fill->cur == fill->max)
>  		return -EAGAIN; /* Something changed, try again */
> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>  	if (!iommu_group)
>  		return -EPERM; /* Cannot reset non-isolated devices */
>  
> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	if (fill->require_devid) {
> +		/*
> +		 * Report dev_id of the devices that are opened as cdev
> +		 * and have the same iommufd with the fill->iommufd.
> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> +		 */
> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> +		if (vdev && vfio_device_cdev_opened(vdev) &&
> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
> +		else
> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> +	} else {
> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> +	}
>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>  	fill->devices[fill->cur].bus = pdev->bus->number;
>  	fill->devices[fill->cur].devfn = pdev->devfn;
> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>  		return -ENOMEM;
>  
>  	fill.devices = devices;
> +	fill.dev_set = vdev->vdev.dev_set;
>  
> +	mutex_lock(&vdev->vdev.dev_set->lock);
> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> +		fill.require_devid = true;
> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> +	}
>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>  					    &fill, slot);
> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>  
>  	/*
>  	 * If a device was removed between counting and filling, we may come up
>  	 * short of fill.max.  If a device was added, we'll have a return of
>  	 * -EAGAIN above.
>  	 */
> -	if (!ret)
> +	if (!ret) {
>  		hdr.count = fill.cur;
> +		if (fill.require_devid)
> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> +	}
>  
>  reset_info_exit:
>  	if (copy_to_user(arg, &hdr, minsz))
> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>  {
>  	struct vfio_device_set *dev_set = data;
> -	struct vfio_device *cur;
>  
> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> -		if (cur->dev == &pdev->dev)
> -			return 0;
> -	return -EBUSY;
> +	lockdep_assert_held(&dev_set->lock);
> +
> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>  }
>  
>  /*
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 25432ef213ee..5a34364e3b94 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -650,11 +650,32 @@ enum {
>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>   *					      struct vfio_pci_hot_reset_info)
>   *
> + * This command is used to query the affected devices in the hot reset for
> + * a given device.  User could use the information reported by this command
> + * to figure out the affected devices among the devices it has opened.
> + * This command always reports the segment, bus and devfn information for
> + * each affected device, and selectively report the group_id or the dev_id
> + * per the way how the device being queried is opened.
> + *	- If the device is opened via the traditional group/container manner,
> + *	  this command reports the group_id for each affected device.
> + *
> + *	- If the device is opened as a cdev, this command needs to report
s/needs to report/reports
> + *	  dev_id for each affected device and set the
> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> + *	  devices that are not opened as cdev or bound to different iommufds
> + *	  with the device that is queried, report an invalid dev_id to avoid
s/bound to different iommufds with the device that is queried/bound to
iommufds different from the reset device one?
> + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> + *	  affected devices, user shall fall back to use the segment, bus and
> + *	  devfn info to map it to opened device.
> + *
>   * Return: 0 on success, -errno on failure:
>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>   */
>  struct vfio_pci_dependent_device {
> -	__u32	group_id;
> +	union {
> +		__u32   group_id;
> +		__u32	dev_id;
> +	};
>  	__u16	segment;
>  	__u8	bus;
>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
>  struct vfio_pci_hot_reset_info {
>  	__u32	argsz;
>  	__u32	flags;
> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
>  	__u32	count;
>  	struct vfio_pci_dependent_device	devices[];
>  };
Eric



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 12:19   ` Eric Auger
@ 2023-04-05 14:04     ` Liu, Yi L
  2023-04-05 16:25       ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05 14:04 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 8:20 PM
> 
> Hi Yi,
> On 4/1/23 16:44, Yi Liu wrote:
> > for the users that accept device fds passed from management stacks to be
> > able to figure out the host reset affected devices among the devices
> > opened by the user. This is needed as such users do not have BDF (bus,
> > devfn) knowledge about the devices it has opened, hence unable to use
> > the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > to figure out the affected devices.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
> >  include/uapi/linux/vfio.h        | 24 ++++++++++++-
> >  2 files changed, 74 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 19f5b075d70a..a5a7e148dce1 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -30,6 +30,7 @@
> >  #if IS_ENABLED(CONFIG_EEH)
> >  #include <asm/eeh.h>
> >  #endif
> > +#include <uapi/linux/iommufd.h>
> >
> >  #include "vfio_pci_priv.h"
> >
> > @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct
> vfio_pci_core_device *vdev, int irq_typ
> >  	return 0;
> >  }
> >
> > +static struct vfio_device *
> > +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> > +			       struct pci_dev *pdev)
> > +{
> > +	struct vfio_device *cur;
> > +
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > +		if (cur->dev == &pdev->dev)
> > +			return cur;
> > +	return NULL;
> > +}
> > +
> >  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	(*(int *)data)++;
> > @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void
> *data)
> >  struct vfio_pci_fill_info {
> >  	int max;
> >  	int cur;
> > +	bool require_devid;
> > +	struct iommufd_ctx *iommufd;
> > +	struct vfio_device_set *dev_set;
> >  	struct vfio_pci_dependent_device *devices;
> >  };
> >
> >  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_pci_fill_info *fill = data;
> > +	struct vfio_device_set *dev_set = fill->dev_set;
> >  	struct iommu_group *iommu_group;
> > +	struct vfio_device *vdev;
> > +
> > +	lockdep_assert_held(&dev_set->lock);
> >
> >  	if (fill->cur == fill->max)
> >  		return -EAGAIN; /* Something changed, try again */
> > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> *data)
> >  	if (!iommu_group)
> >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > +	if (fill->require_devid) {
> > +		/*
> > +		 * Report dev_id of the devices that are opened as cdev
> > +		 * and have the same iommufd with the fill->iommufd.
> > +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> > +		 */
> > +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> > +		if (vdev && vfio_device_cdev_opened(vdev) &&
> > +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> > +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill-
> >cur].dev_id);
> > +		else
> > +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> > +	} else {
> > +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > +	}
> >  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
> >  	fill->devices[fill->cur].bus = pdev->bus->number;
> >  	fill->devices[fill->cur].devfn = pdev->devfn;
> > @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
> >  		return -ENOMEM;
> >
> >  	fill.devices = devices;
> > +	fill.dev_set = vdev->vdev.dev_set;
> >
> > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> > +		fill.require_devid = true;
> > +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > +	}
> >  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
> >  					    &fill, slot);
> > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> >
> >  	/*
> >  	 * If a device was removed between counting and filling, we may come up
> >  	 * short of fill.max.  If a device was added, we'll have a return of
> >  	 * -EAGAIN above.
> >  	 */
> > -	if (!ret)
> > +	if (!ret) {
> >  		hdr.count = fill.cur;
> > +		if (fill.require_devid)
> > +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> > +	}
> >
> >  reset_info_exit:
> >  	if (copy_to_user(arg, &hdr, minsz))
> > @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct
> vfio_pci_core_device *vdev,
> >  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
> >  {
> >  	struct vfio_device_set *dev_set = data;
> > -	struct vfio_device *cur;
> >
> > -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > -		if (cur->dev == &pdev->dev)
> > -			return 0;
> > -	return -EBUSY;
> > +	lockdep_assert_held(&dev_set->lock);
> > +
> > +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
> >  }
> >
> >  /*
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 25432ef213ee..5a34364e3b94 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -650,11 +650,32 @@ enum {
> >   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> >   *					      struct vfio_pci_hot_reset_info)
> >   *
> > + * This command is used to query the affected devices in the hot reset for
> > + * a given device.  User could use the information reported by this command
> > + * to figure out the affected devices among the devices it has opened.
> > + * This command always reports the segment, bus and devfn information for
> > + * each affected device, and selectively report the group_id or the dev_id
> > + * per the way how the device being queried is opened.
> > + *	- If the device is opened via the traditional group/container manner,
> > + *	  this command reports the group_id for each affected device.
> > + *
> > + *	- If the device is opened as a cdev, this command needs to report
> s/needs to report/reports

got it.

> > + *	  dev_id for each affected device and set the
> > + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> > + *	  devices that are not opened as cdev or bound to different iommufds
> > + *	  with the device that is queried, report an invalid dev_id to avoid
> s/bound to different iommufds with the device that is queried/bound to
> iommufds different from the reset device one?

Hmm, I'm not a native speaker. This _INFO ioctl queries which devices
would be affected if the user hot resets a given device, so "the queried
device" seems the better term to me. That said, I admit "the queried
device" is also "the reset device". Maybe Alex can help pick one. 😊

Regards,
Yi Liu

> > + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> > + *	  affected devices, user shall fall back to use the segment, bus and
> > + *	  devfn info to map it to opened device.
> > + *
> >   * Return: 0 on success, -errno on failure:
> >   *	-enospc = insufficient buffer, -enodev = unsupported for device.
> >   */
> >  struct vfio_pci_dependent_device {
> > -	__u32	group_id;
> > +	union {
> > +		__u32   group_id;
> > +		__u32	dev_id;
> > +	};
> >  	__u16	segment;
> >  	__u8	bus;
> >  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> > @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
> >  struct vfio_pci_hot_reset_info {
> >  	__u32	argsz;
> >  	__u32	flags;
> > +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
> >  	__u32	count;
> >  	struct vfio_pci_dependent_device	devices[];
> >  };
> Eric



* Re: [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi
  2023-04-05  9:31     ` Liu, Yi L
@ 2023-04-05 15:13       ` Alex Williamson
  2023-04-05 15:17         ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 15:13 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 09:31:39 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 5, 2023 5:01 AM
> > 
> > On Sat,  1 Apr 2023 07:44:28 -0700
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > as there are IOMMUFD users that want to check if an ID generated
> > > by IOMMUFD is valid or not. e.g. vfio-pci optionally returns invalid
> > > dev_id to user in the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. User
> > > needs to check if the ID is valid or not.
> > >
> > > IOMMUFD_INVALID_ID is defined as 0 since the IDs generated by IOMMUFD
> > > starts from 0.
> > >
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > ---
> > >  include/uapi/linux/iommufd.h | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> > > index 98ebba80cfa1..aeae73a93833 100644
> > > --- a/include/uapi/linux/iommufd.h
> > > +++ b/include/uapi/linux/iommufd.h
> > > @@ -9,6 +9,9 @@
> > >
> > >  #define IOMMUFD_TYPE (';')
> > >
> > > +/* IDs allocated by IOMMUFD starts from 0 */
> > > +#define IOMMUFD_INVALID_ID 0
> > > +
> > >  /**
> > >   * DOC: General ioctl format
> > >   *  
> > 
> > If allocation "starts from 0" then 0 is a valid id, no?  Does allocation
> > start from 1, ie. skip 0?  Thanks,  
> 
> yes, it starts from 1, that's why we can use 0 as invalid id.

So the comment is wrong, correct?



* Re: [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset
  2023-04-05 11:41             ` Jason Gunthorpe
@ 2023-04-05 15:14               ` Eric Auger
  0 siblings, 0 replies; 145+ messages in thread
From: Eric Auger @ 2023-04-05 15:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

Hi Jason,

On 4/5/23 13:41, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2023 at 05:59:01PM +0200, Eric Auger wrote:
>
>>> but the hot reset shall fail as the group is not owned by the user.
>> sure it shall but I fail to understand if the reset fails or the device
>> plug is somehow delayed until the reset completes.
> It is just racy today - vfio_pci_dev_set_resettable() doesn't hold any
> locks across the pci_walk_bus() check to prevent hot plug in while it is
> working on the reset.

OK thanks

Eric
>
> Jason
>



* Re: [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi
  2023-04-05 15:13       ` Alex Williamson
@ 2023-04-05 15:17         ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-05 15:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 5, 2023 11:13 PM
> 
> On Wed, 5 Apr 2023 09:31:39 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Wednesday, April 5, 2023 5:01 AM
> > >
> > > On Sat,  1 Apr 2023 07:44:28 -0700
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > as there are IOMMUFD users that want to check if an ID generated
> > > > by IOMMUFD is valid or not. e.g. vfio-pci optionally returns invalid
> > > > dev_id to user in the VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl. User
> > > > needs to check if the ID is valid or not.
> > > >
> > > > IOMMUFD_INVALID_ID is defined as 0 since the IDs generated by IOMMUFD
> > > > starts from 0.
> > > >
> > > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > > ---
> > > >  include/uapi/linux/iommufd.h | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> > > > index 98ebba80cfa1..aeae73a93833 100644
> > > > --- a/include/uapi/linux/iommufd.h
> > > > +++ b/include/uapi/linux/iommufd.h
> > > > @@ -9,6 +9,9 @@
> > > >
> > > >  #define IOMMUFD_TYPE (';')
> > > >
> > > > +/* IDs allocated by IOMMUFD starts from 0 */
> > > > +#define IOMMUFD_INVALID_ID 0
> > > > +
> > > >  /**
> > > >   * DOC: General ioctl format
> > > >   *
> > >
> > > If allocation "starts from 0" then 0 is a valid id, no?  Does allocation
> > > start from 1, ie. skip 0?  Thanks,
> >
> > yes, it starts from 1, that's why we can use 0 as invalid id.
> 
> So the comment is wrong, correct?

yes.

Regards
Yi Liu



* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-05  8:01       ` Liu, Yi L
@ 2023-04-05 15:36         ` Alex Williamson
  2023-04-05 16:46           ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 15:36 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 08:01:49 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, April 5, 2023 3:55 PM  
>  
> > >
> > > Therefore, I think as written, the singleton dev_set hot-reset is
> > > enabled for iommufd and (unintentionally?) for the group path, while
> > > also negating a requirement for a group fd or that a provided group fd
> > > actually matches the device in this latter case.  The null-array
> > > approach is not however extended to groups for more general use.
> > > Additionally, limiting no-iommu hot-reset to singleton dev_sets
> > > provides only a marginal functional difference vs VFIO_DEVICE_RESET.  
> > 
> > I think the singleton dev_set hot-reset is for iommufd (or more accurately
> > for the noiommu case in cdev path).  
> 
> but actually, singleton dev_set hot-reset can work for group path as well.
> Based on this, I'm also wondering do we really want to have singleton dev_set
> hot-reset only for cdev noiommu case? or we allow it generally or just
> don't support it as it is equivalent with VFIO_DEVICE_RESET?

I think you're taking the potential that VFIO_DEVICE_RESET and
hot-reset could do the same thing too far.  The former is more likely
to do an FLR, or even a PM reset.  QEMU even tries to guess what reset
VFIO_DEVICE_RESET might use in order to choose to do a hot-reset if it
seems like the device might only support a PM reset otherwise.

Changing the reset method of a device requires privilege, which is
maybe something we'd compromise on for no-iommu, but the general
expectation is that VFIO_DEVICE_RESET provides a device level scope and
hot-reset provides a... hot-reset, and sometimes those are the same
thing, but that doesn't mean we can lean on the former.

> If we don't support singleton dev_set hot-reset, noiommu devices in cdev
> path shall fail the hot-reset if empty-fd array is provided. But we may just
> document that empty-fd array does not work for noiommu. User should
> use the device fd array.

I don't see any replies to my comment on 08/12 where I again question
why we need an empty array option.  It's causing all sorts of headaches
and I don't see the justification for it beyond some hand waving that
it reduces complexity for the user.  This singleton dev-set notion
seems equally unjustified.  Do we just need to deal with hot-reset
being unsupported for no-iommu devices with iommufd?  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 14:04     ` Liu, Yi L
@ 2023-04-05 16:25       ` Alex Williamson
  2023-04-05 16:37         ` Jason Gunthorpe
  2023-04-05 17:58         ` Eric Auger
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 16:25 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 14:04:51 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Eric,
> 
> > From: Eric Auger <eric.auger@redhat.com>
> > Sent: Wednesday, April 5, 2023 8:20 PM
> > 
> > Hi Yi,
> > On 4/1/23 16:44, Yi Liu wrote:  
> > > for the users that accept device fds passed from management stacks to be
> > > able to figure out the host reset affected devices among the devices
> > > opened by the user. This is needed as such users do not have BDF (bus,
> > > devfn) knowledge about the devices it has opened, hence unable to use
> > > the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
> > > to figure out the affected devices.
> > >
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > ---
> > >  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
> > >  include/uapi/linux/vfio.h        | 24 ++++++++++++-
> > >  2 files changed, 74 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > > index 19f5b075d70a..a5a7e148dce1 100644
> > > --- a/drivers/vfio/pci/vfio_pci_core.c
> > > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > > @@ -30,6 +30,7 @@
> > >  #if IS_ENABLED(CONFIG_EEH)
> > >  #include <asm/eeh.h>
> > >  #endif
> > > +#include <uapi/linux/iommufd.h>
> > >
> > >  #include "vfio_pci_priv.h"
> > >
> > > @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct  
> > vfio_pci_core_device *vdev, int irq_typ  
> > >  	return 0;
> > >  }
> > >
> > > +static struct vfio_device *
> > > +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
> > > +			       struct pci_dev *pdev)
> > > +{
> > > +	struct vfio_device *cur;
> > > +
> > > +	lockdep_assert_held(&dev_set->lock);
> > > +
> > > +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > > +		if (cur->dev == &pdev->dev)
> > > +			return cur;
> > > +	return NULL;
> > > +}
> > > +
> > >  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
> > >  {
> > >  	(*(int *)data)++;
> > > @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void  
> > *data)  
> > >  struct vfio_pci_fill_info {
> > >  	int max;
> > >  	int cur;
> > > +	bool require_devid;
> > > +	struct iommufd_ctx *iommufd;
> > > +	struct vfio_device_set *dev_set;
> > >  	struct vfio_pci_dependent_device *devices;
> > >  };
> > >
> > >  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> > >  {
> > >  	struct vfio_pci_fill_info *fill = data;
> > > +	struct vfio_device_set *dev_set = fill->dev_set;
> > >  	struct iommu_group *iommu_group;
> > > +	struct vfio_device *vdev;
> > > +
> > > +	lockdep_assert_held(&dev_set->lock);
> > >
> > >  	if (fill->cur == fill->max)
> > >  		return -EAGAIN; /* Something changed, try again */
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > *data)  
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > >
> > > -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > > +	if (fill->require_devid) {
> > > +		/*
> > > +		 * Report dev_id of the devices that are opened as cdev
> > > +		 * and have the same iommufd with the fill->iommufd.
> > > +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
> > > +		 */
> > > +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
> > > +		if (vdev && vfio_device_cdev_opened(vdev) &&
> > > +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
> > > +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill-
> > >cur].dev_id);
> > > +		else
> > > +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
> > > +	} else {
> > > +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
> > > +	}
> > >  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
> > >  	fill->devices[fill->cur].bus = pdev->bus->number;
> > >  	fill->devices[fill->cur].devfn = pdev->devfn;
> > > @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
> > >  		return -ENOMEM;
> > >
> > >  	fill.devices = devices;
> > > +	fill.dev_set = vdev->vdev.dev_set;
> > >
> > > +	mutex_lock(&vdev->vdev.dev_set->lock);
> > > +	if (vfio_device_cdev_opened(&vdev->vdev)) {
> > > +		fill.require_devid = true;
> > > +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
> > > +	}
> > >  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
> > >  					    &fill, slot);
> > > +	mutex_unlock(&vdev->vdev.dev_set->lock);
> > >
> > >  	/*
> > >  	 * If a device was removed between counting and filling, we may come up
> > >  	 * short of fill.max.  If a device was added, we'll have a return of
> > >  	 * -EAGAIN above.
> > >  	 */
> > > -	if (!ret)
> > > +	if (!ret) {
> > >  		hdr.count = fill.cur;
> > > +		if (fill.require_devid)
> > > +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
> > > +	}
> > >
> > >  reset_info_exit:
> > >  	if (copy_to_user(arg, &hdr, minsz))
> > > @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct  
> > vfio_pci_core_device *vdev,  
> > >  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
> > >  {
> > >  	struct vfio_device_set *dev_set = data;
> > > -	struct vfio_device *cur;
> > >
> > > -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> > > -		if (cur->dev == &pdev->dev)
> > > -			return 0;
> > > -	return -EBUSY;
> > > +	lockdep_assert_held(&dev_set->lock);
> > > +
> > > +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
> > >  }
> > >
> > >  /*
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index 25432ef213ee..5a34364e3b94 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -650,11 +650,32 @@ enum {
> > >   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
> > >   *					      struct vfio_pci_hot_reset_info)
> > >   *
> > > + * This command is used to query the affected devices in the hot reset for
> > > + * a given device.  User could use the information reported by this command
> > > + * to figure out the affected devices among the devices it has opened.
> > > + * This command always reports the segment, bus and devfn information for
> > > + * each affected device, and selectively report the group_id or the dev_id
> > > + * per the way how the device being queried is opened.
> > > + *	- If the device is opened via the traditional group/container manner,
> > > + *	  this command reports the group_id for each affected device.
> > > + *
> > > + *	- If the device is opened as a cdev, this command needs to report  
> > s/needs to report/reports  
> 
> got it.
> 
> > > + *	  dev_id for each affected device and set the
> > > + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
> > > + *	  devices that are not opened as cdev or bound to different iommufds
> > > + *	  with the device that is queried, report an invalid dev_id to avoid  
> > s/bound to different iommufds with the device that is queried/bound to
> > iommufds different from the reset device one?  
> 
> hmmm, I'm not a native speaker here. This _INFO is to query if want
> hot reset a given device, what devices would be affected. So it appears
> the queried device is better. But I'd admit "the queried device" is also
> "the reset device". may Alex help pick one. 😊

	- If the calling device is opened directly via cdev rather than
	  accessed through the vfio group, the returned
	  vfio_pci_dependent_device structure reports the dev_id
	  rather than the group_id, which is indicated by the
	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag in
	  vfio_pci_hot_reset_info.  If the reset affects devices that
	  are not opened within the same iommufd context as the calling
	  device, IOMMUFD_INVALID_ID will be provided as the dev_id.

But that kind of brings to light the question of what does the user do
when they encounter this situation.  If the device is not opened, the
reset can complete.  If the device is opened by a different user, the
reset is blocked.  The only logical conclusion is that the user should
try the reset regardless of the result of the info ioctl, which the
null-array approach further solidifies as the direction of the API.
I'm not liking this.  Thanks,
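
To make the ambiguity concrete, here is a minimal userspace sketch of
walking the info result under the wording above.  The structs are local
mirrors of the proposed uAPI, not the real headers, and the value used
for the invalid id is only a placeholder assumption.  All the caller can
learn from an invalid dev_id is that the device is "not mine"; it cannot
tell an unopened device from one opened by another user.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirrors of the proposed uAPI; names and values are assumptions. */
#define HOT_RESET_FLAG_IOMMUFD_DEV_ID (1u << 0)
#define INVALID_DEV_ID UINT32_MAX /* placeholder for IOMMUFD_INVALID_ID */

struct dep_dev {
	uint32_t id;      /* group_id or dev_id, depending on flags */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

struct reset_info {
	uint32_t flags;
	uint32_t count;
	const struct dep_dev *devices;
};

/*
 * Count affected devices that the caller cannot identify by dev_id.
 * A non-zero result does not say whether the reset would succeed:
 * the device may be unopened (reset fine) or owned by someone else
 * (reset blocked).
 */
static size_t count_unidentified(const struct reset_info *info)
{
	size_t n = 0;

	assert(info != NULL);
	if (!(info->flags & HOT_RESET_FLAG_IOMMUFD_DEV_ID))
		return 0; /* group mode: every entry carries a group_id */

	for (uint32_t i = 0; i < info->count; i++)
		if (info->devices[i].id == INVALID_DEV_ID)
			n++;
	return n;
}
```
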

Alex


> > > + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> > > + *	  affected devices, user shall fall back to use the segment, bus and
> > > + *	  devfn info to map it to opened device.
> > > + *
> > >   * Return: 0 on success, -errno on failure:
> > >   *	-enospc = insufficient buffer, -enodev = unsupported for device.
> > >   */
> > >  struct vfio_pci_dependent_device {
> > > -	__u32	group_id;
> > > +	union {
> > > +		__u32   group_id;
> > > +		__u32	dev_id;
> > > +	};
> > >  	__u16	segment;
> > >  	__u8	bus;
> > >  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
> > >  struct vfio_pci_hot_reset_info {
> > >  	__u32	argsz;
> > >  	__u32	flags;
> > > +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
> > >  	__u32	count;
> > >  	struct vfio_pci_dependent_device	devices[];
> > >  };  
> > Eric  
> 


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 16:25       ` Alex Williamson
@ 2023-04-05 16:37         ` Jason Gunthorpe
  2023-04-05 16:52           ` Alex Williamson
  2023-04-05 17:58         ` Eric Auger
  1 sibling, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 16:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:

> But that kind of brings to light the question of what does the user do
> when they encounter this situation.

What does it do now when it encounters a group_id it doesn't
understand? Userspace already doesn't know if the foreign group is
open or not, right?

> reset can complete.  If the device is opened by a different user, the
> reset is blocked.  The only logical conclusion is that the user should
> try the reset regardless of the result of the info ioctl, which the

IMHO my suggested version is still the overall saner uAPI.

An info that basically returns success/fail if reset is security
authorized and information about the reset groupings.

Actual reset follows the returned groupings automatically.

Easy for qemu. Call the info at startup to confirm reset can be
emulated, use the returned information to propagate the reset groups
to the guest. Trigger the reset with no fuss when the guest asks for
it.

Less weird corner cases.

Jason

* Re: [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-04-05 15:36         ` Alex Williamson
@ 2023-04-05 16:46           ` Jason Gunthorpe
  0 siblings, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 16:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, Apr 05, 2023 at 09:36:46AM -0600, Alex Williamson wrote:

> > If we don't support singleton dev_set hot-reset, noiommu devices in cdev
> > path shall fail the hot-reset if empty-fd array is provided. But we may just
> > document that empty-fd array does not work for noiommu. User should
> > use the device fd array.
> 
> I don't see any replies to my comment on 08/12 where I again question
> why we need an empty array option.

I was pressing for empty-fd only, without the device fd array at all,
since it is such an ugly fit for the use cases we have.

But it is such a minor detail if you don't want it then take it out.

> This singleton dev-set notion seems equally unjustified.  Do we just
> need to deal with hot-reset being unsupported for no-iommu devices
> with iommufd?

It was to support no-iommu, if you want to de-support it then it can
go away too. AFAIK dpdk doesn't use this feature and it is the only
user we know of that has support for no-iommu so it is probably safe.

Jason

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 16:37         ` Jason Gunthorpe
@ 2023-04-05 16:52           ` Alex Williamson
  2023-04-05 17:23             ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 16:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 13:37:05 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> 
> > But that kind of brings to light the question of what does the user do
> > when they encounter this situation.  
> 
> What does it do now when it encounters a group_id it doesn't
> understand? Userspace already doesn't know if the foreign group is
> open or not, right?

It's simple, there is currently no screwiness around opened devices.
If the caller doesn't own all the groups mapping to the affected
devices, hot-reset is not available.

> > reset can complete.  If the device is opened by a different user, the
> > reset is blocked.  The only logical conclusion is that the user should
> > try the reset regardless of the result of the info ioctl, which the  
> 
> IMHO my suggested version is still the overall saner uAPI.
> 
> An info that basically returns success/fail if reset is security
> authorized and information about the reset groupings.
> 
> Actual reset follows the returned groupings automatically.
> 
> Easy for qemu. Call the info at startup to confirm reset can be
> emulated, use the returned information to propagate the reset groups
> to the guest. Trigger the reset with no fuss when the guest asks for
> it.
> 
> Less weird corner cases.

This leads to scenarios where the info ioctl indicates a hot-reset is
initially available, perhaps only because one of the affected devices
was not opened at the time, and now it fails when QEMU actually tries
to use it.  In the group model, QEMU can know the set of affected
devices and the required groups, confirm it owns those, and for all
practical purposes guarantee that a hot-reset is available (yes, there
might be some exceptionally rare topology changes).
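
As a rough illustration of the group-model check described above, and
not QEMU's actual code, the decision reduces to a subset test: every
group_id reported by the info ioctl must be in the set of groups the
caller owns.  The function and parameter names here are made up for the
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if group_id appears in the caller's owned-group list. */
static bool group_owned(uint32_t group_id,
			const uint32_t *owned, size_t n_owned)
{
	for (size_t i = 0; i < n_owned; i++)
		if (owned[i] == group_id)
			return true;
	return false;
}

/*
 * affected[] holds the group_id of each device reported by the info
 * ioctl (duplicates are fine).  Hot-reset is treated as available only
 * if the caller owns every one of those groups.
 */
static bool hot_reset_available(const uint32_t *affected, size_t n_affected,
				const uint32_t *owned, size_t n_owned)
{
	assert(affected != NULL || n_affected == 0);

	for (size_t i = 0; i < n_affected; i++)
		if (!group_owned(affected[i], owned, n_owned))
			return false;
	return true;
}
```

Because the answer depends only on groups the caller already holds, it
does not change when some other device in the set is opened later.
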

This goofiness around unopened devices and null-arrays is killing this
API.  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 16:52           ` Alex Williamson
@ 2023-04-05 17:23             ` Jason Gunthorpe
  2023-04-05 18:56               ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 17:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> On Wed, 5 Apr 2023 13:37:05 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > 
> > > But that kind of brings to light the question of what does the user do
> > > when they encounter this situation.  
> > 
> > What does it do now when it encounters a group_id it doesn't
> > understand? Userspace already doesn't know if the foreign group is
> > open or not, right?
> 
> It's simple, there is currently no screwiness around opened devices.
> If the caller doesn't own all the groups mapping to the affected
> devices, hot-reset is not available.

That still has nasty edge cases. If the reset group spans beyond a
single iommu group you end up with qemu being unable to operate reset
at all, and it is unfixable from an API perspective as we can't pass
in groups that VFIO isn't going to use.

I think you are right, the fact we'd have to return -1 dev_ids to this
modified API is pretty damaging, it doesn't seem like a good
direction.

> This leads to scenarios where the info ioctl indicates a hot-reset is
> initially available, perhaps only because one of the affected devices
> was not opened at the time, and now it fails when QEMU actually tries
> to use it.

I would like it if the APIs toward the kernel were only about the
kernel's security apparatus. It makes it easier to reason about the
kernel side and gives nice simple well defined APIs.

This is a good point that qemu needs to make a policy decision if it
is happy about the VFIO configuration - but that is a policy decision
that should not become entangled with the kernel's security checks.

Today qemu can make this policy choice the same way it does right now
- call _INFO and check the group_ids. It gets the exact same outcome
as today. We already discussed that we need to expose the group ID
through an ioctl someplace.

If this is too awkward we could add a query to the kernel if the cdev
is "reset exclusive" - eg the iommufd covers all the groups that span
the reset set.

Jason

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 16:25       ` Alex Williamson
  2023-04-05 16:37         ` Jason Gunthorpe
@ 2023-04-05 17:58         ` Eric Auger
  2023-04-06  5:31           ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Eric Auger @ 2023-04-05 17:58 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy



On 4/5/23 18:25, Alex Williamson wrote:
> On Wed, 5 Apr 2023 14:04:51 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>
>> Hi Eric,
>>
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Wednesday, April 5, 2023 8:20 PM
>>>
>>> Hi Yi,
>>> On 4/1/23 16:44, Yi Liu wrote:  
>>>> for the users that accept device fds passed from management stacks to be
>>>> able to figure out the host reset affected devices among the devices
>>>> opened by the user. This is needed as such users do not have BDF (bus,
>>>> devfn) knowledge about the devices it has opened, hence unable to use
>>>> the information reported by existing VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
>>>> to figure out the affected devices.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_core.c | 58 ++++++++++++++++++++++++++++----
>>>>  include/uapi/linux/vfio.h        | 24 ++++++++++++-
>>>>  2 files changed, 74 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index 19f5b075d70a..a5a7e148dce1 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -30,6 +30,7 @@
>>>>  #if IS_ENABLED(CONFIG_EEH)
>>>>  #include <asm/eeh.h>
>>>>  #endif
>>>> +#include <uapi/linux/iommufd.h>
>>>>
>>>>  #include "vfio_pci_priv.h"
>>>>
>>>> @@ -767,6 +768,20 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>>>>  	return 0;
>>>>  }
>>>>
>>>> +static struct vfio_device *
>>>> +vfio_pci_find_device_in_devset(struct vfio_device_set *dev_set,
>>>> +			       struct pci_dev *pdev)
>>>> +{
>>>> +	struct vfio_device *cur;
>>>> +
>>>> +	lockdep_assert_held(&dev_set->lock);
>>>> +
>>>> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
>>>> +		if (cur->dev == &pdev->dev)
>>>> +			return cur;
>>>> +	return NULL;
>>>> +}
>>>> +
>>>>  static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>>>>  {
>>>>  	(*(int *)data)++;
>>>> @@ -776,13 +791,20 @@ static int vfio_pci_count_devs(struct pci_dev *pdev, void *data)
>>>>  struct vfio_pci_fill_info {
>>>>  	int max;
>>>>  	int cur;
>>>> +	bool require_devid;
>>>> +	struct iommufd_ctx *iommufd;
>>>> +	struct vfio_device_set *dev_set;
>>>>  	struct vfio_pci_dependent_device *devices;
>>>>  };
>>>>
>>>>  static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>>>>  {
>>>>  	struct vfio_pci_fill_info *fill = data;
>>>> +	struct vfio_device_set *dev_set = fill->dev_set;
>>>>  	struct iommu_group *iommu_group;
>>>> +	struct vfio_device *vdev;
>>>> +
>>>> +	lockdep_assert_held(&dev_set->lock);
>>>>
>>>>  	if (fill->cur == fill->max)
>>>>  		return -EAGAIN; /* Something changed, try again */
>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>>>>  	if (!iommu_group)
>>>>  		return -EPERM; /* Cannot reset non-isolated devices */
>>>>
>>>> -	fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
>>>> +	if (fill->require_devid) {
>>>> +		/*
>>>> +		 * Report dev_id of the devices that are opened as cdev
>>>> +		 * and have the same iommufd with the fill->iommufd.
>>>> +		 * Otherwise, just fill IOMMUFD_INVALID_ID.
>>>> +		 */
>>>> +		vdev = vfio_pci_find_device_in_devset(dev_set, pdev);
>>>> +		if (vdev && vfio_device_cdev_opened(vdev) &&
>>>> +		    fill->iommufd == vfio_iommufd_physical_ictx(vdev))
>>>> +			vfio_iommufd_physical_devid(vdev, &fill->devices[fill->cur].dev_id);
>>>> +		else
>>>> +			fill->devices[fill->cur].dev_id = IOMMUFD_INVALID_ID;
>>>> +	} else {
>>>> +		fill->devices[fill->cur].group_id = iommu_group_id(iommu_group);
>>>> +	}
>>>>  	fill->devices[fill->cur].segment = pci_domain_nr(pdev->bus);
>>>>  	fill->devices[fill->cur].bus = pdev->bus->number;
>>>>  	fill->devices[fill->cur].devfn = pdev->devfn;
>>>> @@ -1230,17 +1266,27 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
>>>>  		return -ENOMEM;
>>>>
>>>>  	fill.devices = devices;
>>>> +	fill.dev_set = vdev->vdev.dev_set;
>>>>
>>>> +	mutex_lock(&vdev->vdev.dev_set->lock);
>>>> +	if (vfio_device_cdev_opened(&vdev->vdev)) {
>>>> +		fill.require_devid = true;
>>>> +		fill.iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
>>>> +	}
>>>>  	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_fill_devs,
>>>>  					    &fill, slot);
>>>> +	mutex_unlock(&vdev->vdev.dev_set->lock);
>>>>
>>>>  	/*
>>>>  	 * If a device was removed between counting and filling, we may come up
>>>>  	 * short of fill.max.  If a device was added, we'll have a return of
>>>>  	 * -EAGAIN above.
>>>>  	 */
>>>> -	if (!ret)
>>>> +	if (!ret) {
>>>>  		hdr.count = fill.cur;
>>>> +		if (fill.require_devid)
>>>> +			hdr.flags = VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID;
>>>> +	}
>>>>
>>>>  reset_info_exit:
>>>>  	if (copy_to_user(arg, &hdr, minsz))
>>>> @@ -2346,12 +2392,10 @@ static bool vfio_dev_in_files(struct vfio_pci_core_device *vdev,
>>>>  static int vfio_pci_is_device_in_set(struct pci_dev *pdev, void *data)
>>>>  {
>>>>  	struct vfio_device_set *dev_set = data;
>>>> -	struct vfio_device *cur;
>>>>
>>>> -	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
>>>> -		if (cur->dev == &pdev->dev)
>>>> -			return 0;
>>>> -	return -EBUSY;
>>>> +	lockdep_assert_held(&dev_set->lock);
>>>> +
>>>> +	return vfio_pci_find_device_in_devset(dev_set, pdev) ? 0 : -EBUSY;
>>>>  }
>>>>
>>>>  /*
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 25432ef213ee..5a34364e3b94 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -650,11 +650,32 @@ enum {
>>>>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 12,
>>>>   *					      struct vfio_pci_hot_reset_info)
>>>>   *
>>>> + * This command is used to query the affected devices in the hot reset for
>>>> + * a given device.  User could use the information reported by this command
>>>> + * to figure out the affected devices among the devices it has opened.
The 'opened' terminology does not look sufficient here: it is not only
a matter of the device being opened using cdev, it also needs to have
been bound to an iommufd, dev_id being the output of the dev-iommufd
binding.

By the way I am now confused. What happens if the reset impacts some
devices which are not bound to an iommu ctx? Previously we returned the
iommu group, which always pre-exists, but now you will report an invalid id?
>>>> + * This command always reports the segment, bus and devfn information for
>>>> + * each affected device, and selectively report the group_id or the dev_id
>>>> + * per the way how the device being queried is opened.
>>>> + *	- If the device is opened via the traditional group/container manner,
>>>> + *	  this command reports the group_id for each affected device.
>>>> + *
>>>> + *	- If the device is opened as a cdev, this command needs to report  
>>> s/needs to report/reports  
>> got it.
>>
>>>> + *	  dev_id for each affected device and set the
>>>> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the affected
>>>> + *	  devices that are not opened as cdev or bound to different iommufds
>>>> + *	  with the device that is queried, report an invalid dev_id to avoid  
or not bound at all
>>> s/bound to different iommufds with the device that is queried/bound to
>>> iommufds different from the reset device one?  
>> hmmm, I'm not a native speaker here. This _INFO is to query which
>> devices would be affected if one wants to hot reset a given device. So
>> it appears "the queried device" is better. But I'd admit "the queried
>> device" is also "the reset device". Maybe Alex can help pick one. 😊
> 	- If the calling device is opened directly via cdev rather than
> 	  accessed through the vfio group, the returned
> 	  vfio_pci_dependent_device structure reports the dev_id
> 	  rather than the group_id, which is indicated by the
> 	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag in
> 	  vfio_pci_hot_reset_info.  If the reset affects devices that
> 	  are not opened within the same iommufd context as the calling
> 	  device, IOMMUFD_INVALID_ID will be provided as the dev_id.
>
> But that kind of brings to light the question of what does the user do
> when they encounter this situation.  If the device is not opened, the
> reset can complete.  If the device is opened by a different user, the
> reset is blocked.  The only logical conclusion is that the user should
> try the reset regardless of the result of the info ioctl, which the
> null-array approach further solidifies as the direction of the API.
> I'm not liking this.  Thanks,
>
> Alex

Thanks

Eric
>
>
>>>> + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
>>>> + *	  affected devices, user shall fall back to use the segment, bus and
>>>> + *	  devfn info to map it to opened device.
>>>> + *
>>>>   * Return: 0 on success, -errno on failure:
>>>>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
>>>>   */
>>>>  struct vfio_pci_dependent_device {
>>>> -	__u32	group_id;
>>>> +	union {
>>>> +		__u32   group_id;
>>>> +		__u32	dev_id;
>>>> +	};
>>>>  	__u16	segment;
>>>>  	__u8	bus;
>>>>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
>>>> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
>>>>  struct vfio_pci_hot_reset_info {
>>>>  	__u32	argsz;
>>>>  	__u32	flags;
>>>> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
>>>>  	__u32	count;
>>>>  	struct vfio_pci_dependent_device	devices[];
>>>>  };  
>>> Eric  


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 17:23             ` Jason Gunthorpe
@ 2023-04-05 18:56               ` Alex Williamson
  2023-04-05 19:18                 ` Alex Williamson
  2023-04-05 19:21                 ` Jason Gunthorpe
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 18:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 14:23:43 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:
> > On Wed, 5 Apr 2023 13:37:05 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > >   
> > > > But that kind of brings to light the question of what does the user do
> > > > when they encounter this situation.    
> > > 
> > > What does it do now when it encounters a group_id it doesn't
> > > understand? Userspace already doesn't know if the foreign group is
> > > open or not, right?  
> > 
> > It's simple, there is currently no screwiness around opened devices.
> > If the caller doesn't own all the groups mapping to the affected
> > devices, hot-reset is not available.  
> 
> That still has nasty edge cases. If the reset group spans beyond a
> single iommu group you end up with qemu being unable to operate reset
> at all, and it is unfixable from an API perspective as we can't pass
> in groups that VFIO isn't going to use.

Hmm, s/nasty/niche/?  Yes, QEMU currently has no way to own a group
without assigning a device from the group, but technically that could
be fixed within QEMU.  If QEMU doesn't own that affected group, then it
can't very well count on that group to not be used in some other way
when it comes time to actually do a hot-reset.
 
> I think you are right, the fact we'd have to return -1 dev_ids to this
> modified API is pretty damaging, it doesn't seem like a good
> direction.
> 
> > This leads to scenarios where the info ioctl indicates a hot-reset is
> > initially available, perhaps only because one of the affected devices
> > was not opened at the time, and now it fails when QEMU actually tries
> > to use it.  
> 
> I would like it if the APIs toward the kernel were only about the
> kernel's security apparatus. It makes it easier to reason about the
> kernel side and gives nice simple well defined APIs.

Usability needs to be a consideration as well.  An interface where the
result is effectively arbitrary from a user perspective because the
kernel is solely focused on whether the operation is allowed,
evaluating constraints that the user is unaware of and cannot control,
is unusable.

> This is a good point that qemu needs to make a policy decision if it
> is happy about the VFIO configuration - but that is a policy decision
> that should not become entangled with the kernel's security checks.
> 
> Today qemu can make this policy choice the same way it does right now
> - call _INFO and check the group_ids. It gets the exact same outcome
> as today. We already discussed that we need to expose the group ID
> through an ioctl someplace.

QEMU can make a policy decision today because the kernel provides a
sufficiently reliable interface, ie. based on the set of owned groups, a
hot-reset is all but guaranteed to work.  If we focus only on whether a
given reset is allowed from a kernel perspective and ignore that
userspace needs some predictability of the kernel behavior, then QEMU
cannot reasonably make that policy decision.

> If this is too awkward we could add a query to the kernel if the cdev
> is "reset exclusive" - eg the iommufd covers all the groups that span
> the reset set.

That's essentially what we have if there are valid dev-ids for each
affected device in the info ioctl.  I don't think it helps the user
experience to create loopholes where the hot-reset ioctl can still work
in spite of those missing devices.  The group interface uses the fact
that ownership of the group implies ownership of all devices within the
group such that the user only needs to prove group ownership.

But we still have underlying groups even with the cdev model, with the
same ownership principles, so don't we just need to prove group
ownership based on a device fd rather than a group fd?

For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
capability chains, we could add a capability that reports the group ID
for the device.  The hot-reset info ioctl remains as it is today,
reporting group-ids and bdfs.  The hot-reset ioctl itself is modified to
transparently support either group fds or device fds.  The user can now
map cdevs to group-ids and therefore follow the same rules as groups,
providing at least one representative device fd for each group.  We've
essentially already enabled this by setting the limit of user provided
fds equal to the number of affected devices.
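
A rough userspace sketch of that flow, under the assumption that the
user has already learned each opened cdev's group ID through the
suggested (not yet existing) VFIO_DEVICE_GET_INFO capability.  The
struct and function names are hypothetical.  For each distinct group
reported by the hot-reset info ioctl, pick one owned device fd from
that group; if any affected group has no owned device, hot-reset is
treated as unavailable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An opened cdev and the group ID the hypothetical capability reported. */
struct owned_dev {
	int fd;
	uint32_t group_id;
};

/*
 * affected[] holds the group_id of each device from the info ioctl and
 * may contain duplicates.  Writes one representative fd per distinct
 * group into out_fds and returns how many were written, or -1 if some
 * affected group is not covered or out_fds is too small.
 */
static int pick_representative_fds(const uint32_t *affected, size_t n_affected,
				   const struct owned_dev *owned, size_t n_owned,
				   int *out_fds, size_t out_cap)
{
	size_t n_out = 0;

	assert(out_fds != NULL || out_cap == 0);

	for (size_t i = 0; i < n_affected; i++) {
		bool seen = false, found = false;

		/* Skip groups already handled earlier in the list. */
		for (size_t j = 0; j < i && !seen; j++)
			seen = (affected[j] == affected[i]);
		if (seen)
			continue;

		for (size_t k = 0; k < n_owned && !found; k++) {
			if (owned[k].group_id == affected[i]) {
				if (n_out == out_cap)
					return -1;
				out_fds[n_out++] = owned[k].fd;
				found = true;
			}
		}
		if (!found)
			return -1; /* caller does not own this group */
	}
	return (int)n_out;
}
```

The resulting fd array is what would be handed to the hot-reset ioctl
in place of group fds.
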

Does that work?  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 18:56               ` Alex Williamson
@ 2023-04-05 19:18                 ` Alex Williamson
  2023-04-05 19:21                 ` Jason Gunthorpe
  1 sibling, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 19:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 12:56:21 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 5 Apr 2023 14:23:43 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 05, 2023 at 10:52:15AM -0600, Alex Williamson wrote:  
> > > On Wed, 5 Apr 2023 13:37:05 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >     
> > > > On Wed, Apr 05, 2023 at 10:25:45AM -0600, Alex Williamson wrote:
> > > >     
> > > > > But that kind of brings to light the question of what does the user do
> > > > > when they encounter this situation.      
> > > > 
> > > > What does it do now when it encounters a group_id it doesn't
> > > > understand? Userspace already doesn't know if the foreign group is
> > > > open or not, right?    
> > > 
> > > It's simple, there is currently no screwiness around opened devices.
> > > If the caller doesn't own all the groups mapping to the affected
> > > devices, hot-reset is not available.    
> > 
> > That still has nasty edge cases. If the reset group spans beyond a
> > single iommu group you end up with qemu being unable to operate reset
> > at all, and it is unfixable from an API perspective as we can't pass
> > in groups that VFIO isn't going to use.  
> 
> Hmm, s/nasty/niche/?  Yes, QEMU currently has no way to own a group
> without assigning a device from the group, but technically that could
> be fixed within QEMU.  If QEMU doesn't own that affected group, then it
> can't very well count on that group to not be used in some other way
> when it comes time to actually do a hot-reset.
>  
> > I think you are right, the fact we'd have to return -1 dev_ids to this
> > modified API is pretty damaging, it doesn't seem like a good
> > direction.
> >   
> > > This leads to scenarios where the info ioctl indicates a hot-reset is
> > > initially available, perhaps only because one of the affected devices
> > > was not opened at the time, and now it fails when QEMU actually tries
> > > to use it.    
> > 
> > I would like it if the APIs toward the kernel were only about the
> > kernel's security apparatus. It makes it easier to reason about the
> > kernel side and gives nice simple well defined APIs.  
> 
> Usability needs to be a consideration as well.  An interface where the
> result is effectively arbitrary from a user perspective because the
> kernel is solely focused on whether the operation is allowed,
> evaluating constraints that the user is unaware of and cannot control,
> is unusable.
> 
> > This is a good point that qemu needs to make a policy decision if it
> > is happy about the VFIO configuration - but that is a policy decision
> > that should not become entangled with the kernel's security checks.
> > 
> > Today qemu can make this policy choice the same way it does right now
> > - call _INFO and check the group_ids. It gets the exact same outcome
> > as today. We already discussed that we need to expose the group ID
> > through an ioctl someplace.  
> 
> QEMU can make a policy decision today because the kernel provides a
> sufficiently reliable interface, ie. based on the set of owned groups, a
> hot-reset is all but guaranteed to work.  If we focus only on whether a
> given reset is allowed from a kernel perspective and ignore that
> userspace needs some predictability of the kernel behavior, then QEMU
> cannot reasonably make that policy decision.
> 
> > If this is too awkward we could add a query to the kernel if the cdev
> > is "reset exclusive" - eg the iommufd covers all the groups that span
> > the reset set.  
> 
> That's essentially what we have if there are valid dev-ids for each
> affected device in the info ioctl.  I don't think it helps the user
> experience to create loopholes where the hot-reset ioctl can still work
> in spite of those missing devices.  The group interface uses the fact
> that ownership of the group implies ownership of all devices within the
> group such that the user only needs to prove group ownership.
> 
> But we still have underlying groups even with the cdev model, with the
> same ownership principles, so don't we just need to prove group
> ownership based on a device fd rather than a group fd?
> 
> For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> capability chains, we could add a capability that reports the group ID
> for the device.  The hot-reset info ioctl remains as it is today,
> reporting group-ids and bdfs.  The hot-reset ioctl itself is modified to
> transparently support either group fds or device fds.  The user can now
> map cdevs to group-ids and therefore follow the same rules as groups,
> providing at least one representative device fd for each group.  We've
> essentially already enabled this by setting the limit of user provided
> fds equal to the number of affected devices.

If I'm not mistaken, I think this resolves cdev no-iommu to work
equivalently to groups as well.  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 18:56               ` Alex Williamson
  2023-04-05 19:18                 ` Alex Williamson
@ 2023-04-05 19:21                 ` Jason Gunthorpe
  2023-04-05 19:49                   ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 19:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> Usability needs to be a consideration as well.  An interface where the
> result is effectively arbitrary from a user perspective because the
> kernel is solely focused on whether the operation is allowed,
> evaluating constraints that the user is unaware of and cannot control,
> is unusable.

Considering this API is only invoked by qemu we might be overdoing
this usability and 'no shoot in foot' view.

> > This is a good point that qemu needs to make a policy decision if it
> > is happy about the VFIO configuration - but that is a policy decision
> > that should not become entangled with the kernel's security checks.
> > 
> > Today qemu can make this policy choice the same way it does right now
> > - call _INFO and check the group_ids. It gets the exact same outcome
> > as today. We already discussed that we need to expose the group ID
> > through an ioctl someplace.
> 
> QEMU can make a policy decision today because the kernel provides a
> sufficiently reliable interface, ie. based on the set of owned groups, a
> hot-reset is all but guaranteed to work.  

And we don't change that with cdev. If qemu wants to make the policy
decision it keeps using the exact same _INFO interface to make that
decision the same way it always has.

We weaken the actual reset action to only consider the security side.

Applications that want this exclusive reset group policy simply must
check it on their own. It is a reasonable API design.

> > If this is too awkward we could add a query to the kernel if the cdev
> > is "reset exclusive" - eg the iommufd covers all the groups that span
> > the reset set.
> 
> That's essentially what we have if there are valid dev-ids for each
> affected device in the info ioctl.

If you have dev-ids for everything, yes. If you don't, then you can't
make the same policy choice using a dev-id interface.

> I don't think it helps the user experience to create loopholes where
> the hot-reset ioctl can still work in spite of those missing
> devices.

I disagree. The easy straightforward design is that the reset ioctl
works if the process has security permissions. Mixing a policy check
into the kernel on this path is creating complexity we don't really
need.

I don't view it as a loophole, it is flexibility to use the API in a
way that is different from what qemu wants - eg an app like dpdk may
be willing to tolerate a reset group that becomes unavailable after
startup. Who knows, why should we force this in the kernel?

> For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> capability chains, we could add a capability that reports the group ID
> for the device.  

I was going to put that in an iommufd ioctl so it works with VDPA too,
but sure, let's assume we can get the group ID from a cdev fd.

> The hot-reset info ioctl remains as it is today, reporting group-ids
> and bdfs.

Sure, but userspace still needs to know how to map the reset sets into
dev-ids. Remember the reason we started doing this is because we don't
have easy access to the BDF anymore.

I like leaving this ioctl alone, let's go back to a dedicated ioctl to
return the dev_ids.

> The hot-reset ioctl itself is modified to transparently
> support either group fds or device fds.  The user can now map cdevs
> to group-ids and therefore follow the same rules as groups,
> providing at least one representative device fd for each group.

This looks like a very complex uapi compared to the empty list option,
but it seems like it would work.

Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 19:21                 ` Jason Gunthorpe
@ 2023-04-05 19:49                   ` Alex Williamson
  2023-04-05 23:22                     ` Jason Gunthorpe
  2023-04-06  6:34                     ` Liu, Yi L
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-05 19:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 5 Apr 2023 16:21:09 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> > Usability needs to be a consideration as well.  An interface where the
> > result is effectively arbitrary from a user perspective because the
> > kernel is solely focused on whether the operation is allowed,
> > evaluating constraints that the user is unaware of and cannot control,
> > is unusable.  
> 
> Considering this API is only invoked by qemu we might be overdoing
> this usability and 'no shoot in foot' view.

Ok, I'm not sure why we're diminishing the de facto vfio userspace...

> > > This is a good point that qemu needs to make a policy decision if it
> > > is happy about the VFIO configuration - but that is a policy decision
> > > that should not become entangled with the kernel's security checks.
> > > 
> > > Today qemu can make this policy choice the same way it does right now
> > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > as today. We already discussed that we need to expose the group ID
> > > through an ioctl someplace.  
> > 
> > QEMU can make a policy decision today because the kernel provides a
> > sufficiently reliable interface, ie. based on the set of owned groups, a
> > hot-reset is all but guaranteed to work.    
> 
> And we don't change that with cdev. If qemu wants to make the policy
> decision it keeps using the exact same _INFO interface to make that
> decision same it has always made.
> 
> We weaken the actual reset action to only consider the security side.
> 
> Applications that want this exclusive reset group policy simply must
> check it on their own. It is a reasonable API design.

I disagree, as I've argued before, the info ioctl becomes so weak and
effectively arbitrary from a user perspective at being able to predict
whether the hot-reset ioctl works that it becomes useless, diminishing
the entire hot-reset info/execute API.

> > > If this is too awkward we could add a query to the kernel if the cdev
> > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > the reset set.  
> > 
> > That's essentially what we have if there are valid dev-ids for each
> > affected device in the info ioctl.  
> 
> If you have dev-ids for everything, yes. If you don't, then you can't
> make the same policy choice using a dev-id interface.

Exactly, you can't make any policy choice because the success or
failure of the hot-reset ioctl can't be known.

> > I don't think it helps the user experience to create loopholes where
> > the hot-reset ioctl can still work in spite of those missing
> > devices.  
> 
> I disagree. The easy straightforward design is that the reset ioctl
> works if the process has security permissions. Mixing a policy check
> into the kernel on this path is creating complexity we don't really
> need.
> 
> I don't view it as a loophole, it is flexibility to use the API in a
> way that is different from what qemu wants - eg an app like dpdk may
> be willing to tolerate a reset group that becomes unavailable after
> startup. Who knows, why should we force this in the kernel?

Because look at all the problems it's causing to try to introduce these
loopholes without also introducing subtle bugs.  There's an argument
that we're overly strict, which is better than the alternative, which
seems to be what we're dabbling with.  It is a straightforward
interface for the hot-reset ioctl to mirror the information provided
via the hot-reset info ioctl.

> > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > capability chains, we could add a capability that reports the group ID
> > for the device.    
> 
> I was going to put that in an iommufd ioctl so it works with VDPA too,
> but sure, lets assume we can get the group ID from a cdev fd.
> 
> > The hot-reset info ioctl remains as it is today, reporting group-ids
> > and bdfs.  
> 
> Sure, but userspace still needs to know how to map the reset sets into
> dev-ids.

No, it doesn't. 

> Remember the reason we started doing this is because we don't
> have easy access to the BDF anymore.

We don't need it, the info ioctl provides the groups, the group
association can be learned from the DEVICE_GET_INFO ioctl, the
hot-reset ioctl only requires a single representative fd per affected
group.  dev-ids not required.
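
A minimal sketch of that rule, assuming userspace has already learned the
group ID behind each device fd it holds. The DEVICE_GET_INFO capability
that would report the group ID is only a proposal at this point, so
struct owned_dev and groups_covered() are hypothetical names used purely
for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one cdev fd the user holds, with the group ID it reported. */
struct owned_dev {
	int fd;
	uint32_t group_id;
};

/*
 * Return true if every group ID reported by the hot-reset info ioctl is
 * represented by at least one device fd held by the user.
 */
bool groups_covered(const uint32_t *affected, size_t n_affected,
		    const struct owned_dev *owned, size_t n_owned)
{
	/* An empty list needs no backing array; otherwise one is required. */
	assert(affected || n_affected == 0);

	for (size_t i = 0; i < n_affected; i++) {
		bool found = false;

		for (size_t j = 0; j < n_owned; j++) {
			if (owned[j].group_id == affected[i]) {
				found = true;
				break;
			}
		}
		if (!found)
			return false;
	}
	return true;
}
```

Several fds from the same group are harmless here; only one representative
per affected group is needed for the check to pass.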

> I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> return the dev_ids.

I don't see any justification for this.  We could add another PCI
specific DEVICE_GET_INFO capability to report the bdf if we really need
it, but reporting the group seems sufficient for this use case.

> > The hot-reset ioctl itself is modified to transparently
> > support either group fds or device fds.  The user can now map cdevs
> > to group-ids and therefore follow the same rules as groups,
> > providing at least one representative device fd for each group.  
> 
> This looks like a very complex uapi compared to the empty list option,
> but it seems like it would work.

It's the same API that we have now.  What's complex is trying to figure
out all the subtle side-effects from the loopholes that are being
proposed in this series.  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 19:49                   ` Alex Williamson
@ 2023-04-05 23:22                     ` Jason Gunthorpe
  2023-04-06 10:02                       ` Liu, Yi L
  2023-04-06  6:34                     ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-05 23:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:

> > > QEMU can make a policy decision today because the kernel provides a
> > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > hot-reset is all but guaranteed to work.    
> > 
> > And we don't change that with cdev. If qemu wants to make the policy
> > decision it keeps using the exact same _INFO interface to make that
> > decision same it has always made.
> > 
> > We weaken the actual reset action to only consider the security side.
> > 
> > Applications that want this exclusive reset group policy simply must
> > check it on their own. It is a reasonable API design.
> 
> I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API.

Reset should be strictly more permissive than INFO. If INFO predicts
reset is permitted then reset should succeed.

We don't change INFO so it cannot "becomes so weak"  ??

We don't care about the cases where INFO says it will not succeed but
reset does (temporarily) succeed.

I don't get what argument you are trying to make or what you think is
diminished.

Again, userspace calls INFO, if info says yes then reset *always
works*, exactly like today.

Userspace will call reset with a 0 length FD list and it uses a
security only check that is strictly more permissive than what
get_info will return. So the new check is simple in the kernel and
always works in the cases we need it to work.
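
A minimal sketch of the two checks being compared, under a simplified
model where ownership is tracked per device rather than per group or
per iommufd. struct reset_dev and both predicates are hypothetical names
for illustration, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in the reset set. */
struct reset_dev {
	bool opened;	/* some user has the device open */
	int owner;	/* ID of the owning context; meaningful only if opened */
};

/* Policy view: the caller owns every affected device. */
bool info_predicts_reset(const struct reset_dev *devs, size_t n, int caller)
{
	for (size_t i = 0; i < n; i++)
		if (!devs[i].opened || devs[i].owner != caller)
			return false;
	return true;
}

/* Security-only view: no opened device belongs to someone else. */
bool reset_permitted(const struct reset_dev *devs, size_t n, int caller)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].opened && devs[i].owner != caller)
			return false;
	return true;
}

/* The property argued above: a positive prediction implies a permitted reset. */
void check_implication(const struct reset_dev *devs, size_t n, int caller)
{
	assert(!info_predicts_reset(devs, n, caller) ||
	       reset_permitted(devs, n, caller));
}
```

In this model the only inputs on which the two predicates differ are
reset sets that contain an unopened device, which is the case the
security-only check deliberately lets through.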

What is getting things into trouble is insisting that RESET have
additional restrictions beyond the minimum checks required for
security.

> > I don't view it as a loophole, it is flexibility to use the API in a
> > way that is different from what qemu wants - eg an app like dpdk may
> > be willing to tolerate a reset group that becomes unavailable after
> > startup. Who knows, why should we force this in the kernel?
> 
> Because look at all the problems it's causing to try to introduce these
> loopholes without also introducing subtle bugs.

These problems are coming from trying to do this integrated version,
not from my approach!

AFAICT there was nothing wrong with my original plan of using the
empty fd list for reset. What Yi has here is some mashup of what you
and I both suggested.

> > Remember the reason we started doing this is because we don't
> > have easy access to the BDF anymore.
> 
> We don't need it, the info ioctl provides the groups, the group
> association can be learned from the DEVICE_GET_INFO ioctl, the
> hot-reset ioctl only requires a single representative fd per affected
> group.  dev-ids not required.

I'm not talking about triggering the ioctl.

I'm talking about whatever else qemu needs to do so that the VM is
aware of the reset groups device-by-device on its side so nested VFIO
in the VM reflects the same data as the hypervisor. Maybe it doesn't
do this right now, but the kernel API should continue to provide the
data.

> > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > return the dev_ids.
> 
> I don't see any justification for this.  We could add another PCI
> specific DEVICE_GET_INFO capability to report the bdf if we really need
> it, but reporting the group seems sufficient for this use case.

What I imagine is a single new ioctl 'get reset group 2' or something.
It returns a list of dev_ids in the reset group. It has an output flag
if the reset is reliable. This is the only ioctl user space needs to
call.

The reliable test is done by simply calling the ioctl and throwing
away the dev ids. The mapping of the VM's reset groups is done by
processing the dev_ids to vRIDs and flowing that into the VM somehow.

We don't expose group_ids, and we don't expose BDF. It is much simpler
and cleaner to use.
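
A rough sketch of what such an interface could look like from userspace.
Nothing here exists in the uapi: the struct layout, the flag name and
the helper functions are all hypothetical and only illustrate the shape
of the idea (a dev_id list plus a "reliable" output flag, then a
dev_id to vRID translation done by the VMM):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag and layout; nothing like this exists in the uapi. */
#define RESET_GROUP_FLAG_RELIABLE	(1u << 0)

struct reset_group_info {
	uint32_t argsz;
	uint32_t flags;		/* output: RESET_GROUP_FLAG_RELIABLE */
	uint32_t count;		/* number of entries in dev_ids[] */
	uint32_t dev_ids[];	/* iommufd dev_ids in the reset group */
};

/* Hypothetical userspace table from dev_id to the VM-visible requester ID. */
struct vrid_entry {
	uint32_t dev_id;
	uint16_t vrid;
};

/* Allocate a zeroed reply buffer with room for 'count' dev_ids. */
struct reset_group_info *reset_group_info_alloc(uint32_t count)
{
	size_t sz = sizeof(struct reset_group_info) +
		    (size_t)count * sizeof(uint32_t);
	struct reset_group_info *info = calloc(1, sz);

	if (info) {
		info->argsz = (uint32_t)sz;
		info->count = count;
	}
	return info;
}

/* The "reliable test": look at the flag and ignore the dev_ids. */
bool reset_is_reliable(const struct reset_group_info *info)
{
	return (info->flags & RESET_GROUP_FLAG_RELIABLE) != 0;
}

/* Translate each dev_id to a vRID; returns false if any dev_id is unknown. */
bool map_to_vrids(const struct reset_group_info *info,
		  const struct vrid_entry *tbl, size_t n_tbl, uint16_t *out)
{
	assert(info && out);

	for (uint32_t i = 0; i < info->count; i++) {
		size_t j;

		for (j = 0; j < n_tbl; j++)
			if (tbl[j].dev_id == info->dev_ids[i])
				break;
		if (j == n_tbl)
			return false;
		out[i] = tbl[j].vrid;
	}
	return true;
}
```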

A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
data too, it is just not as elegant and requires userspace to do a lot
more work to keep track of the 3 different identifiers.

> > This looks like a very complex uapi compared to the empty list option,
> > but it seems like it would work.
>
> It's the same API that we have now.  What's complex is trying to figure
> out all the subtle side-effects from the loopholes that are being
> proposed in this series.  Thanks,

I might agree with you if we weren't now going backwards - 
ideas didn't work out and Yi has to throw stuff away. :(

Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 17:58         ` Eric Auger
@ 2023-04-06  5:31           ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-06  5:31 UTC (permalink / raw)
  To: eric.auger, Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

Hi Eric,

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Thursday, April 6, 2023 1:58 AM
[...]
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 25432ef213ee..5a34364e3b94 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -650,11 +650,32 @@ enum {
> >>>>   * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IOWR(VFIO_TYPE, VFIO_BASE +
> 12,
> >>>>   *					      struct vfio_pci_hot_reset_info)
> >>>>   *
> >>>> + * This command is used to query the affected devices in the hot reset for
> >>>> + * a given device.  User could use the information reported by this command
> >>>> + * to figure out the affected devices among the devices it has opened.
> the 'opened' terminology does not look sufficient here because it is not
> only a matter of the device being opened using cdev but it also needs to
> have been bound to an iommufd, dev_id being the output of the
> dev-iommufd binding.
> 
> By the way I am now confused. What does happen if the reset impact some
> devices which are not bound to an iommu ctx. Previously we returned the
> iommu group which always pre-exists but now you will report invalid id?

For such devices, the user can use the bdf information to check whether
an affected device is opened by the user. If so, the user does the
necessary preparation on the device before issuing the hot reset.

Regards,
Yi Liu

> >>>> + * This command always reports the segment, bus and devfn information for
> >>>> + * each affected device, and selectively report the group_id or the dev_id
> >>>> + * per the way how the device being queried is opened.
> >>>> + *	- If the device is opened via the traditional group/container manner,
> >>>> + *	  this command reports the group_id for each affected device.
> >>>> + *
> >>>> + *	- If the device is opened as a cdev, this command needs to report
> >>> s/needs to report/reports
> >> got it.
> >>
> >>>> + *	  dev_id for each affected device and set the
> >>>> + *	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag.  For the
> affected
> >>>> + *	  devices that are not opened as cdev or bound to different iommufds
> >>>> + *	  with the device that is queried, report an invalid dev_id to avoid
> or not bound at all
> >>> s/bound to different iommufds with the device that is queried/bound to
> >>> iommufds different from the reset device one?
> >> hmmm, I'm not a native speaker here. This _INFO is to query if want
> >> hot reset a given device, what devices would be affected. So it appears
> >> the queried device is better. But I'd admit "the queried device" is also
> >> "the reset device". may Alex help pick one. 😊
> > 	- If the calling device is opened directly via cdev rather than
> > 	  accessed through the vfio group, the returned
> > 	  vfio_pci_dependent_device structure reports the dev_id
> > 	  rather than the group_id, which is indicated by the
> > 	  VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID flag in
> > 	  vfio_pci_hot_reset_info.  If the reset affects devices that
> > 	  are not opened within the same iommufd context as the calling
> > 	  device, IOMMUFD_INVALID_ID will be provided as the dev_id.
> >
> > But that kind of brings to light the question of what does the user do
> > when they encounter this situation.  If the device is not opened, the
> > reset can complete.  If the device is opened by a different user, the
> > reset is blocked.  The only logical conclusion is that the user should
> > try the reset regardless of the result of the info ioctl, which the
> > null-array approach further solidifies as the direction of the API.
> > I'm not liking this.  Thanks,
> >
> > Alex
> 
> Thanks
> 
> Eric
> >
> >
> >>>> + *	  potential dev_id conflict as dev_id is local to iommufd.  For such
> >>>> + *	  affected devices, user shall fall back to use the segment, bus and
> >>>> + *	  devfn info to map it to opened device.
> >>>> + *
> >>>>   * Return: 0 on success, -errno on failure:
> >>>>   *	-enospc = insufficient buffer, -enodev = unsupported for device.
> >>>>   */
> >>>>  struct vfio_pci_dependent_device {
> >>>> -	__u32	group_id;
> >>>> +	union {
> >>>> +		__u32   group_id;
> >>>> +		__u32	dev_id;
> >>>> +	};
> >>>>  	__u16	segment;
> >>>>  	__u8	bus;
> >>>>  	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
> >>>> @@ -663,6 +684,7 @@ struct vfio_pci_dependent_device {
> >>>>  struct vfio_pci_hot_reset_info {
> >>>>  	__u32	argsz;
> >>>>  	__u32	flags;
> >>>> +#define VFIO_PCI_HOT_RESET_FLAG_IOMMUFD_DEV_ID	(1 << 0)
> >>>>  	__u32	count;
> >>>>  	struct vfio_pci_dependent_device	devices[];
> >>>>  };
> >>> Eric



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 19:49                   ` Alex Williamson
  2023-04-05 23:22                     ` Jason Gunthorpe
@ 2023-04-06  6:34                     ` Liu, Yi L
  2023-04-06 17:07                       ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-06  6:34 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 6, 2023 3:50 AM
> 
> On Wed, 5 Apr 2023 16:21:09 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:
> > > Usability needs to be a consideration as well.  An interface where the
> > > result is effectively arbitrary from a user perspective because the
> > > kernel is solely focused on whether the operation is allowed,
> > > evaluating constraints that the user is unaware of and cannot control,
> > > is unusable.
> >
> > Considering this API is only invoked by qemu we might be overdoing
> > this usability and 'no shoot in foot' view.
> 
> Ok, I'm not sure why we're diminishing the de facto vfio userspace...
> 
> > > > This is a good point that qemu needs to make a policy decision if it
> > > > is happy about the VFIO configuration - but that is a policy decision
> > > > that should not become entangled with the kernel's security checks.
> > > >
> > > > Today qemu can make this policy choice the same way it does right now
> > > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > > as today. We already discussed that we need to expose the group ID
> > > > through an ioctl someplace.
> > >
> > > QEMU can make a policy decision today because the kernel provides a
> > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > hot-reset is all but guaranteed to work.
> >
> > And we don't change that with cdev. If qemu wants to make the policy
> > decision it keeps using the exact same _INFO interface to make that
> > decision same it has always made.
> >
> > We weaken the actual reset action to only consider the security side.
> >
> > Applications that want this exclusive reset group policy simply must
> > check it on their own. It is a reasonable API design.
> 
> I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API.
> 
> > > > If this is too awkward we could add a query to the kernel if the cdev
> > > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > > the reset set.
> > >
> > > That's essentially what we have if there are valid dev-ids for each
> > > affected device in the info ioctl.
> >
> > If you have dev-ids for everything, yes. If you don't, then you can't
> > make the same policy choice using a dev-id interface.
> 
> Exactly, you can't make any policy choice because the success or
> failure of the hot-reset ioctl can't be known.

Could you elaborate a bit on what the policy is here? As far as I know,
QEMU uses the information reported by _INFO to check:
- whether all the affected groups are owned by the current QEMU[1]
- whether the affected devices are opened by the current QEMU; if so, QEMU
  needs to call vfio_pci_pre_reset() to prepare them before issuing the
  hot reset[1]

[1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c
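
A simplified sketch of those two checks, not QEMU's actual code. struct
dep_dev is a local mirror of the pre-series struct
vfio_pci_dependent_device so that the example builds without kernel
headers; struct open_dev and both helpers are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the pre-series struct vfio_pci_dependent_device. */
struct dep_dev {
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Hypothetical record of a device the VMM already has open. */
struct open_dev {
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
	bool needs_pre_reset;
};

/* Check 1: every group reported by _INFO is one the VMM owns. */
bool all_groups_owned(const struct dep_dev *deps, size_t n,
		      const uint32_t *owned, size_t n_owned)
{
	assert(deps || n == 0);

	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < n_owned; j++)
			if (owned[j] == deps[i].group_id)
				break;
		if (j == n_owned)
			return false;
	}
	return true;
}

/*
 * Check 2: flag the opened devices that are in the reset set, matching on
 * segment/bus/devfn. Returns how many devices were flagged.
 */
size_t mark_affected(const struct dep_dev *deps, size_t n,
		     struct open_dev *opened, size_t n_opened)
{
	size_t hits = 0;

	for (size_t i = 0; i < n_opened; i++) {
		opened[i].needs_pre_reset = false;
		for (size_t j = 0; j < n; j++) {
			if (opened[i].segment == deps[j].segment &&
			    opened[i].bus == deps[j].bus &&
			    opened[i].devfn == deps[j].devfn) {
				opened[i].needs_pre_reset = true;
				hits++;
				break;
			}
		}
	}
	return hits;
}
```

Matching on segment/bus/devfn rather than on group_id means an opened
device that shares a group with an affected device, but is not itself
in the reset set, is left alone.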

> > > I don't think it helps the user experience to create loopholes where
> > > the hot-reset ioctl can still work in spite of those missing
> > > devices.
> >
> > I disagree. The easy straightforward design is that the reset ioctl
> > works if the process has security permissions. Mixing a policy check
> > into the kernel on this path is creating complexity we don't really
> > need.
> >
> > I don't view it as a loophole, it is flexibility to use the API in a
> > way that is different from what qemu wants - eg an app like dpdk may
> > be willing to tolerate a reset group that becomes unavailable after
> > startup. Who knows, why should we force this in the kernel?
> 
> Because look at all the problems it's causing to try to introduce these
> loopholes without also introducing subtle bugs.  There's an argument
> that we're overly strict, which is better than the alternative, which
> seems to be what we're dabbling with.  It is a straightforward
> interface for the hot-reset ioctl to mirror the information provided
> via the hot-reset info ioctl.
> 
> > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > > capability chains, we could add a capability that reports the group ID
> > > for the device.
> >
> > I was going to put that in an iommufd ioctl so it works with VDPA too,
> > but sure, lets assume we can get the group ID from a cdev fd.
> >
> > > The hot-reset info ioctl remains as it is today, reporting group-ids
> > > and bdfs.
> >
> > Sure, but userspace still needs to know how to map the reset sets into
> > dev-ids.
> 
> No, it doesn't.
> 
> > Remember the reason we started doing this is because we don't
> > have easy access to the BDF anymore.
> 
> We don't need it, the info ioctl provides the groups, the group
> association can be learned from the DEVICE_GET_INFO ioctl, the
> hot-reset ioctl only requires a single representative fd per affected
> group.  dev-ids not required.
> 
> > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > return the dev_ids.
> 
> I don't see any justification for this.  We could add another PCI
> specific DEVICE_GET_INFO capability to report the bdf if we really need
> it, but reporting the group seems sufficient for this use case.

IMHO, knowledge of the group may not be enough. Take QEMU as an example.
QEMU not only needs to ensure that it owns the group, it also needs to
prepare the devices that are already in use and are affected by the hot
reset of a newly opened device. With only group knowledge, QEMU may
blindly prepare all the already-opened devices that belong to the same
iommu group. But as I understood from the discussion, an iommu group is
not the same as the hot reset scope (a.k.a. dev_set), is it? The devices
in one iommu_group may span multiple hot reset scopes. In that case,
getting the bdf info from the cdev fd is necessary.

Regards,
Yi Liu



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-05 23:22                     ` Jason Gunthorpe
@ 2023-04-06 10:02                       ` Liu, Yi L
  2023-04-06 17:53                         ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-06 10:02 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 6, 2023 7:23 AM
> 
> On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> 
> > > > QEMU can make a policy decision today because the kernel provides a
> > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > hot-reset is all but guaranteed to work.
> > >
> > > And we don't change that with cdev. If qemu wants to make the policy
> > > decision it keeps using the exact same _INFO interface to make that
> > > decision same it has always made.
> > >
> > > We weaken the actual reset action to only consider the security side.
> > >
> > > Applications that want this exclusive reset group policy simply must
> > > check it on their own. It is a reasonable API design.
> >
> > I disagree, as I've argued before, the info ioctl becomes so weak and
> > effectively arbitrary from a user perspective at being able to predict
> > whether the hot-reset ioctl works that it becomes useless, diminishing
> > the entire hot-reset info/execute API.
> 
> reset should be strictly more permissive than INFO. If INFO predicts
> reset is permitted then reset should succeed.
> 
> We don't change INFO so it cannot "becomes so weak"  ??
> 
> We don't care about the cases where INFO says it will not succeed but
> reset does (temporarily) succeed.
> 
> I don't get what argument you are trying to make or what you think is
> diminished..
> 
> Again, userspace calls INFO, if info says yes then reset *always
> works*, exactly just like today.
>
> Userspace will call reset with a 0 length FD list and it uses a
> security only check that is strictly more permissive than what
> get_info will return. So the new check is simple in the kernel and
> always works in the cases we need it to work.
> 
> What is getting things into trouble is insisting that RESET have
> additional restrictions beyond the minimum checks required for
> security.
> 
> > > I don't view it as a loophole, it is flexibility to use the API in a
> > > way that is different from what qemu wants - eg an app like dpdk may
> > > be willing to tolerate a reset group that becomes unavailable after
> > > startup. Who knows, why should we force this in the kernel?
> >
> > Because look at all the problems it's causing to try to introduce these
> > loopholes without also introducing subtle bugs.
> 
> These problems are coming from trying to do this integrated version,
> not from my approach!
> 
> AFAICT there was nothing wrong with my original plan of using the
> empty fd list for reset. What Yi has here is some mashup of what you
> and I both suggested.

Hi Alex, Jason,

That could be the reason. Let me try to summarize the changes this series
makes and their impact, as far as I know.

1) Only check the ownership of opened devices in the dev_set
     in the HOT_RESET ioctl.
     - Impact: it changes the relationship between _INFO and HOT_RESET.
       Per "Each group must have IOMMU protection established for the
       ioctl to succeed." in [1], the existing design means userspace
       should own all the affected groups before attempting HOT_RESET.
       With the change here, the user does not need to ensure that all
       affected groups are opened; hot reset succeeds as long as the
       devices in those groups are simply unopened and can be reset.
    
       [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/

2) Allow passing zero-length fd array to do hot reset
    - Impact: this uses the iommufd as ownership check in the kernel side.
      It is only supposed to be used by the users that open cdev instead of
      users that open group. The drawback is that it cannot cover the noiommu
      devices as noiommu does not use iommufd at all. But it works well for
      most cases.

3) Allow hot reset to succeed when the dev_set is a singleton
     - Impact: this makes sense, but it seems to blur the boundary between
     the group path and the cdev path w.r.t. the zero-length fd approach.
     The group path can succeed in doing a hot reset even when passing an
     empty fd array, if the dev_set happens to be a singleton.

4) Allow passing device fds to do hot reset
    - Impact: this is a new way to do hot reset. It should have no impact.

5) Extend _INFO to report devid
    - Impact: this changes how the user decodes the reported info.
    devid or groupid is returned depending on how the queried device was
    opened. It was suggested that we support the scenario in which some
    devices are opened via cdev while others are opened via group, so we
    return invalid_devid for a device that is opened via group if
    it is affected by the hot reset of a device that is opened via cdev.
    
    This was proposed to support the future device fd passing usage, which is
    only available in the cdev path.
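
To make 1) to 3) concrete, here is a rough userspace model of the
ownership decision as summarized above. It is only a sketch, not the
kernel code: struct model_dev, hot_reset_allowed() and the integer ids
are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in a dev_set (not a kernel structure). */
struct model_dev {
	bool opened;     /* is the device currently opened by anyone? */
	int group_id;    /* iommu group the device belongs to */
	int iommufd_id;  /* iommufd it is bound to, or -1 (group path/noiommu) */
};

static bool group_listed(int group_id, const int *groups, size_t n_groups)
{
	for (size_t i = 0; i < n_groups; i++)
		if (groups[i] == group_id)
			return true;
	return false;
}

/*
 * Toy version of the ownership check:
 *  - n_groups == 0 models the zero-length fd array: every opened device
 *    must be bound to caller_iommufd (change 2).
 *  - n_groups > 0 models the group fd array: the group of every opened
 *    device must be listed.
 *  - unopened devices are skipped (change 1).
 *  - a singleton dev_set needs no check at all (change 3).
 */
static bool hot_reset_allowed(const struct model_dev *set, size_t n_devs,
			      const int *groups, size_t n_groups,
			      int caller_iommufd)
{
	assert(set != NULL && n_devs > 0);

	if (n_devs == 1)
		return true;

	for (size_t i = 0; i < n_devs; i++) {
		if (!set[i].opened)
			continue;
		if (n_groups == 0) {
			if (caller_iommufd < 0 ||
			    set[i].iommufd_id != caller_iommufd)
				return false;
		} else if (!group_listed(set[i].group_id, groups, n_groups)) {
			return false;
		}
	}
	return true;
}
```

In this model, a two-device dev_set whose second device is unopened
passes the check with either method, and starts failing as soon as
someone else opens the second device. That is the changed relationship
with _INFO described in 1).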

To me the major confusion is from 1) and 3). 1) changes the meaning of
_INFO and HOT_RESET, while 3) messes up the boundary.

Here is my thought:

For 1), it was proposed for the reason in [2]. We'd like a scenario
that works in the group path to work in the cdev path as well. But IMHO, we
may just accept that the cdev path cannot support such a scenario, to avoid
a subtle change to the uapi. Otherwise, we need another HOT_RESET ioctl or a
hint in the HOT_RESET ioctl to tell the kernel whether a relaxed ownership
check is expected. Maybe this is awkward. But if we want to keep it, the
user should opt in to it knowingly.

[2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

For 3), it was proposed when discussing hot reset for noiommu[3]. But
it does not make hot reset always work for noiommu in cdev, only when
the dev_set is a singleton. So it is more of a general optimization that
lets the kernel skip the ownership check. But to make use of it, we may
need to test for it before sanitizing the group fds from the user or doing
the iommufd check. Maybe the dev_set singleton test in this series is not
well placed. If so, I can modify it further.

[3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/

Regards,
Yi Liu

> 
> > > Remember the reason we started doing this is because we don't
> > > have easy access to the BDF anymore.
> >
> > We don't need it, the info ioctl provides the groups, the group
> > association can be learned from the DEVICE_GET_INFO ioctl, the
> > hot-reset ioctl only requires a single representative fd per affected
> > group.  dev-ids not required.
> 
> I'm not talking about triggering the ioctl.
> 
> I'm talking about whatever else qemu needs to do so that the VM is
> aware of the reset groups device-by-device on its side so nested VFIO
> in the VM reflects the same data as the hypervisor. Maybe it doesn't
> do this right now, but the kernel API should continue to provide the
> data.
> 
> > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > > return the dev_ids.
> >
> > I don't see any justification for this.  We could add another PCI
> > specific DEVICE_GET_INFO capability to report the bdf if we really need
> > it, but reporting the group seems sufficient for this use case.
> 
> What I imagine is a single new ioctl 'get reset group 2' or something.
> It returns a list of dev_ids in the reset group. It has an output flag
> if the reset is reliable. This is the only ioctl user space needs to
> call.
> 
> The reliable test is done by simply calling the ioctl and throwing
> away the dev ids. The mapping of the VM's reset groups is done by
> processing the dev_ids to vRIDs and flowing that into the VM somehow.
> 
> We don't expose group_ids, and we don't expose BDF. It is much simpler
> and cleaner to use.
> 
> A BDF DEVICE_GET_INFO and the existing reset INFO will encode the same
> data too, it is just not as elegant and requires userspace to do a lot
> more work to keep track of the 3 different identifiers.
> 
> > > This looks like a very complex uapi compared to the empty list option,
> > > but it seems like it would work.
> >
> > It's the same API that we have now.  What's complex is trying to figure
> > out all the subtle side-effects from the loopholes that are being
> > proposed in this series.  Thanks,
> 
> I might agree with you if we weren't now going backwards -
> ideas didn't work out and Yi has to throw stuff away. :(
> 
> Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-06  6:34                     ` Liu, Yi L
@ 2023-04-06 17:07                       ` Alex Williamson
  0 siblings, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-06 17:07 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, 6 Apr 2023 06:34:08 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 6, 2023 3:50 AM
> > 
> > On Wed, 5 Apr 2023 16:21:09 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Wed, Apr 05, 2023 at 12:56:21PM -0600, Alex Williamson wrote:  
> > > > Usability needs to be a consideration as well.  An interface where the
> > > > result is effectively arbitrary from a user perspective because the
> > > > kernel is solely focused on whether the operation is allowed,
> > > > evaluating constraints that the user is unaware of and cannot control,
> > > > is unusable.  
> > >
> > > Considering this API is only invoked by qemu we might be overdoing
> > > this usability and 'no shoot in foot' view.  
> > 
> > Ok, I'm not sure why we're diminishing the de facto vfio userspace...
> >   
> > > > > This is a good point that qemu needs to make a policy decision if it
> > > > > is happy about the VFIO configuration - but that is a policy decision
> > > > > that should not become entangled with the kernel's security checks.
> > > > >
> > > > > Today qemu can make this policy choice the same way it does right now
> > > > > - call _INFO and check the group_ids. It gets the exact same outcome
> > > > > as today. We already discussed that we need to expose the group ID
> > > > > through an ioctl someplace.  
> > > >
> > > > QEMU can make a policy decision today because the kernel provides a
> > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > hot-reset is all but guaranteed to work.  
> > >
> > > And we don't change that with cdev. If qemu wants to make the policy
> > > decision it keeps using the exact same _INFO interface to make that
> > > decision same it has always made.
> > >
> > > We weaken the actual reset action to only consider the security side.
> > >
> > > Applications that want this exclusive reset group policy simply must
> > > check it on their own. It is a reasonable API design.  
> > 
> > I disagree, as I've argued before, the info ioctl becomes so weak and
> > effectively arbitrary from a user perspective at being able to predict
> > whether the hot-reset ioctl works that it becomes useless, diminishing
> > the entire hot-reset info/execute API.
> >   
> > > > > If this is too awkward we could add a query to the kernel if the cdev
> > > > > is "reset exclusive" - eg the iommufd covers all the groups that span
> > > > > the reset set.  
> > > >
> > > > That's essentially what we have if there are valid dev-ids for each
> > > > affected device in the info ioctl.  
> > >
> > > If you have dev-ids for everything, yes. If you don't, then you can't
> > > make the same policy choice using a dev-id interface.  
> > 
> > Exactly, you can't make any policy choice because the success or
> > failure of the hot-reset ioctl can't be known.  
> 
> Could you elaborate a bit on what the policy is here? As far as I know,
> QEMU makes use of the information reported by _INFO to check:
> - if all the affected groups are owned by the current QEMU[1]
> - if the affected devices are opened by the current QEMU; if yes, QEMU
>   needs to use vfio_pci_pre_reset() to do preparation before issuing
>   hot reset[1]
> 
> [1] vfio_pci_hot_reset() in https://github.com/qemu/qemu/blob/master/hw/vfio/pci.c

Regarding the policy decisions, look for instance at the distinction
between vfio_pci_hot_reset_one() vs vfio_pci_hot_reset_multi(), or the
way QEMU will opt for a bus reset if it believes only a PM reset is
available.

In my proposal, I did miss that, with _INFO reporting the group and bdf,
QEMU can associate fd passed devices to a group affected by the
reset, but cannot tell specifically whether a given device is affected
by the reset.  I think that would be justification for capabilities on
the DEVICE_GET_INFO ioctl to report both the group and PCI address as
separate capabilities.
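
As a rough illustration of how userspace could use the two pieces of
information, assuming it can learn a (group, PCI address) pair for each
device it has opened.  The struct and function names below are
hypothetical, not existing uapi:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical (group, PCI address) pair; loosely mirrors what the
 * hot-reset info ioctl reports per affected device. */
struct dev_ident {
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

static bool same_bdf(const struct dev_ident *a, const struct dev_ident *b)
{
	return a->segment == b->segment && a->bus == b->bus &&
	       a->devfn == b->devfn;
}

/* bdf match: is this opened device itself in the reset scope? */
static bool directly_affected(const struct dev_ident *opened,
			      const struct dev_ident *affected, size_t n)
{
	assert(opened != NULL);
	for (size_t i = 0; i < n; i++)
		if (same_bdf(opened, &affected[i]))
			return true;
	return false;
}

/* group match: does every affected device sit in a group the user owns? */
static bool all_groups_owned(const struct dev_ident *affected, size_t n,
			     const uint32_t *owned, size_t n_owned)
{
	for (size_t i = 0; i < n; i++) {
		bool found = false;

		for (size_t j = 0; j < n_owned; j++)
			if (owned[j] == affected[i].group_id)
				found = true;
		if (!found)
			return false;
	}
	return true;
}
```

An opened device can share a group with an affected device and still
fail directly_affected(), which is the case the group alone cannot
distinguish.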
 
> > > > I don't think it helps the user experience to create loopholes where
> > > > the hot-reset ioctl can still work in spite of those missing
> > > > devices.  
> > >
> > > I disagree. The easy straightforward design is that the reset ioctl
> > > works if the process has security permissions. Mixing a policy check
> > > into the kernel on this path is creating complexity we don't really
> > > need.
> > >
> > > I don't view it as a loophole, it is flexibility to use the API in a
> > > way that is different from what qemu wants - eg an app like dpdk may
> > > be willing to tolerate a reset group that becomes unavailable after
> > > startup. Who knows, why should we force this in the kernel?  
> > 
> > Because look at all the problems it's causing to try to introduce these
> > loopholes without also introducing subtle bugs.  There's an argument
> > that we're overly strict, which is better than the alternative, which
> > seems to be what we're dabbling with.  It is a straightforward
> > interface for the hot-reset ioctl to mirror the information provided
> > via the hot-reset info ioctl.
> >   
> > > > For example, we have a VFIO_DEVICE_GET_INFO ioctl that supports
> > > > capability chains, we could add a capability that reports the group ID
> > > > for the device.  
> > >
> > > I was going to put that in an iommufd ioctl so it works with VDPA too,
> > > but sure, lets assume we can get the group ID from a cdev fd.
> > >  
> > > > The hot-reset info ioctl remains as it is today, reporting group-ids
> > > > and bdfs.  
> > >
> > > Sure, but userspace still needs to know how to map the reset sets into
> > > dev-ids.  
> > 
> > No, it doesn't.
> >   
> > > Remember the reason we started doing this is because we don't
> > > have easy access to the BDF anymore.  
> > 
> > We don't need it, the info ioctl provides the groups, the group
> > association can be learned from the DEVICE_GET_INFO ioctl, the
> > hot-reset ioctl only requires a single representative fd per affected
> > group.  dev-ids not required.
> >   
> > > I like leaving this ioctl alone, lets go back to a dedicated ioctl to
> > > return the dev_ids.  
> > 
> > I don't see any justification for this.  We could add another PCI
> > specific DEVICE_GET_INFO capability to report the bdf if we really need
> > it, but reporting the group seems sufficient for this use case.  
> 
> IMHO, knowledge of the group may not be enough. Take QEMU as an example.
> QEMU not only needs to ensure the group is owned by it, it also needs to
> do preparation on the devices that are already in use and affected by
> the hot reset of a newly opened device. If there is only group knowledge,
> QEMU may blindly prepare all the devices that are already opened and
> belong to the same iommu group. But as I understood from the discussion,
> an iommu group is not equal to the hot reset scope (a.k.a. dev_set), is it?
> It is possible that the devices in an iommu_group span multiple hot
> reset scopes. For such a case, getting bdf info from the cdev fd is necessary.

Yes, you're correct, group and reset scope are not equivalent, so we'd
require a means to get both the group and the bdf for the device.
Knowing the bdf allows the user to know which opened devices are
directly affected by the reset, knowing the group allows the user to
know if ancillary affected devices are within the set of groups the
user owns and therefore effectively under their purview.  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-06 10:02                       ` Liu, Yi L
@ 2023-04-06 17:53                         ` Alex Williamson
  2023-04-07 10:09                           ` Liu, Yi L
  2023-04-11 13:24                           ` Jason Gunthorpe
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-06 17:53 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Thu, 6 Apr 2023 10:02:10 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 6, 2023 7:23 AM
> > 
> > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> >   
> > > > > QEMU can make a policy decision today because the kernel provides a
> > > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > > hot-reset is all but guaranteed to work.  
> > > >
> > > > And we don't change that with cdev. If qemu wants to make the policy
> > > > decision it keeps using the exact same _INFO interface to make that
> > > > decision same it has always made.
> > > >
> > > > We weaken the actual reset action to only consider the security side.
> > > >
> > > > Applications that want this exclusive reset group policy simply must
> > > > check it on their own. It is a reasonable API design.  
> > >
> > > I disagree, as I've argued before, the info ioctl becomes so weak and
> > > effectively arbitrary from a user perspective at being able to predict
> > > whether the hot-reset ioctl works that it becomes useless, diminishing
> > > the entire hot-reset info/execute API.  
> > 
> > reset should be strictly more permissive than INFO. If INFO predicts
> > reset is permitted then reset should succeed.
> > 
> > We don't change INFO so it cannot "becomes so weak"  ??
> > 
> > We don't care about the cases where INFO says it will not succeed but
> > reset does (temporarily) succeed.
> > 
> > I don't get what argument you are trying to make or what you think is
> > diminished..
> > 
> > Again, userspace calls INFO, if info says yes then reset *always
> > works*, exactly just like today.
> >
> > Userspace will call reset with a 0 length FD list and it uses a
> > security only check that is strictly more permissive than what
> > get_info will return. So the new check is simple in the kernel and
> > always works in the cases we need it to work.
> > 
> > What is getting things into trouble is insisting that RESET have
> > additional restrictions beyond the minimum checks required for
> > security.
> >   
> > > > I don't view it as a loophole, it is flexibility to use the API in a
> > > > way that is different from what qemu wants - eg an app like dpdk may
> > > > be willing to tolerate a reset group that becomes unavailable after
> > > > startup. Who knows, why should we force this in the kernel?  
> > >
> > > Because look at all the problems it's causing to try to introduce these
> > > loopholes without also introducing subtle bugs.  
> > 
> > > These problems are coming from trying to do this integrated version,
> > not from my approach!
> > 
> > AFAICT there was nothing wrong with my original plan of using the
> > empty fd list for reset. What Yi has here is some mashup of what you
> > and I both suggested.  
> 
> Hi Alex, Jason,
> 
> could be this reason. So let me try to gather the changes of this series
> does and the impact as far as I know.
> 
> 1) only check the ownership of opened devices in the dev_set
>      in HOT_RESET ioctl.
>      - Impact: it changes the relationship between _INFO  and HOT_RESET.
>        As " Each group must have IOMMU protection established for the
>        ioctl to succeed." in [1], existing design actually means userspace
>        should own all the affected groups before heading to do HOT_RESET.
>        With the change here, the user does not need to ensure all affected
>        groups are opened and it can do hot-reset successfully as long as the
>        devices in the affected group are just un-opened and can be reset.
>     
>        [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/

Where whether a device is opened is subject to change outside of the
user's control.  This essentially allows the user to perform hot-resets
of devices outside of their ownership so long as the device is not
used elsewhere, versus the current requirement that the user own all the
affected groups, which implies device ownership.  It's not been
justified why this feature needs to exist, imo.
 
> 2) Allow passing zero-length fd array to do hot reset
>     - Impact: this uses the iommufd as ownership check in the kernel side.
>       It is only supposed to be used by the users that open cdev instead of
>       users that open group. The drawback is that it cannot cover the noiommu
>       devices as noiommu does not use iommufd at all. But it works well for
>       most cases.

The "only supposed to be used" is problematic here, we're extending all
the interfaces to transparently accept group and device fds, but here
we need to make a distinction because the ioctl needs to perform one
way for groups and another way for devices, which it currently doesn't
do.  As above, I've not seen sufficient justification for this other
than references to reducing complexity, but the only userspace expected
to make use of this interface already has equivalent complexity.
 
> 3) Allow hot reset be successful when the dev_set is singleton
>      - Impact: this makes sense but it seems to mess up the boundary between
>      the group path and cdev path w.r.t. the usage of zero-length fd approach.
>      The group path can succeed to do hot reset even if it is passing an empty
>      fd array if the dev_set happens to be singleton.

Again, what is the justification for requiring this?  It seems to be
only a hack towards no-iommu support with cdev, which we can achieve by
other means.  Why have we not needed this in the group model?  It
introduces subtle loopholes, so while maybe we could, I don't see why we
should, therefore I cannot agree with "this makes sense".

> 4) Allow passing device fd to do hot reset
>     - Impact: this is a new way for hot reset. should have no impact.
> 
> 5) Extend the _INFO to report devid
>     - Impact: this changes the way user to decode the info reported back.
>     devid and groupid are returned per the way the queried device is opened.
>     Since it was suggested to support the scenario in which some devices
>     are opened via cdev while some devices are opened via group. This makes
>     us to return invalid_devid for the device that is opened via group if
>     it is affected by the hot reset of a device that is opened via cdev.
>     
>     This was proposed to support the future device fd passing usage which is
>     only available in cdev path.

I think this is fundamentally flawed because of the scope of the
dev-id.  We can only provide dev-ids for devices which belong to the
same iommufd of the calling device, thus there are multiple instances
where no dev-id can be provided.  The group-id and bdf are static
properties of the devices, regardless of their ownership.  The bdf
provides the specific device level association while the group-id
indicates implied, static ownership.
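
A small sketch of the scoping problem, using a made-up model rather
than any real iommufd structure: the dev-id reported for an affected
device depends on who is asking, while the group-id and bdf do not.
The invalid value and all names here are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_DEVID 0u /* hypothetical "no dev-id available" value */

/* Hypothetical per-device state; none of this is real kernel or uapi. */
struct scoped_dev {
	uint32_t group_id; /* static property of the device */
	uint16_t bdf;      /* static property of the device */
	int iommufd_id;    /* iommufd the device is bound to, -1 if none */
	uint32_t dev_id;   /* id within that iommufd, meaningless elsewhere */
};

/* What a dev-id based info ioctl could report for dev, as seen by a
 * caller whose own device is bound to caller_iommufd. */
static uint32_t reported_devid(const struct scoped_dev *dev,
			       int caller_iommufd)
{
	assert(dev != NULL);
	if (caller_iommufd < 0 || dev->iommufd_id != caller_iommufd)
		return MODEL_INVALID_DEVID;
	return dev->dev_id;
}
```

A device that is unopened, opened through a group, or bound to another
iommufd collapses to the same invalid value in this model, so the
caller cannot tell those cases apart.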

> To me the major confusion is from 1) and 3). 1) changes the meaning of
> _INFO and HOT_RESET, while 3) messes up the boundary.

As above, I think 2) is also an issue.

> Here is my thought:
> 
> For 1), it was proposed due to below reason[2]. We'd like to make a scenario
> that works in the group path be workable in cdev path as well. But IMHO, we
> may just accept that cdev path cannot work for such scenario to avoid subtle
> change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
> hint in HOT_RESET ioctl to tell the kernel  whether relaxed ownership check
> is expected. Maybe this is awkward. But if we want to keep it, we'd do it
> with the awareness by user.
> 
> [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/

The group association is that relaxed ownership test.  Yes, there are
corner cases where we have a dual function card with separate IOMMU
groups, where a user owning function 0 could do a bus reset because
function 1 is temporarily unused, but so what, what good is that, have
we ever had an issue raised because of this?  The user can't rely on
the unopened state of the other function.  It's an entirely
opportunistic optimization.

The much more typical scenario is that a multi-function device does not
provide isolation, all the functions are in the same group and because
of the association of the group the user has implied ownership of the
other devices for the purpose of a reset.

> For 3), it was proposed when discussing the hot reset for noiommu[3]. But
> it does not make hot reset always workable for noiommu in cdev, just in
> case dev_set is singleton. So it is more of a general optimization that can
> make the kernel skip the ownership check. But to make use of it, we may
> need to test it before sanitizing the group fds from user or the iommufd
> check. Maybe the dev_set singleton test in this series is not well placed.
> If so, I can further modify it.
> 
> [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/

As above, this seems to be some optimization related to no-iommu for
cdev because we don't have an iommufd association for the device in
no-iommu mode.  Note however that the current group interface doesn't
care about the IOMMU context of the devices.  We only need proof that
the user owns the affected groups.  So why are we bringing iommufd
context anywhere into this interface, here or the null-array interface?

It seems like the minor difference with cdev is that a) we're passing
device fds rather than group fds, and b) those device fds need to be
validated as having device access to complete the proof of ownership
relative to the group.  Otherwise we add capabilities to
DEVICE_GET_INFO to support the device fd passing model where the user
doesn't know the device group or bdf and allow the reset ioctl itself
to accept device fds (extracting the group relationship for those which
the user has configured for access).  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-06 17:53                         ` Alex Williamson
@ 2023-04-07 10:09                           ` Liu, Yi L
  2023-04-11 13:24                           ` Jason Gunthorpe
  1 sibling, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-07 10:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 1:54 AM
> 
> On Thu, 6 Apr 2023 10:02:10 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 6, 2023 7:23 AM
> > >
> > > On Wed, Apr 05, 2023 at 01:49:45PM -0600, Alex Williamson wrote:
> > >
> > > > > > QEMU can make a policy decision today because the kernel provides a
> > > > > > sufficiently reliable interface, ie. based on the set of owned groups, a
> > > > > > hot-reset is all but guaranteed to work.
> > > > >
> > > > > And we don't change that with cdev. If qemu wants to make the policy
> > > > > decision it keeps using the exact same _INFO interface to make that
> > > > > decision same it has always made.
> > > > >
> > > > > We weaken the actual reset action to only consider the security side.
> > > > >
> > > > > Applications that want this exclusive reset group policy simply must
> > > > > check it on their own. It is a reasonable API design.
> > > >
> > > > I disagree, as I've argued before, the info ioctl becomes so weak and
> > > > effectively arbitrary from a user perspective at being able to predict
> > > > whether the hot-reset ioctl works that it becomes useless, diminishing
> > > > the entire hot-reset info/execute API.
> > >
> > > reset should be strictly more permissive than INFO. If INFO predicts
> > > reset is permitted then reset should succeed.
> > >
> > > We don't change INFO so it cannot "becomes so weak"  ??
> > >
> > > We don't care about the cases where INFO says it will not succeed but
> > > reset does (temporarily) succeed.
> > >
> > > I don't get what argument you are trying to make or what you think is
> > > diminished..
> > >
> > > Again, userspace calls INFO, if info says yes then reset *always
> > > works*, exactly just like today.
> > >
> > > Userspace will call reset with a 0 length FD list and it uses a
> > > security only check that is strictly more permissive than what
> > > get_info will return. So the new check is simple in the kernel and
> > > always works in the cases we need it to work.
> > >
> > > What is getting things into trouble is insisting that RESET have
> > > additional restrictions beyond the minimum checks required for
> > > security.
> > >
> > > > > I don't view it as a loophole, it is flexibility to use the API in a
> > > > > way that is different from what qemu wants - eg an app like dpdk may
> > > > > be willing to tolerate a reset group that becomes unavailable after
> > > > > startup. Who knows, why should we force this in the kernel?
> > > >
> > > > Because look at all the problems it's causing to try to introduce these
> > > > loopholes without also introducing subtle bugs.
> > >
> > > > These problems are coming from trying to do this integrated version,
> > > not from my approach!
> > >
> > > AFAICT there was nothing wrong with my original plan of using the
> > > empty fd list for reset. What Yi has here is some mashup of what you
> > > and I both suggested.
> >
> > Hi Alex, Jason,
> >
> > could be this reason. So let me try to gather the changes of this series
> > does and the impact as far as I know.
> >
> > 1) only check the ownership of opened devices in the dev_set
> >      in HOT_RESET ioctl.
> >      - Impact: it changes the relationship between _INFO  and HOT_RESET.
> >        As " Each group must have IOMMU protection established for the
> >        ioctl to succeed." in [1], existing design actually means userspace
> >        should own all the affected groups before heading to do HOT_RESET.
> >        With the change here, the user does not need to ensure all affected
> >        groups are opened and it can do hot-reset successfully as long as the
> >        devices in the affected group are just un-opened and can be reset.
> >
> > >        [1] https://patchwork.kernel.org/project/linux-pci/patch/20130814200845.21923.64284.stgit@bling.home/
> 
> Where whether a device is opened is subject to change outside of the
> user's control.  This essentially allows the user to perform hot-resets
> of devices outside of their ownership so long as the device is not
> used elsewhere, versus the current requirement that the user own all the
> affected groups, which implies device ownership.  It's not been
> justified why this feature needs to exist, imo.
> 
> > 2) Allow passing zero-length fd array to do hot reset
> >     - Impact: this uses the iommufd as ownership check in the kernel side.
> >       It is only supposed to be used by the users that open cdev instead of
> >       users that open group. The drawback is that it cannot cover the noiommu
> >       devices as noiommu does not use iommufd at all. But it works well for
> >       most cases.
> 
> The "only supposed to be used" is problematic here, we're extending all
> the interfaces to transparently accept group and device fds, but here
> we need to make a distinction because the ioctl needs to perform one
> way for groups and another way for devices, which it currently doesn't
> do.  As above, I've not seen sufficient justification for this other
> than references to reducing complexity, but the only userspace expected
> to make use of this interface already has equivalent complexity.
> 
> > 3) Allow hot reset be successful when the dev_set is singleton
> >      - Impact: this makes sense but it seems to mess up the boundary between
> >      the group path and cdev path w.r.t. the usage of zero-length fd approach.
> >      The group path can succeed to do hot reset even if it is passing an empty
> >      fd array if the dev_set happens to be singleton.
> 
> Again, what is the justification for requiring this, it seems to be
> only a hack towards no-iommu support with cdev, which we can achieve by
> other means.  Why have we not needed this in the group model?  It
> introduces subtle loopholes, so while maybe we could, I don't see why we
> should, therefore I cannot agree with "this makes sense".
> 
> > 4) Allow passing device fd to do hot reset
> >     - Impact: this is a new way for hot reset. should have no impact.
> >
> > 5) Extend the _INFO to report devid
> >     - Impact: this changes the way user to decode the info reported back.
> >     devid and groupid are returned per the way the queried device is opened.
> >     Since it was suggested to support the scenario in which some devices
> >     are opened via cdev while some devices are opened via group. This makes
> >     us to return invalid_devid for the device that is opened via group if
> >     it is affected by the hot reset of a device that is opened via cdev.
> >
> >     This was proposed to support the future device fd passing usage which is
> >     only available in cdev path.
> 
> I think this is fundamentally flawed because of the scope of the
> dev-id.  We can only provide dev-ids for devices which belong to the
> same iommufd of the calling device, thus there are multiple instances
> where no dev-id can be provided.  The group-id and bdf are static
> properties of the devices, regardless of their ownership.  The bdf
> provides the specific device level association while the group-id
> indicates implied, static ownership.
> 
> > To me the major confusion is from 1) and 3). 1) changes the meaning of
> > _INFO and HOT_RESET, while 3) messes up the boundary.
> 
> As above, I think 2) is also an issue.
> 
> > Here is my thought:
> >
> > For 1), it was proposed due to below reason[2]. We'd like to make a scenario
> > that works in the group path be workable in cdev path as well. But IMHO, we
> > may just accept that cdev path cannot work for such scenario to avoid subtle
> > change to uapi. Otherwise, we need to have another HOT_RESET ioctl or a
> > hint in HOT_RESET ioctl to tell the kernel  whether relaxed ownership check
> > is expected. Maybe this is awkward. But if we want to keep it, we'd do it
> > with the awareness by user.
> >
> > [2] https://lore.kernel.org/kvm/Y%2FdobS6gdSkxnPH7@nvidia.com/
> 
> The group association is that relaxed ownership test.  Yes, there are
> corner cases where we have a dual function card with separate IOMMU
> groups, where a user owning function 0 could do a bus reset because
> function 1 is temporarily unused, but so what, what good is that, have
> we ever had an issue raised because of this?  The user can't rely on
> the unopened state of the other function.  It's an entirely
> opportunistic optimization.
> 
> The much more typical scenario is that a multi-function device does not
> provide isolation, all the functions are in the same group and because
> of the association of the group the user has implied ownership of the
> other devices for the purpose of a reset.
> 
> > For 3), it was proposed when discussing the hot reset for noiommu[3]. But
> > it does not make hot reset always workable for noiommu in cdev, just in
> > case dev_set is singleton. So it is more of a general optimization that can
> > make the kernel skip the ownership check. But to make use of it, we may
> > need to test it before sanitizing the group fds from user or the iommufd
> > check. Maybe the dev_set singleton test in this series is not well placed.
> > If so, I can further modify it.
> >
> > [3] https://lore.kernel.org/kvm/ZACX+Np%2FIY7ygqL5@nvidia.com/
> 
> As above, this seems to be some optimization related to no-iommu for
> cdev because we don't have an iommufd association for the device in
> no-iommu mode.  Note however that the current group interface doesn't
> care about the IOMMU context of the devices.  We only need proof that
> the user owns the affected groups.  So why are we bringing iommufd
> context anywhere into this interface, here or the null-array interface?
> 
> It seems like the minor difference with cdev is that a) we're passing
> device fds rather than group fds, and b) those device fds need to be
> validated as having device access to complete the proof of ownership
> relative to the group.  Otherwise we add capabilities to
> DEVICE_GET_INFO to support the device fd passing model where the user
> doesn't know the device group or bdf and allow the reset ioctl itself
> to accept device fds (extracting the group relationship for those which
> the user has configured for access).  Thanks,

So your suggestion is to drop 1), 2), 3) and 5), keep 4), and add a new
bdf/group capability to DEVICE_GET_INFO to retrieve the group_id and bdf.
That way the existing _INFO ioctl can be reused without any change. Is that
right?
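
For illustration only, here is a userspace-side sketch of how the reset set
reported by the existing _INFO ioctl could be matched against group_id/bdf
pairs learned per opened device. The dep_dev layout mirrors struct
vfio_pci_dependent_device from <linux/vfio.h>; struct owned_dev and
covers_reset_set() are hypothetical names, and the capability that would
populate owned_dev does not exist yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct vfio_pci_dependent_device in <linux/vfio.h>. */
struct dep_dev {
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* Hypothetical: what userspace would record for each device fd it opened. */
struct owned_dev {
	uint32_t group_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

static bool same_bdf(const struct dep_dev *d, const struct owned_dev *o)
{
	return d->segment == o->segment && d->bus == o->bus &&
	       d->devfn == o->devfn;
}

/*
 * Return true if every device in the reset set is either one we opened
 * (bdf match) or shares a group with one we opened (implied ownership).
 */
static bool covers_reset_set(const struct dep_dev *deps, size_t ndeps,
			     const struct owned_dev *owned, size_t nowned)
{
	assert(deps || !ndeps);
	assert(owned || !nowned);

	for (size_t i = 0; i < ndeps; i++) {
		bool covered = false;

		for (size_t j = 0; j < nowned && !covered; j++)
			covered = same_bdf(&deps[i], &owned[j]) ||
				  deps[i].group_id == owned[j].group_id;
		if (!covered)
			return false;
	}
	return true;
}
```

With this model, a user that opened one function of a group covers the other
functions in that group, but not an affected device in a different group.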

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-03 15:01     ` Alex Williamson
  2023-04-03 15:22       ` Liu, Yi L
@ 2023-04-07 10:09       ` Liu, Yi L
  2023-04-07 12:03         ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-07 10:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 3, 2023 11:02 PM
> 
> On Mon, 3 Apr 2023 09:25:06 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Saturday, April 1, 2023 10:44 PM
> >
> > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> *data)
> > >  	if (!iommu_group)
> > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > Hi Alex,
> >
> > Is disabling iommu a sane way to test vfio noiommu mode?
> 
> Yes
> 
> > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > group check. Reason is that this happens to have some affected devices, and
> > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > hence nobody allocates noiommu group for them). So when hot reset info loops
> > such devices, it failed with -EPERM. Is this expected?
> 
> Hmm, I didn't recall that we put in such a limitation, but given the
> minimally intrusive approach to no-iommu and the fact that we never
> defined an invalid group ID to return to the user, it makes sense that
> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> for no-iommu cdev.

I just realized a further issue related to this limitation. Remember that we
may eventually compile out the vfio group infrastructure. Say I want to test
noiommu, so I boot such a kernel with the iommu disabled. I think the _INFO
ioctl would fail as there is no iommu_group. Does that mean we will not
support hot reset for noiommu in the future if the vfio group infrastructure
is compiled out?

As discussed in another thread, we are going to add a new bdf/group
capability to DEVICE_GET_INFO. If the above kernel is booted, shall we
exclude the new bdf/group capability, or add a flag in the capability to mark
the group_id as invalid?

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 10:09       ` Liu, Yi L
@ 2023-04-07 12:03         ` Alex Williamson
  2023-04-07 13:24           ` Liu, Yi L
  2023-04-11  6:16           ` Liu, Yi L
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-07 12:03 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Fri, 7 Apr 2023 10:09:58 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> Hi Alex,
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 3, 2023 11:02 PM
> > 
> > On Mon, 3 Apr 2023 09:25:06 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Saturday, April 1, 2023 10:44 PM  
> > >  
> > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > *data)  
> > > >  	if (!iommu_group)
> > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > >
> > > Hi Alex,
> > >
> > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > 
> > Yes
> >   
> > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > group check. Reason is that this happens to have some affected devices, and
> > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > such devices, it failed with -EPERM. Is this expected?  
> > 
> > Hmm, I didn't recall that we put in such a limitation, but given the
> > minimally intrusive approach to no-iommu and the fact that we never
> > defined an invalid group ID to return to the user, it makes sense that
> > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > for no-iommu cdev.  
> 
> I just realize a further issue related to this limitation. Remember that we
> may finally compile out the vfio group infrastructure in the future. Say I
> want to test noiommu, I may boot such a kernel with iommu disabled. I think
> the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> not support hot reset for noiommu in future if vfio group infrastructure is
> compiled out?

We're talking about IOMMU groups, IOMMU groups are always present
regardless of whether we expose a vfio group interface to userspace.
Remember, we create IOMMU groups even in the no-iommu case.  Even with
pure cdev, there are underlying IOMMU groups that maintain the DMA
ownership.

> As another thread, we are going to add a new bdf/group capability to
> DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> bdf/group capability or add a flag in the capability to mark the group_id
> is invalid?

As above, there's always an IOMMU group, it's never invalid.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 12:03         ` Alex Williamson
@ 2023-04-07 13:24           ` Liu, Yi L
  2023-04-07 13:51             ` Alex Williamson
  2023-04-11  6:16           ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-07 13:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 8:04 PM
> 
> > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> > > *data)
> > > > >  	if (!iommu_group)
> > > > >  		return -EPERM; /* Cannot reset non-isolated devices */

[1]

> > > >
> > > > Hi Alex,
> > > >
> > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > >
> > > Yes
> > >
> > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > group check. Reason is that this happens to have some affected devices, and
> > > > these devices have no valid iommu_group (because they are not bound to vfio-
> pci
> > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > such devices, it failed with -EPERM. Is this expected?
> > >
> > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > minimally intrusive approach to no-iommu and the fact that we never
> > > defined an invalid group ID to return to the user, it makes sense that
> > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > for no-iommu cdev.
> >
> > I just realize a further issue related to this limitation. Remember that we
> > may finally compile out the vfio group infrastructure in the future. Say I
> > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > not support hot reset for noiommu in future if vfio group infrastructure is
> > compiled out?
> 
> We're talking about IOMMU groups, IOMMU groups are always present
> regardless of whether we expose a vfio group interface to userspace.
> Remember, we create IOMMU groups even in the no-iommu case.  Even with
> pure cdev, there are underlying IOMMU groups that maintain the DMA
> ownership.

Hmm. As shown at [1], when the iommu is disabled there is no iommu_group for a
given device unless it is registered with VFIO, in which case a fake group is
created. That's why I hit the limitation at [1]. When vfio_group is compiled
out, even the fake group goes away.
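
To make the failure concrete, here is a toy userspace model of the check at
[1]. It is not the kernel code: model_dev and model_fill_devs() are made-up
names, and the only assumption is the one described above, that with the
iommu disabled a device has a group only once it is registered with vfio.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for a PCI device affected by the reset. */
struct model_dev {
	const char *bdf;
	int bound_to_vfio;	/* registered with vfio, so a fake group exists */
};

/*
 * Model of the per-device check quoted at [1]: walk every affected device
 * and fail the whole query as soon as one of them has no group.
 */
static int model_fill_devs(const struct model_dev *devs, size_t n,
			   size_t *filled)
{
	assert(devs || !n);
	assert(filled);

	*filled = 0;
	for (size_t i = 0; i < n; i++) {
		if (!devs[i].bound_to_vfio)
			return -EPERM;	/* no group for this device */
		(*filled)++;
	}
	return 0;
}
```

One affected device that is not bound to vfio-pci is enough to make the whole
query return -EPERM, which matches what I observed.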

>
> > As another thread, we are going to add a new bdf/group capability to
> > DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> > bdf/group capability or add a flag in the capability to mark the group_id
> > is invalid?
> 
> As above, there's always an IOMMU group, it's never invalid.  Thanks,

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 13:24           ` Liu, Yi L
@ 2023-04-07 13:51             ` Alex Williamson
  2023-04-07 14:04               ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-07 13:51 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Fri, 7 Apr 2023 13:24:25 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 7, 2023 8:04 PM
> >   
> > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void  
> > > > *data)  
> > > > > >  	if (!iommu_group)
> > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> 
> [1]
> 
> > > > >
> > > > > Hi Alex,
> > > > >
> > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > >
> > > > Yes
> > > >  
> > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > these devices have no valid iommu_group (because they are not bound to vfio-  
> > pci  
> > > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > > such devices, it failed with -EPERM. Is this expected?  
> > > >
> > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > defined an invalid group ID to return to the user, it makes sense that
> > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > for no-iommu cdev.  
> > >
> > > I just realize a further issue related to this limitation. Remember that we
> > > may finally compile out the vfio group infrastructure in the future. Say I
> > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > compiled out?  
> > 
> > We're talking about IOMMU groups, IOMMU groups are always present
> > regardless of whether we expose a vfio group interface to userspace.
> > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > ownership.  
> 
> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> given device unless it is registered to VFIO, which a fake group is created.
> That's why I hit the limitation [1]. When vfio_group is compiled out, then
> even fake group goes away.

In the vfio group case, [1] can be hit with no-iommu only when there
are affected devices which are not bound to vfio.  Why are we not
allocating an IOMMU group to no-iommu devices when vfio group is
disabled?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 13:51             ` Alex Williamson
@ 2023-04-07 14:04               ` Liu, Yi L
  2023-04-07 15:14                 ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-07 14:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 9:52 PM
> 
> On Fri, 7 Apr 2023 13:24:25 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 7, 2023 8:04 PM
> > >
> > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev,
> void
> > > > > *data)
> > > > > > >  	if (!iommu_group)
> > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> >
> > [1]
> >
> > > > > >
> > > > > > Hi Alex,
> > > > > >
> > > > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > > > >
> > > > > Yes
> > > > >
> > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.
> Bind
> > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > these devices have no valid iommu_group (because they are not bound to
> vfio-
> > > pci
> > > > > > hence nobody allocates noiommu group for them). So when hot reset info
> loops
> > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > >
> > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > for no-iommu cdev.
> > > >
> > > > I just realize a further issue related to this limitation. Remember that we
> > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > compiled out?
> > >
> > > We're talking about IOMMU groups, IOMMU groups are always present
> > > regardless of whether we expose a vfio group interface to userspace.
> > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > ownership.
> >
> > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > given device unless it is registered to VFIO, which a fake group is created.
> > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > even fake group goes away.
> 
> In the vfio group case, [1] can be hit with no-iommu only when there
> are affected devices which are not bound to vfio.

Yes, because vfio allocates a fake group when a device is registered with
it.

> Why are we not
> allocating an IOMMU group to no-iommu devices when vfio group is
> disabled?  Thanks,

Hmm. When the vfio group code is configured out (CONFIG_VFIO_GROUP=n),
vfio_device_set_group() just returns 0 once the patch below is
applied. So when there is no vfio group, the fake group also
goes away.

https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 14:04               ` Liu, Yi L
@ 2023-04-07 15:14                 ` Alex Williamson
  2023-04-07 15:47                   ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-07 15:14 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Fri, 7 Apr 2023 14:04:02 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 7, 2023 9:52 PM
> > 
> > On Fri, 7 Apr 2023 13:24:25 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 7, 2023 8:04 PM
> > > >  
> > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev,  
> > void  
> > > > > > *data)  
> > > > > > > >  	if (!iommu_group)
> > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > >
> > > [1]
> > >  
> > > > > > >
> > > > > > > Hi Alex,
> > > > > > >
> > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > > > >
> > > > > > Yes
> > > > > >  
> > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.  
> > Bind  
> > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > > these devices have no valid iommu_group (because they are not bound to  
> > vfio-  
> > > > pci  
> > > > > > > hence nobody allocates noiommu group for them). So when hot reset info  
> > loops  
> > > > > > > such devices, it failed with -EPERM. Is this expected?  
> > > > > >
> > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > for no-iommu cdev.  
> > > > >
> > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > compiled out?  
> > > >
> > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > regardless of whether we expose a vfio group interface to userspace.
> > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > ownership.  
> > >
> > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > given device unless it is registered to VFIO, which a fake group is created.
> > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > even fake group goes away.  
> > 
> > In the vfio group case, [1] can be hit with no-iommu only when there
> > are affected devices which are not bound to vfio.  
> 
> yes. because vfio would allocate fake group when device is registered to
> it.
> 
> > Why are we not
> > allocating an IOMMU group to no-iommu devices when vfio group is
> > disabled?  Thanks,  
> 
> hmmm. when the vfio group code is configured out. The
> vfio_device_set_group() just returns 0 after below patch is
> applied and CONFIG_VFIO_GROUP=n. So when there is no
> vfio group, the fake group also goes away.
> 
> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Is this a fundamental issue or just a problem with the current
implementation proposal?  It seems like the latter.  FWIW, I also don't
see a taint happening in the cdev path for no-iommu use.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 15:14                 ` Alex Williamson
@ 2023-04-07 15:47                   ` Liu, Yi L
  2023-04-07 21:07                     ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-07 15:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 11:14 PM
> 
> On Fri, 7 Apr 2023 14:04:02 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 7, 2023 9:52 PM
> > >
> > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > >
> > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev
> *pdev,
> > > void
> > > > > > > *data)
> > > > > > > > >  	if (!iommu_group)
> > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > > >
> > > > [1]
> > > >
> > > > > > > >
> > > > > > > > Hi Alex,
> > > > > > > >
> > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > > > > > >
> > > > > > > Yes
> > > > > > >
> > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-
> pci.
> > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.
> > > Bind
> > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the
> above
> > > > > > > > group check. Reason is that this happens to have some affected devices,
> and
> > > > > > > > these devices have no valid iommu_group (because they are not bound to
> > > vfio-
> > > > > pci
> > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info
> > > loops
> > > > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > > > >
> > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > for no-iommu cdev.
> > > > > >
> > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > compiled out?
> > > > >
> > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > ownership.
> > > >
> > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > even fake group goes away.
> > >
> > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > are affected devices which are not bound to vfio.
> >
> > yes. because vfio would allocate fake group when device is registered to
> > it.
> >
> > > Why are we not
> > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > disabled?  Thanks,
> >
> > hmmm. when the vfio group code is configured out. The
> > vfio_device_set_group() just returns 0 after below patch is
> > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > vfio group, the fake group also goes away.
> >
> > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> 
> Is this a fundamental issue or just a problem with the current
> implementation proposal?  It seems like the latter.  FWIW, I also don't
> see a taint happening in the cdev path for no-iommu use.  Thanks,

Yes, the latter. The reason I raised it here is to confirm the policy for
the new group/bdf capability in DEVICE_GET_INFO. If there is no iommu
group, perhaps I only need to exclude the new group/bdf capability from
the cap chain of DEVICE_GET_INFO. Is that right?
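
As a sketch of what "excluded from the cap chain" would look like to
userspace: the walker below follows the chain and reports the capability as
absent when no header carries its ID. The cap_header layout mirrors struct
vfio_info_cap_header from <linux/vfio.h>; HYPOTHETICAL_CAP_GROUP_BDF and
find_cap() are names made up for this example, and the ID value is not a real
uAPI constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct vfio_info_cap_header in <linux/vfio.h>. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next cap from buffer start, 0 = end */
};

/* Hypothetical ID for the proposed group/bdf capability. */
#define HYPOTHETICAL_CAP_GROUP_BDF 0x7fff

/*
 * Walk the capability chain starting at offset 'first' inside the info
 * buffer. Return the offset of the capability with the given id, or -1 if
 * it is absent or the chain is malformed.
 */
static long find_cap(const uint8_t *buf, size_t len, uint32_t first,
		     uint16_t id)
{
	uint32_t off = first;
	size_t hops = 0;

	assert(buf);
	/* The hop limit guards against a chain that loops back on itself. */
	while (off != 0 && hops++ < len) {
		struct cap_header hdr;

		if (off > len || len - off < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return (long)off;
		off = hdr.next;
	}
	return -1;
}
```

A caller would treat a -1 result as "no group/bdf information for this
device" and handle that case explicitly.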

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 15:47                   ` Liu, Yi L
@ 2023-04-07 21:07                     ` Alex Williamson
  2023-04-08  5:07                       ` Liu, Yi L
  2023-04-11 13:33                       ` Jason Gunthorpe
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-07 21:07 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Fri, 7 Apr 2023 15:47:10 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 7, 2023 11:14 PM
> > 
> > On Fri, 7 Apr 2023 14:04:02 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 7, 2023 9:52 PM
> > > >
> > > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > > >  
> > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev  
> > *pdev,  
> > > > void  
> > > > > > > > *data)  
> > > > > > > > > >  	if (!iommu_group)
> > > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */  
> > > > >
> > > > > [1]
> > > > >  
> > > > > > > > >
> > > > > > > > > Hi Alex,
> > > > > > > > >
> > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > > > > > >
> > > > > > > > Yes
> > > > > > > >  
> > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-  
> > pci.  
> > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0.  
> > > > Bind  
> > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the  
> > above  
> > > > > > > > > group check. Reason is that this happens to have some affected devices,  
> > and  
> > > > > > > > > these devices have no valid iommu_group (because they are not bound to  
> > > > vfio-  
> > > > > > pci  
> > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info  
> > > > loops  
> > > > > > > > > such devices, it failed with -EPERM. Is this expected?  
> > > > > > > >
> > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > > for no-iommu cdev.  
> > > > > > >
> > > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > > compiled out?  
> > > > > >
> > > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > > ownership.  
> > > > >
> > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > > even fake group goes away.  
> > > >
> > > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > > are affected devices which are not bound to vfio.  
> > >
> > > yes. because vfio would allocate fake group when device is registered to
> > > it.
> > >  
> > > > Why are we not
> > > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > > disabled?  Thanks,  
> > >
> > > hmmm. when the vfio group code is configured out. The
> > > vfio_device_set_group() just returns 0 after below patch is
> > > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > > vfio group, the fake group also goes away.
> > >
> > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/  
> > 
> > Is this a fundamental issue or just a problem with the current
> > implementation proposal?  It seems like the latter.  FWIW, I also don't
> > see a taint happening in the cdev path for no-iommu use.  Thanks,  
> 
> yes. the latter case. The reason I raised it here is to confirm the
> policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> there is no iommu group, perhaps I only need to exclude the new
> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?

I think we need to revisit the question of why allocating an IOMMU
group for a no-iommu device is exclusive to the vfio group support.
We've already been down the path of trying to report a field that only
exists for devices with certain properties with dev-id.  It doesn't
work well.  I think we've said all along that while the cdev interface
is device based, there are still going to be underlying IOMMU groups
for the user to be aware of, they're just not as much a fundamental
part of the interface.  There should not be a case where a device
doesn't have a group to report.  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 21:07                     ` Alex Williamson
@ 2023-04-08  5:07                       ` Liu, Yi L
  2023-04-08 14:20                         ` Alex Williamson
  2023-04-11 13:33                       ` Jason Gunthorpe
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-08  5:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 8, 2023 5:07 AM
> 
> On Fri, 7 Apr 2023 15:47:10 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 7, 2023 11:14 PM
> > >
> > > On Fri, 7 Apr 2023 14:04:02 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, April 7, 2023 9:52 PM
> > > > >
> > > > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > >
> > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > > > >
> > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> > > > > > > > > > >  	if (!iommu_group)
> > > > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Alex,
> > > > > > > > > >
> > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > > > > > > > >
> > > > > > > > > Yes
> > > > > > > > >
> > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > > > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > > > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > > > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > > > > > >
> > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > > > for no-iommu cdev.
> > > > > > > >
> > > > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > > > compiled out?
> > > > > > >
> > > > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > > > ownership.
> > > > > >
> > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > > > even fake group goes away.
> > > > >
> > > > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > > > are affected devices which are not bound to vfio.
> > > >
> > > > yes. because vfio would allocate fake group when device is registered to
> > > > it.
> > > >
> > > > > Why are we not
> > > > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > > > disabled?  Thanks,
> > > >
> > > > hmmm. when the vfio group code is configured out. The
> > > > vfio_device_set_group() just returns 0 after below patch is
> > > > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > > > vfio group, the fake group also goes away.
> > > >
> > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > >
> > > Is this a fundamental issue or just a problem with the current
> > > implementation proposal?  It seems like the latter.  FWIW, I also don't
> > > see a taint happening in the cdev path for no-iommu use.  Thanks,
> >
> > yes. the latter case. The reason I raised it here is to confirm the
> > policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> > there is no iommu group, perhaps I only need to exclude the new
> > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?
> 
> I think we need to revisit the question of why allocating an IOMMU
> group for a no-iommu device is exclusive to the vfio group support.

For a no-iommu device, the iommu group is a fake group allocated by vfio,
isn't it? And the fake group allocation is part of the vfio group code: it
is done by vfio_device_set_group() in group.c. If the vfio group code is not
compiled in, vfio does not allocate fake groups. Details of this compile-time
option can be found in link [1].
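
As a rough illustration of the dependency described above, here is a
standalone sketch. It is not the code from [1]: every sketch_* name and
the SKETCH_VFIO_GROUP macro are hypothetical stand-ins for the kernel's
vfio_device_set_group(), CONFIG_VFIO_GROUP and the !iommu_group check in
vfio_pci_fill_devs() quoted earlier.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's iommu_group and vfio_device. */
struct sketch_group { int id; };
struct sketch_device {
	struct sketch_group *group;	/* NULL: no group to report */
};

/* Build with -DSKETCH_VFIO_GROUP to model CONFIG_VFIO_GROUP=y. */
#ifdef SKETCH_VFIO_GROUP
static struct sketch_group sketch_noiommu_group = { .id = 0 };

/* Group code present: a no-iommu device gets a fake group. */
static int sketch_device_set_group(struct sketch_device *dev)
{
	dev->group = &sketch_noiommu_group;
	return 0;
}
#else
/* Group code compiled out: the stub succeeds but allocates nothing. */
static int sketch_device_set_group(struct sketch_device *dev)
{
	(void)dev;
	return 0;
}
#endif

/* Models the group check: a device without a group is rejected. */
static int sketch_fill_dev(const struct sketch_device *dev)
{
	assert(dev != NULL);
	if (!dev->group)
		return -EPERM;
	return 0;
}
```

With the macro left undefined, registration "succeeds" yet the later
check still fails with -EPERM, which is the situation being discussed.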

> We've already been down the path of trying to report a field that only
> exists for devices with certain properties with dev-id.  It doesn't
> work well.  I think we've said all along that while the cdev interface
> is device based, there are still going to be underlying IOMMU groups
> for the user to be aware of, they're just not as much a fundamental
> part of the interface.  There should not be a case where a device
> doesn't have a group to report.  Thanks,

Since the patch in link [1] makes the vfio group optional, if one compiles a
kernel with CONFIG_VFIO_GROUP=n and boots it with the iommu disabled, there is
no group to report. Perhaps this is not a typical usage, but it is still a sane
usage for noiommu mode, as I confirmed with you in this thread. So when that
happens, we need to consider what to report for the group field.

Perhaps I messed up the discussion by referring to a patch that is part of
another series. But I think it should be considered when talking about the
group to be reported.

[1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-08  5:07                       ` Liu, Yi L
@ 2023-04-08 14:20                         ` Alex Williamson
  2023-04-09 11:58                           ` Yi Liu
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-08 14:20 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Sat, 8 Apr 2023 05:07:16 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 8, 2023 5:07 AM
> > 
> > On Fri, 7 Apr 2023 15:47:10 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 7, 2023 11:14 PM
> > > >
> > > > On Fri, 7 Apr 2023 14:04:02 +0000
> > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Friday, April 7, 2023 9:52 PM
> > > > > >
> > > > > > On Fri, 7 Apr 2023 13:24:25 +0000
> > > > > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > > > > >  
> > > > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > > > Sent: Friday, April 7, 2023 8:04 PM
> > > > > > > >  
> > > > > > > > > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> > > > > > > > > > > >  	if (!iommu_group)
> > > > > > > > > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > > > > > >
> > > > > > > [1]
> > > > > > >  
> > > > > > > > > > >
> > > > > > > > > > > Hi Alex,
> > > > > > > > > > >
> > > > > > > > > > > Is disabling iommu a sane way to test vfio noiommu mode?  
> > > > > > > > > >
> > > > > > > > > > Yes
> > > > > > > > > >  
> > > > > > > > > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > > > > > > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > > > > > > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > > > > > > > > group check. Reason is that this happens to have some affected devices, and
> > > > > > > > > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > > > > > > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > > > > > > > > such devices, it failed with -EPERM. Is this expected?
> > > > > > > > > >
> > > > > > > > > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > > > > > > > > minimally intrusive approach to no-iommu and the fact that we never
> > > > > > > > > > defined an invalid group ID to return to the user, it makes sense that
> > > > > > > > > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > > > > > > > > for no-iommu cdev.  
> > > > > > > > >
> > > > > > > > > I just realize a further issue related to this limitation. Remember that we
> > > > > > > > > may finally compile out the vfio group infrastructure in the future. Say I
> > > > > > > > > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > > > > > > > > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > > > > > > > > not support hot reset for noiommu in future if vfio group infrastructure is
> > > > > > > > > compiled out?  
> > > > > > > >
> > > > > > > > We're talking about IOMMU groups, IOMMU groups are always present
> > > > > > > > regardless of whether we expose a vfio group interface to userspace.
> > > > > > > > Remember, we create IOMMU groups even in the no-iommu case.  Even with
> > > > > > > > pure cdev, there are underlying IOMMU groups that maintain the DMA
> > > > > > > > ownership.  
> > > > > > >
> > > > > > > hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> > > > > > > given device unless it is registered to VFIO, which a fake group is created.
> > > > > > > That's why I hit the limitation [1]. When vfio_group is compiled out, then
> > > > > > > even fake group goes away.  
> > > > > >
> > > > > > In the vfio group case, [1] can be hit with no-iommu only when there
> > > > > > are affected devices which are not bound to vfio.  
> > > > >
> > > > > yes. because vfio would allocate fake group when device is registered to
> > > > > it.
> > > > >  
> > > > > > Why are we not
> > > > > > allocating an IOMMU group to no-iommu devices when vfio group is
> > > > > > disabled?  Thanks,  
> > > > >
> > > > > hmmm. when the vfio group code is configured out. The
> > > > > vfio_device_set_group() just returns 0 after below patch is
> > > > > applied and CONFIG_VFIO_GROUP=n. So when there is no
> > > > > vfio group, the fake group also goes away.
> > > > >
> > > > > https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/  
> > > >
> > > > Is this a fundamental issue or just a problem with the current
> > > > implementation proposal?  It seems like the latter.  FWIW, I also don't
> > > > see a taint happening in the cdev path for no-iommu use.  Thanks,  
> > >
> > > yes. the latter case. The reason I raised it here is to confirm the
> > > policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> > > there is no iommu group, perhaps I only need to exclude the new
> > > group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?  
> > 
> > I think we need to revisit the question of why allocating an IOMMU
> > group for a no-iommu device is exclusive to the vfio group support.  
> 
> For no-iommu device, the iommu group is a fake group allocated by vfio.
> is it? And the fake group allocation is part of the vfio group code.
> It is the vfio_device_set_group() in group.c. If vfio group code is not
> compiled in, vfio does not allocate fake groups. Detail for this compiling
> can be found in link [1].
> 
> > We've already been down the path of trying to report a field that only
> > exists for devices with certain properties with dev-id.  It doesn't
> > work well.  I think we've said all along that while the cdev interface
> > is device based, there are still going to be underlying IOMMU groups
> > for the user to be aware of, they're just not as much a fundamental
> > part of the interface.  There should not be a case where a device
> > doesn't have a group to report.  Thanks,  
> 
> As the patch in link [1] makes vfio group optional, so if compile a kernel
> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no
> group to report. Perhaps this is not a typical usage but still a sane usage
> for noiommu mode as I confirmed with you in this thread. So when it comes,
> needs to consider what to report for the group field.
> 
> Perhaps I messed up the discussion by referring to a patch that is part of
> another series. But I think it should be considered when talking about the
> group to be reported.

The question is whether the split that group.c code handles both the
vfio group AND creation of the IOMMU group in such cases is the correct
split.  I'm not arguing that the way the code is currently laid out has
the fake IOMMU group for no-iommu devices created in vfio group
specific code, but we have a common interface that makes use of IOMMU
group information for which we don't have an equivalent alternative
data field to report.

We've shown that dev-id doesn't work here because dev-ids only exist
for devices within the user's IOMMU context.  Also reporting an invalid
ID of any sort fails to indicate the potential implied ownership.
Therefore I recognize that if this interface is to report an IOMMU
group, then the creation of fake IOMMU groups existing only in vfio
group code would need to be refactored.  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-08 14:20                         ` Alex Williamson
@ 2023-04-09 11:58                           ` Yi Liu
  2023-04-09 13:29                             ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Yi Liu @ 2023-04-09 11:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On 2023/4/8 22:20, Alex Williamson wrote:
> On Sat, 8 Apr 2023 05:07:16 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
>>> From: Alex Williamson <alex.williamson@redhat.com>
>>> Sent: Saturday, April 8, 2023 5:07 AM
>>>
>>> On Fri, 7 Apr 2023 15:47:10 +0000
>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>    
>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>> Sent: Friday, April 7, 2023 11:14 PM
>>>>>
>>>>> On Fri, 7 Apr 2023 14:04:02 +0000
>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>>>   
>>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>>>> Sent: Friday, April 7, 2023 9:52 PM
>>>>>>>
>>>>>>> On Fri, 7 Apr 2023 13:24:25 +0000
>>>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
>>>>>>>   
>>>>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
>>>>>>>>> Sent: Friday, April 7, 2023 8:04 PM
>>>>>>>>>   
>>>>>>>>>>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
>>>>>>>>>>>>>   	if (!iommu_group)
>>>>>>>>>>>>>   		return -EPERM; /* Cannot reset non-isolated devices */
>>>>>>>>
>>>>>>>> [1]
>>>>>>>>   
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>>
>>>>>>>>>>>> Is disabling iommu a sane way to test vfio noiommu mode?
>>>>>>>>>>>
>>>>>>>>>>> Yes
>>>>>>>>>>>   
>>>>>>>>>>>> I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
>>>>>>>>>>>> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
>>>>>>>>>>>> iommufd==-1 can succeed, but failed to get hot reset info due to the above
>>>>>>>>>>>> group check. Reason is that this happens to have some affected devices, and
>>>>>>>>>>>> these devices have no valid iommu_group (because they are not bound to vfio-pci
>>>>>>>>>>>> hence nobody allocates noiommu group for them). So when hot reset info loops
>>>>>>>>>>>> such devices, it failed with -EPERM. Is this expected?
>>>>>>>>>>>
>>>>>>>>>>> Hmm, I didn't recall that we put in such a limitation, but given the
>>>>>>>>>>> minimally intrusive approach to no-iommu and the fact that we never
>>>>>>>>>>> defined an invalid group ID to return to the user, it makes sense that
>>>>>>>>>>> we just blocked the ioctl for no-iommu use.  I guess we can do the same
>>>>>>>>>>> for no-iommu cdev.
>>>>>>>>>>
>>>>>>>>>> I just realize a further issue related to this limitation. Remember that we
>>>>>>>>>> may finally compile out the vfio group infrastructure in the future. Say I
>>>>>>>>>> want to test noiommu, I may boot such a kernel with iommu disabled. I think
>>>>>>>>>> the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
>>>>>>>>>> not support hot reset for noiommu in future if vfio group infrastructure is
>>>>>>>>>> compiled out?
>>>>>>>>>
>>>>>>>>> We're talking about IOMMU groups, IOMMU groups are always present
>>>>>>>>> regardless of whether we expose a vfio group interface to userspace.
>>>>>>>>> Remember, we create IOMMU groups even in the no-iommu case.  Even with
>>>>>>>>> pure cdev, there are underlying IOMMU groups that maintain the DMA
>>>>>>>>> ownership.
>>>>>>>>
>>>>>>>> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
>>>>>>>> given device unless it is registered to VFIO, which a fake group is created.
>>>>>>>> That's why I hit the limitation [1]. When vfio_group is compiled out, then
>>>>>>>> even fake group goes away.
>>>>>>>
>>>>>>> In the vfio group case, [1] can be hit with no-iommu only when there
>>>>>>> are affected devices which are not bound to vfio.
>>>>>>
>>>>>> yes. because vfio would allocate fake group when device is registered to
>>>>>> it.
>>>>>>   
>>>>>>> Why are we not
>>>>>>> allocating an IOMMU group to no-iommu devices when vfio group is
>>>>>>> disabled?  Thanks,
>>>>>>
>>>>>> hmmm. when the vfio group code is configured out. The
>>>>>> vfio_device_set_group() just returns 0 after below patch is
>>>>>> applied and CONFIG_VFIO_GROUP=n. So when there is no
>>>>>> vfio group, the fake group also goes away.
>>>>>>
>>>>>> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
>>>>>
>>>>> Is this a fundamental issue or just a problem with the current
>>>>> implementation proposal?  It seems like the latter.  FWIW, I also don't
>>>>> see a taint happening in the cdev path for no-iommu use.  Thanks,
>>>>
>>>> yes. the latter case. The reason I raised it here is to confirm the
>>>> policy on the new group/bdf capability in the DEVICE_GET_INFO. If
>>>> there is no iommu group, perhaps I only need to exclude the new
>>>> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?
>>>
>>> I think we need to revisit the question of why allocating an IOMMU
>>> group for a no-iommu device is exclusive to the vfio group support.
>>
>> For no-iommu device, the iommu group is a fake group allocated by vfio.
>> is it? And the fake group allocation is part of the vfio group code.
>> It is the vfio_device_set_group() in group.c. If vfio group code is not
>> compiled in, vfio does not allocate fake groups. Detail for this compiling
>> can be found in link [1].
>>
>>> We've already been down the path of trying to report a field that only
>>> exists for devices with certain properties with dev-id.  It doesn't
>>> work well.  I think we've said all along that while the cdev interface
>>> is device based, there are still going to be underlying IOMMU groups
>>> for the user to be aware of, they're just not as much a fundamental
>>> part of the interface.  There should not be a case where a device
>>> doesn't have a group to report.  Thanks,
>>
>> As the patch in link [1] makes vfio group optional, so if compile a kernel
>> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no
>> group to report. Perhaps this is not a typical usage but still a sane usage
>> for noiommu mode as I confirmed with you in this thread. So when it comes,
>> needs to consider what to report for the group field.
>>
>> Perhaps I messed up the discussion by referring to a patch that is part of
>> another series. But I think it should be considered when talking about the
>> group to be reported.
> 
> The question is whether the split that group.c code handles both the
> vfio group AND creation of the IOMMU group in such cases is the correct
> split.  I'm not arguing that the way the code is currently laid out has
> the fake IOMMU group for no-iommu devices created in vfio group
> specific code, but we have a common interface that makes use of IOMMU
> group information for which we don't have an equivalent alternative
> data field to report.

Yes. This is needed to ensure _HOT_RESET_INFO works for noiommu devices.

> We've shown that dev-id doesn't work here because dev-ids only exist
> for devices within the user's IOMMU context.  Also reporting an invalid
> ID of any sort fails to indicate the potential implied ownership.
> Therefore I recognize that if this interface is to report an IOMMU
> group, then the creation of fake IOMMU groups existing only in vfio
> group code would need to be refactored.  Thanks,

Yeah, we need to move the iommu group creation back to vfio_main.c. This
would be a prerequisite for [1].

[1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/

I'll also try out your suggestion to add a capability like below and link
it in the vfio_device_info cap chain.

#define VFIO_DEVICE_INFO_CAP_PCI_BDF          5

struct vfio_device_info_cap_pci_bdf {
         struct vfio_info_cap_header header;
         __u32   group_id;
         __u16   segment;
         __u8    bus;
         __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
};

-- 
Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-09 11:58                           ` Yi Liu
@ 2023-04-09 13:29                             ` Alex Williamson
  2023-04-10  8:48                               ` Liu, Yi L
  2023-04-11 13:34                               ` Jason Gunthorpe
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-09 13:29 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Sun, 9 Apr 2023 19:58:47 +0800
Yi Liu <yi.l.liu@intel.com> wrote:

> On 2023/4/8 22:20, Alex Williamson wrote:
> > On Sat, 8 Apr 2023 05:07:16 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> >>> From: Alex Williamson <alex.williamson@redhat.com>
> >>> Sent: Saturday, April 8, 2023 5:07 AM
> >>>
> >>> On Fri, 7 Apr 2023 15:47:10 +0000
> >>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >>>      
> >>>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>>> Sent: Friday, April 7, 2023 11:14 PM
> >>>>>
> >>>>> On Fri, 7 Apr 2023 14:04:02 +0000
> >>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >>>>>     
> >>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>>>>> Sent: Friday, April 7, 2023 9:52 PM
> >>>>>>>
> >>>>>>> On Fri, 7 Apr 2023 13:24:25 +0000
> >>>>>>> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >>>>>>>     
> >>>>>>>>> From: Alex Williamson <alex.williamson@redhat.com>
> >>>>>>>>> Sent: Friday, April 7, 2023 8:04 PM
> >>>>>>>>>     
> >>>>>>>>>>>>> @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
> >>>>>>>>>>>>>   	if (!iommu_group)
> >>>>>>>>>>>>>   		return -EPERM; /* Cannot reset non-isolated devices */
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>>     
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Alex,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Is disabling iommu a sane way to test vfio noiommu mode?  
> >>>>>>>>>>>
> >>>>>>>>>>> Yes
> >>>>>>>>>>>     
> >>>>>>>>>>>> I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> >>>>>>>>>>>> I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> >>>>>>>>>>>> iommufd==-1 can succeed, but failed to get hot reset info due to the above
> >>>>>>>>>>>> group check. Reason is that this happens to have some affected devices, and
> >>>>>>>>>>>> these devices have no valid iommu_group (because they are not bound to vfio-pci
> >>>>>>>>>>>> hence nobody allocates noiommu group for them). So when hot reset info loops
> >>>>>>>>>>>> such devices, it failed with -EPERM. Is this expected?
> >>>>>>>>>>>
> >>>>>>>>>>> Hmm, I didn't recall that we put in such a limitation, but given the
> >>>>>>>>>>> minimally intrusive approach to no-iommu and the fact that we never
> >>>>>>>>>>> defined an invalid group ID to return to the user, it makes sense that
> >>>>>>>>>>> we just blocked the ioctl for no-iommu use.  I guess we can do the same
> >>>>>>>>>>> for no-iommu cdev.  
> >>>>>>>>>>
> >>>>>>>>>> I just realize a further issue related to this limitation. Remember that we
> >>>>>>>>>> may finally compile out the vfio group infrastructure in the future. Say I
> >>>>>>>>>> want to test noiommu, I may boot such a kernel with iommu disabled. I think
> >>>>>>>>>> the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> >>>>>>>>>> not support hot reset for noiommu in future if vfio group infrastructure is
> >>>>>>>>>> compiled out?  
> >>>>>>>>>
> >>>>>>>>> We're talking about IOMMU groups, IOMMU groups are always present
> >>>>>>>>> regardless of whether we expose a vfio group interface to userspace.
> >>>>>>>>> Remember, we create IOMMU groups even in the no-iommu case.  Even with
> >>>>>>>>> pure cdev, there are underlying IOMMU groups that maintain the DMA
> >>>>>>>>> ownership.  
> >>>>>>>>
> >>>>>>>> hmmm. As [1], when iommu is disabled, there will be no iommu_group for a
> >>>>>>>> given device unless it is registered to VFIO, which a fake group is created.
> >>>>>>>> That's why I hit the limitation [1]. When vfio_group is compiled out, then
> >>>>>>>> even fake group goes away.  
> >>>>>>>
> >>>>>>> In the vfio group case, [1] can be hit with no-iommu only when there
> >>>>>>> are affected devices which are not bound to vfio.  
> >>>>>>
> >>>>>> yes. because vfio would allocate fake group when device is registered to
> >>>>>> it.
> >>>>>>     
> >>>>>>> Why are we not
> >>>>>>> allocating an IOMMU group to no-iommu devices when vfio group is
> >>>>>>> disabled?  Thanks,  
> >>>>>>
> >>>>>> hmmm. when the vfio group code is configured out. The
> >>>>>> vfio_device_set_group() just returns 0 after below patch is
> >>>>>> applied and CONFIG_VFIO_GROUP=n. So when there is no
> >>>>>> vfio group, the fake group also goes away.
> >>>>>>
> >>>>>> https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/  
> >>>>>
> >>>>> Is this a fundamental issue or just a problem with the current
> >>>>> implementation proposal?  It seems like the latter.  FWIW, I also don't
> >>>>> see a taint happening in the cdev path for no-iommu use.  Thanks,  
> >>>>
> >>>> yes. the latter case. The reason I raised it here is to confirm the
> >>>> policy on the new group/bdf capability in the DEVICE_GET_INFO. If
> >>>> there is no iommu group, perhaps I only need to exclude the new
> >>>> group/bdf capability from the cap chain of DEVICE_GET_INFO. is it?  
> >>>
> >>> I think we need to revisit the question of why allocating an IOMMU
> >>> group for a no-iommu device is exclusive to the vfio group support.  
> >>
> >> For no-iommu device, the iommu group is a fake group allocated by vfio.
> >> is it? And the fake group allocation is part of the vfio group code.
> >> It is the vfio_device_set_group() in group.c. If vfio group code is not
> >> compiled in, vfio does not allocate fake groups. Detail for this compiling
> >> can be found in link [1].
> >>  
> >>> We've already been down the path of trying to report a field that only
> >>> exists for devices with certain properties with dev-id.  It doesn't
> >>> work well.  I think we've said all along that while the cdev interface
> >>> is device based, there are still going to be underlying IOMMU groups
> >>> for the user to be aware of, they're just not as much a fundamental
> >>> part of the interface.  There should not be a case where a device
> >>> doesn't have a group to report.  Thanks,  
> >>
> >> As the patch in link [1] makes vfio group optional, so if compile a kernel
> >> with CONFIG_VFIO_GROUP=n, and boot it with iommu disabled, then there is no
> >> group to report. Perhaps this is not a typical usage but still a sane usage
> >> for noiommu mode as I confirmed with you in this thread. So when it comes,
> >> needs to consider what to report for the group field.
> >>
> >> Perhaps I messed up the discussion by referring to a patch that is part of
> >> another series. But I think it should be considered when talking about the
> >> group to be reported.  
> > 
> > The question is whether the split that group.c code handles both the
> > vfio group AND creation of the IOMMU group in such cases is the correct
> > split.  I'm not arguing that the way the code is currently laid out has
> > the fake IOMMU group for no-iommu devices created in vfio group
> > specific code, but we have a common interface that makes use of IOMMU
> > group information for which we don't have an equivalent alternative
> > data field to report.  
> 
> yes. It is needed to ensure _HOT_RESET_INFO workable for noiommu devices.
> 
> > We've shown that dev-id doesn't work here because dev-ids only exist
> > for devices within the user's IOMMU context.  Also reporting an invalid
> > ID of any sort fails to indicate the potential implied ownership.
> > Therefore I recognize that if this interface is to report an IOMMU
> > group, then the creation of fake IOMMU groups existing only in vfio
> > group code would need to be refactored.  Thanks,  
> 
> yeah, needs to move the iommu group creation back to vfio_main.c. This
> would be a prerequisite for [1]
> 
> [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> 
> I'll also try out your suggestion to add a capability like below and link
> it in the vfio_device_info cap chain.
> 
> #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> 
> struct vfio_device_info_cap_pci_bdf {
>          struct vfio_info_cap_header header;
>          __u32   group_id;
>          __u16   segment;
>          __u8    bus;
>          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> };
> 

Group-id and bdf should be separate capabilities, all devices should
report a group-id capability and only PCI devices a bdf capability.
Thanks,

Alex
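
A minimal sketch of the split suggested above, written as plain userspace
C so that it compiles on its own. The header mirrors the layout of
struct vfio_info_cap_header; the SKETCH_* capability IDs and the sketch_*
struct names are hypothetical and not part of any merged uAPI. The
slot/function helpers use the same encoding as the kernel's
PCI_SLOT()/PCI_FUNC() macros.

```c
#include <stdint.h>

/* Local copy of the uAPI capability header layout. */
struct cap_header {
	uint16_t id;		/* capability identifier */
	uint16_t version;	/* version of this capability */
	uint32_t next;		/* offset of the next capability, 0 ends the chain */
};

/* Hypothetical IDs, chosen only for this sketch. */
#define SKETCH_CAP_GROUP_ID	5
#define SKETCH_CAP_PCI_BDF	6

/* Would be reported by every device. */
struct sketch_cap_group_id {
	struct cap_header header;
	uint32_t group_id;
};

/* Would be reported by PCI devices only. */
struct sketch_cap_pci_bdf {
	struct cap_header header;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;		/* decode with the helpers below */
};

/* Device (slot) number: upper five bits of devfn. */
static inline unsigned int sketch_pci_slot(uint8_t devfn)
{
	return (devfn >> 3) & 0x1f;
}

/* Function number: lower three bits of devfn. */
static inline unsigned int sketch_pci_func(uint8_t devfn)
{
	return devfn & 0x07;
}
```

With this layout a non-PCI device would carry only the group-id
capability in its cap chain, and a PCI device would carry both.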


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-09 13:29                             ` Alex Williamson
@ 2023-04-10  8:48                               ` Liu, Yi L
  2023-04-10 14:41                                 ` Alex Williamson
  2023-04-11 13:34                               ` Jason Gunthorpe
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-10  8:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Sunday, April 9, 2023 9:30 PM
[...]
> > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > would be a prerequisite for [1]
> >
> > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> >
> > I'll also try out your suggestion to add a capability like below and link
> > it in the vfio_device_info cap chain.
> >
> > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> >
> > struct vfio_device_info_cap_pci_bdf {
> >          struct vfio_info_cap_header header;
> >          __u32   group_id;
> >          __u16   segment;
> >          __u8    bus;
> >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> >
> 
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

OK. Since this is to support the device-fd-passing usage, all the vfio
device drivers need to report the group-id capability, right? So we may
have a helper like the one below in vfio_main.c. How about the sample
drivers? It seems unnecessary for them, right?

int vfio_pci_info_add_group_cap(struct device *dev,
                                struct vfio_info_cap *caps)
{
        struct vfio_pci_device_info_cap_group cap = {
                .header.id = VFIO_DEVICE_INFO_CAP_GROUP_ID,
                .header.version = 1,
        };
        struct iommu_group *iommu_group;

        iommu_group = iommu_group_get(dev);
        if (!iommu_group) {
                kfree(caps->buf);
                return -EPERM;
        }

        cap.group_id = iommu_group_id(iommu_group);

        iommu_group_put(iommu_group);

        return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
}

Regards,
Yi Liu
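
The helper above adds a capability on the kernel side. A userspace
consumer would locate it by walking the capability chain returned by
VFIO_DEVICE_GET_INFO, where cap_offset and each header's next field are
offsets from the start of the info buffer. The walker below is a sketch
under that assumption; sketch_find_cap() is a hypothetical name and the
header is a local copy of the uAPI layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the uAPI capability header layout. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;		/* offset of the next capability, 0 ends the chain */
};

/*
 * Search the capability chain in an info buffer of 'size' bytes.
 * 'first' is the offset of the first capability. Returns the offset
 * of the capability whose id matches, or 0 if it is not present or
 * the chain is malformed.
 */
static uint32_t sketch_find_cap(const uint8_t *buf, size_t size,
				uint32_t first, uint16_t id)
{
	uint32_t off = first;
	size_t hops = 0;

	/* The hop limit guards against a chain that loops back on itself. */
	while (off != 0 && hops++ < size) {
		struct cap_header hdr;

		if (off > size || size - off < sizeof(hdr))
			return 0;	/* header would run past the buffer */

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

Offset 0 can serve as the "not found" value because the buffer always
begins with the info structure itself, never with a capability.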

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-10  8:48                               ` Liu, Yi L
@ 2023-04-10 14:41                                 ` Alex Williamson
  2023-04-10 15:18                                   ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-10 14:41 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, 10 Apr 2023 08:48:54 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Sunday, April 9, 2023 9:30 PM  
> [...]
> > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > would be a prerequisite for [1]
> > >
> > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > >
> > > I'll also try out your suggestion to add a capability like below and link
> > > it in the vfio_device_info cap chain.
> > >
> > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> > >
> > > struct vfio_device_info_cap_pci_bdf {
> > >          struct vfio_info_cap_header header;
> > >          __u32   group_id;
> > >          __u16   segment;
> > >          __u8    bus;
> > >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > };
> > >  
> > 
> > Group-id and bdf should be separate capabilities, all device should
> > report a group-id capability and only PCI devices a bdf capability.  
> 
> ok. Since this is to support the device fd passing usage, so we need to
> let all the vfio device drivers report group-id capability. is it? So may
> have a below helper in vfio_main.c. How about the sample drivers?
> seems not necessary for them. right?

The more common we can make it, the better, but if it ends up that the
individual drivers need to initialize the capability then it would
probably be limited to those drivers with a need to expose the group.
Sample drivers for the purpose of illustrating the interface and of
course anything based on vfio-pci-core which exposes hot-reset.  Thanks

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-10 14:41                                 ` Alex Williamson
@ 2023-04-10 15:18                                   ` Liu, Yi L
  2023-04-10 15:23                                     ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-10 15:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Monday, April 10, 2023 10:41 PM
> 
> On Mon, 10 Apr 2023 08:48:54 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Sunday, April 9, 2023 9:30 PM
> > [...]
> > > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > > would be a prerequisite for [1]
> > > >
> > > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > > >
> > > > I'll also try out your suggestion to add a capability like below and link
> > > > it in the vfio_device_info cap chain.
> > > >
> > > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> > > >
> > > > struct vfio_device_info_cap_pci_bdf {
> > > >          struct vfio_info_cap_header header;
> > > >          __u32   group_id;
> > > >          __u16   segment;
> > > >          __u8    bus;
> > > >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > > };
> > > >
> > >
> > > Group-id and bdf should be separate capabilities, all device should
> > > report a group-id capability and only PCI devices a bdf capability.
> >
> > ok. Since this is to support the device fd passing usage, so we need to
> > let all the vfio device drivers report group-id capability. is it? So may
> > have a below helper in vfio_main.c. How about the sample drivers?
> > seems not necessary for them. right?
> 
> The more common we can make it, the better, but if it ends up that the
> individual drivers need to initialize the capability then it would
> probably be limited to those driver with a need to expose the group.

This looks to be such a case. vfio_device_info is assembled by the
individual drivers, so reporting the group_id capability as common
behavior requires changing all of them. I have a quick draft in the
commit below:

https://github.com/yiliu1765/iommufd/commit/ff4b8bee90761961041126305183a9a7e0f0542d

https://github.com/yiliu1765/iommufd/commits/report_group_id

> Sample drivers for the purpose of illustrating the interface and of
> course anything based on vfio-pci-core which exposes hot-reset.  Thanks

Do you see any sample drivers that need to report the group_id cap? IMHO,
it seems not.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-10 15:18                                   ` Liu, Yi L
@ 2023-04-10 15:23                                     ` Alex Williamson
  0 siblings, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-10 15:23 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, 10 Apr 2023 15:18:27 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Monday, April 10, 2023 10:41 PM
> > 
> > On Mon, 10 Apr 2023 08:48:54 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Sunday, April 9, 2023 9:30 PM  
> > > [...]  
> > > > > yeah, needs to move the iommu group creation back to vfio_main.c. This
> > > > > would be a prerequisite for [1]
> > > > >
> > > > > [1] https://lore.kernel.org/kvm/20230401151833.124749-25-yi.l.liu@intel.com/
> > > > >
> > > > > I'll also try out your suggestion to add a capability like below and link
> > > > > it in the vfio_device_info cap chain.
> > > > >
> > > > > #define VFIO_DEVICE_INFO_CAP_PCI_BDF          5
> > > > >
> > > > > struct vfio_device_info_cap_pci_bdf {
> > > > >          struct vfio_info_cap_header header;
> > > > >          __u32   group_id;
> > > > >          __u16   segment;
> > > > >          __u8    bus;
> > > > >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > > > > };
> > > > >  
> > > >
> > > > Group-id and bdf should be separate capabilities, all device should
> > > > report a group-id capability and only PCI devices a bdf capability.  
> > >
> > > ok. Since this is to support the device fd passing usage, so we need to
> > > let all the vfio device drivers report group-id capability. is it? So may
> > > have a below helper in vfio_main.c. How about the sample drivers?
> > > seems not necessary for them. right?  
> > 
> > The more common we can make it, the better, but if it ends up that the
> > individual drivers need to initialize the capability then it would
> > probably be limited to those driver with a need to expose the group.  
> 
> looks to be such a case. vfio_device_info is assembled by the individual
> drivers. If want to report group_id capability as a common behavior, needs
> to change all of them. Had a quick draft for it as below commit:
> 
> https://github.com/yiliu1765/iommufd/commit/ff4b8bee90761961041126305183a9a7e0f0542d
> 
> https://github.com/yiliu1765/iommufd/commits/report_group_id
> 
> > Sample drivers for the purpose of illustrating the interface and of
> > course anything based on vfio-pci-core which exposes hot-reset.  Thanks  
> 
> do you see any sample drivers need to report group_id cap? IMHO, seems
> no.

As in the quoted text, part of the purpose of the sample drivers is to
act both as a proof-of-concept and illustration of the API, therefore
gratuitous exposure of such capabilities should be encouraged.  They
would also provide a proof point of an mdev device, ie. emulated IOMMU
device, exposing the capability.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 12:03         ` Alex Williamson
  2023-04-07 13:24           ` Liu, Yi L
@ 2023-04-11  6:16           ` Liu, Yi L
  1 sibling, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-11  6:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

Hi Alex,

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 7, 2023 8:04 PM
> 
> On Fri, 7 Apr 2023 10:09:58 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > Hi Alex,
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Monday, April 3, 2023 11:02 PM
> > >
> > > On Mon, 3 Apr 2023 09:25:06 +0000
> > > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Saturday, April 1, 2023 10:44 PM
> > > >
> > > > > @@ -791,7 +813,21 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void
> > > *data)
> > > > >  	if (!iommu_group)
> > > > >  		return -EPERM; /* Cannot reset non-isolated devices */
> > > >
> > > > Hi Alex,
> > > >
> > > > Is disabling iommu a sane way to test vfio noiommu mode?
> > >
> > > Yes
> > >
> > > > I added intel_iommu=off to disable intel iommu and bind a device to vfio-pci.
> > > > I can see the /dev/vfio/noiommu-0 and /dev/vfio/devices/noiommu-vfio0. Bind
> > > > iommufd==-1 can succeed, but failed to get hot reset info due to the above
> > > > group check. Reason is that this happens to have some affected devices, and
> > > > these devices have no valid iommu_group (because they are not bound to vfio-pci
> > > > hence nobody allocates noiommu group for them). So when hot reset info loops
> > > > such devices, it failed with -EPERM. Is this expected?
> > >
> > > Hmm, I didn't recall that we put in such a limitation, but given the
> > > minimally intrusive approach to no-iommu and the fact that we never
> > > defined an invalid group ID to return to the user, it makes sense that
> > > we just blocked the ioctl for no-iommu use.  I guess we can do the same
> > > for no-iommu cdev.
> >
> > I just realize a further issue related to this limitation. Remember that we
> > may finally compile out the vfio group infrastructure in the future. Say I
> > want to test noiommu, I may boot such a kernel with iommu disabled. I think
> > the _INFO ioctl would fail as there is no iommu_group. Does it mean we will
> > not support hot reset for noiommu in future if vfio group infrastructure is
> > compiled out?
> 
> We're talking about IOMMU groups, IOMMU groups are always present
> regardless of whether we expose a vfio group interface to userspace.
> Remember, we create IOMMU groups even in the no-iommu case.  Even with
> pure cdev, there are underlying IOMMU groups that maintain the DMA
> ownership.

I just realized that there is one case that does not have an iommu group,
although it is not implemented yet. There was a discussion on SIOV
support and, IIRC, it was agreed that there is no need to allocate an
iommu_group for the SIOV case. Kevin or Jason can keep me honest here.
I could not find the link to that discussion.

> > As another thread, we are going to add a new bdf/group capability to
> > DEVICE_GET_INFO. If the above kernel is booted, shall we exclude the new
> > bdf/group capability or add a flag in the capability to mark the group_id
> > is invalid?
> 
> As above, there's always an IOMMU group, it's never invalid.  Thanks,

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-06 17:53                         ` Alex Williamson
  2023-04-07 10:09                           ` Liu, Yi L
@ 2023-04-11 13:24                           ` Jason Gunthorpe
  2023-04-11 15:54                             ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-11 13:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:

> Where whether a device is opened is subject to change outside of the
> user's control.  This essentially allows the user to perform hot-resets
> of devices outside of their ownership so long as the device is not
> used elsewhere, versus the current requirement that the user own all the
> affected groups, which implies device ownership.  It's not been
> justified why this feature needs to exist, imo.

The cdev API doesn't have the notion that owning a group means you
"own" some collection of devices. It still happens as a side effect,
but it isn't obviously part of the API. I'm really loath to
re-introduce that group-based concept just for this. We are trying to
reduce the group API surface.

How about a different direction.

We add a new uAPI for cdev mode that is "take ownership of the reset
group". Maybe it can be a flag during bind.

When requested vfio will ensure that every device in the reset group
is only bound to this iommufd_ctx or left closed. Now and in the
future. Since no-iommu has no iommufd_ctx this means we can open only
one device in the reset group.

With this flag RESET is guaranteed to always work by definition.

We continue with the zero-length FD, but we can just replace the
security checks with a check if we are in reset group ownership mode.

_INFO is unchanged.

We decide if we add a new IOCTL to return the BDF so the existing
_INFO can get back to the dev_id or a new IOCTL that returns the
dev_id list of the reset group.

Userspace is required to figure out the extent of the reset, but we
don't require that userspace prove to the kernel it did this when
requesting the reset.

Jason
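
One way to read the rule "every device in the reset group is only bound
to this iommufd_ctx or left closed" is as a predicate over the reset
group. The toy model below is a sketch under that reading, not vfio
code: struct sketch_dev, SKETCH_NO_CTX and sketch_can_own_reset_group()
are hypothetical names, and a plain integer stands in for the
iommufd_ctx.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for "no iommufd_ctx", as with a no-iommu device. */
#define SKETCH_NO_CTX	(-1)

struct sketch_dev {
	bool opened;	/* is the device currently open? */
	int ctx;	/* context it is bound to; only meaningful when opened */
};

/*
 * Decide whether the device at index 'self' may take ownership of the
 * reset group 'devs'. Every other device must be closed or bound to the
 * same context. A no-iommu device has no context to share, so it can
 * only own the group if it is the sole opened device.
 */
static bool sketch_can_own_reset_group(const struct sketch_dev *devs,
				       size_t n, size_t self)
{
	for (size_t i = 0; i < n; i++) {
		if (i == self || !devs[i].opened)
			continue;
		if (devs[self].ctx == SKETCH_NO_CTX ||
		    devs[i].ctx != devs[self].ctx)
			return false;
	}
	return true;
}
```

The "now and in the future" part would additionally require repeating a
check of this kind whenever another device in the reset group is opened.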

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-07 21:07                     ` Alex Williamson
  2023-04-08  5:07                       ` Liu, Yi L
@ 2023-04-11 13:33                       ` Jason Gunthorpe
  1 sibling, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-11 13:33 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Fri, Apr 07, 2023 at 03:07:21PM -0600, Alex Williamson wrote:

> I think we need to revisit the question of why allocating an IOMMU
> group for a no-iommu device is exclusive to the vfio group support.

One of the points of this effort is to remove so much of the co-mingling
of iommu and VFIO. We should not create the fake iommu groups for
no-iommu.

The _INFO API reporting the group is not a good reason to wreck this
clean separation.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-09 13:29                             ` Alex Williamson
  2023-04-10  8:48                               ` Liu, Yi L
@ 2023-04-11 13:34                               ` Jason Gunthorpe
  1 sibling, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-11 13:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Yi Liu, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Sun, Apr 09, 2023 at 07:29:51AM -0600, Alex Williamson wrote:

> > struct vfio_device_info_cap_pci_bdf {
> >          struct vfio_info_cap_header header;
> >          __u32   group_id;
> >          __u16   segment;
> >          __u8    bus;
> >          __u8    devfn; /* Use PCI_SLOT/PCI_FUNC */
> > };
> > 
> 
> Group-id and bdf should be separate capabilities, all device should
> report a group-id capability and only PCI devices a bdf capability.

Group should be reported by iommufd using a generic ioctl, and not be
part of VFIO.

This should report BDF only and only work for PCI.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-11 13:24                           ` Jason Gunthorpe
@ 2023-04-11 15:54                             ` Alex Williamson
  2023-04-11 17:11                               ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-11 15:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, 11 Apr 2023 10:24:58 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:
> 
> > Where whether a device is opened is subject to change outside of the
> > user's control.  This essentially allows the user to perform hot-resets
> > of devices outside of their ownership so long as the device is not
> > used elsewhere, versus the current requirement that the user own all the
> > affected groups, which implies device ownership.  It's not been
> > justified why this feature needs to exist, imo.  
> 
> The cdev API doesn't have the notion that owning a group means you
> "own" some collection of devices. It still happens as a side effect,
> but it isn't obviously part of the API. I'm really loath to
> re-introduce that group-based concept just for this. We are trying
> reduce the group API surface.
> 
> How about a different direction.
> 
> We add a new uAPI for cdev mode that is "take ownership of the reset
> group". Maybe it can be a flag in during bind.
> 
> When requested vfio will ensure that every device in the reset group
> is only bound to this iommufd_ctx or left closed. Now and in the
> future. Since no-iommu has no iommufd_ctx this means we can open only
> one device in the reset group.
> 
> With this flag RESET is guaranteed to always work by definition.
> 
> We continue with the zero-length FD, but we can just replace the
> security checks with a check if we are in reset group ownership mode.
> 
> _INFO is unchanged.
> 
> We decide if we add a new IOCTL to return the BDF so the existing
> _INFO can get back to the dev_id or a new IOCTL that returns the
> dev_id list of the reset group.
> 
> Userspace is required to figure out the extent of the reset, but we
> don't require that userspace prove to the kernel it did this when
> requesting the reset.

Take for example a multi-function PCIe device with ACS isolation between
functions, are you going to allow a user who has only been granted
ownership of a subset of functions control of the entire dev_set?  It
seems this proposal essentially extends the ownership model to the
greater of the dev_set or iommu group, apparently neither of which are
explicitly exposed to the user in the cdev API.  How does a user
determine when devices cannot be used independently in the cdev API?
Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-11 15:54                             ` Alex Williamson
@ 2023-04-11 17:11                               ` Alex Williamson
  2023-04-11 18:40                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-11 17:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

[Appears the list got dropped, replying to my previous message to re-add]

On Tue, 11 Apr 2023 13:32:16 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 09:54:17AM -0600, Alex Williamson wrote:
> > On Tue, 11 Apr 2023 10:24:58 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Thu, Apr 06, 2023 at 11:53:47AM -0600, Alex Williamson wrote:
> > >   
> > > > Where whether a device is opened is subject to change outside of the
> > > > user's control.  This essentially allows the user to perform hot-resets
> > > > of devices outside of their ownership so long as the device is not
> > > > used elsewhere, versus the current requirement that the user own all the
> > > > affected groups, which implies device ownership.  It's not been
> > > > justified why this feature needs to exist, imo.    
> > > 
> > > The cdev API doesn't have the notion that owning a group means you
> > > "own" some collection of devices. It still happens as a side effect,
> > > but it isn't obviously part of the API. I'm really loath to
> > > re-introduce that group-based concept just for this. We are trying
> > > reduce the group API surface.
> > > 
> > > How about a different direction.
> > > 
> > > We add a new uAPI for cdev mode that is "take ownership of the reset
> > > group". Maybe it can be a flag in during bind.
> > > 
> > > When requested vfio will ensure that every device in the reset group
> > > is only bound to this iommufd_ctx or left closed. Now and in the
> > > future. Since no-iommu has no iommufd_ctx this means we can open only
> > > one device in the reset group.
> > > 
> > > With this flag RESET is guaranteed to always work by definition.
> > > 
> > > We continue with the zero-length FD, but we can just replace the
> > > security checks with a check if we are in reset group ownership mode.
> > > 
> > > _INFO is unchanged.
> > > 
> > > We decide if we add a new IOCTL to return the BDF so the existing
> > > _INFO can get back to the dev_id or a new IOCTL that returns the
> > > dev_id list of the reset group.
> > > 
> > > Userspace is required to figure out the extent of the reset, but we
> > > don't require that userspace prove to the kernel it did this when
> > > requesting the reset.  
> > 
> > Take for example a multi-function PCIe device with ACS isolation between
> > functions, are you going to allow a user who has only been granted
> > ownership of a subset of functions control of the entire dev_set?  
> 
> Our cdev model says that opening a cdev locks out other cdevs from
> independent use, eg because of the group sharing. Extending this to
> include the reset group as well seems consistent.

The DMA ownership model based on the IOMMU group is consistent with
legacy vfio, but now you're proposing a new ownership model that
optionally allows a user to extend their ownership, opportunistically
lock out other users, and wreak havoc for management utilities that
also have no insight into dev_sets or userspace driver behavior.

> There is some security concern here, but that goes both ways, a 3rd
> party should not be able to break an application that needs to use
> this RESET and had sufficient privileges to assert an ownership.

There are clearly scenarios we have now that could break.  For example,
today if QEMU doesn't own all the IOMMU groups for a multi-function
device, it can't do a reset, the remaining functions are available for
other users.  As I understand the proposal, QEMU now gets to attempt to
claim ownership of the dev_set, so it opportunistically extends its
ownership and may block other users from the affected devices.
Ordering makes this effectively unpredictable, if a userspace like DPDK
that doesn't assert dev_set ownership is started first, QEMU can start
and be denied hot-reset support.  In the reverse ordering, the DPDK
application can be locked out by QEMU.
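
The ordering dependence can be illustrated with a toy model of a
two-function dev_set. This is only a sketch of the scenario as described
here, under the assumption that an ownership claim succeeds only when no
other user has a function open, and that a successful claim blocks later
opens by other users. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NFUNCS	2
#define SKETCH_NOBODY	0	/* user id meaning "not opened / not owned" */

struct sketch_dev_set {
	int owner;			/* user holding reset ownership */
	int opened_by[SKETCH_NFUNCS];	/* user that opened each function */
};

/*
 * Open function 'fn' on behalf of 'user'. Returns false if the open is
 * refused because another user owns the dev_set. If 'want_reset' is
 * set, ownership is granted only when no other user has a function
 * open; otherwise the open still succeeds, just without ownership.
 */
static bool sketch_open(struct sketch_dev_set *s, size_t fn, int user,
			bool want_reset)
{
	if (s->owner != SKETCH_NOBODY && s->owner != user)
		return false;	/* locked out by the current owner */

	s->opened_by[fn] = user;

	if (want_reset) {
		bool exclusive = true;

		for (size_t i = 0; i < SKETCH_NFUNCS; i++) {
			if (s->opened_by[i] != SKETCH_NOBODY &&
			    s->opened_by[i] != user)
				exclusive = false;
		}
		if (exclusive)
			s->owner = user;
	}
	return true;
}
```

In this model, a non-claiming user that opens first leaves the later
claimant without ownership, while a claimant that opens first causes the
other user's open to fail.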

> I'd say anyone should be able to assert RESET ownership if, like
> today, the iommufd_ctx has all the groups of the dev_set inside
> it. Once asserted it becomes safe against all forms of hotplug, and
> continues to be safe even if some of the devices are closed. eg hot
> unplugging from the VM doesn't change the availability of RESET.
> 
> This comes from your ask that qemu know clearly if RESET works, and it
> doesn't change while qemu is running. This seems stronger and clearer
> than the current implicit scheme. It also doesn't require usespace to
> do any calculations with groups or BDFs to figure out of RESET is
> available, kernel confirms it directly.

As above, clarity and predictability seem lacking in this proposal.
With the current scheme, the ownership of the affected devices is
implied if they exist within an owned group, but the strength of that
ownership is clear.  Affected devices outside the set of owned groups
mean that hot-reset is unavailable, without any of these "but QEMU might
be able to request it" or "unless the affected device is currently
unopened" variables.

> > seems this proposal essentially extends the ownership model to the
> > greater of the dev_set or iommu group, apparently neither of which
> > are explicitly exposed to the user in the cdev API.  
> 
> IIRC the group id can be learned from sysfs before opening the cdev
> file. Something like /sys/class/vfio/XX/../../iommu_group

And in the passed cdev fd model... ?

> We should also have an iommufd ioctl to report the "same ioas"
> groupings of dev_ids to make it easy on userspace. I haven't checked
> to see what the current qemu patches are doing with this..

Seems we're ignoring that no-iommu doesn't have a valid iommufd.

> > How does a user determine when devices cannot be used independently
> > in the cdev API?   
> 
> We have this problem right now. The only way to learn the reset group
> is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> under /sys/class/vfio/XX/ if something needs it earlier.

For all the complaints about complexity, now we're asking management
tools to not only take into account IOMMU groups, but also reset
groups, and some inferred knowledge about the application and devices
to speculate whether reset group ownership is taken by a given
userspace??  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-11 17:11                               ` Alex Williamson
@ 2023-04-11 18:40                                 ` Jason Gunthorpe
  2023-04-11 21:58                                   ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-11 18:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> [Appears the list got dropped, replying to my previous message to re-add]

Wow, this got messed up a lot; mutt drops the cc when replying for some
reason. I think it is fixed up now.

> > Our cdev model says that opening a cdev locks out other cdevs from
> > independent use, eg because of the group sharing. Extending this to
> > include the reset group as well seems consistent.
> 
> The DMA ownership model based on the IOMMU group is consistent with
> legacy vfio, but now you're proposing a new ownership model that
> optionally allows a user to extend their ownership, opportunistically
> lock out other users, and wreaking havoc for management utilities that
> also have no insight into dev_sets or userspace driver behavior.

I suggested below that the ownership requires enough open devices - so
it doesn't "extend ownership opportunistically", and there is no
havoc.

Management tools already need to understand dev_set if they want to
offer reliable reset support to the VMs. Same as today.
 
> > There is some security concern here, but that goes both ways, a 3rd
> > party should not be able to break an application that needs to use
> > this RESET and had sufficient privileges to assert an ownership.
> 
> There are clearly scenarios we have now that could break.  For example,
> today if QEMU doesn't own all the IOMMU groups for a mult-function
> device, it can't do a reset, the remaining functions are available for
> other users. 

Sure, and we can keep that with this approach.

> As I understand the proposal, QEMU now gets to attempt to
> claim ownership of the dev_set, so it opportunistically extends its
> ownership and may block other users from the affected devices.

We can decide the policy for the kernel to accept a claim. I suggested
below "same as today" - it must hold all the groups within the
iommufd_ctx.

The main point is to make this claiming operation qemu needs to do
clearer and more explicit. I view this as better than trying to guess
if it successfully made the claim by inspecting the _INFO output.

> > I'd say anyone should be able to assert RESET ownership if, like
> > today, the iommufd_ctx has all the groups of the dev_set inside
> > it. Once asserted it becomes safe against all forms of hotplug, and
> > continues to be safe even if some of the devices are closed. eg hot
> > unplugging from the VM doesn't change the availability of RESET.
> > 
> > This comes from your ask that qemu know clearly if RESET works, and it
> > doesn't change while qemu is running. This seems stronger and clearer
> > than the current implicit scheme. It also doesn't require usespace to
> > do any calculations with groups or BDFs to figure out of RESET is
> > available, kernel confirms it directly.
> 
> As above, clarity and predictability seem lacking in this proposal.
> With the current scheme, the ownership of the affected devices is
> implied if they exist within an owned group, but the strength of that
> ownership is clear.  

Same logic holds here

Ownership is claimed same as today by having all groups represented
in the iommufd_ctx. This seems just as clear as today.

> > > seems this proposal essentially extends the ownership model to the
> > > greater of the dev_set or iommu group, apparently neither of which
> > > are explicitly exposed to the user in the cdev API.  
> > 
> > IIRC the group id can be learned from sysfs before opening the cdev
> > file. Something like /sys/class/vfio/XX/../../iommu_group
> 
> And in the passed cdev fd model... ?

IMHO we should try to avoid needing to expose group_id specifically to
userspace. We are missing a way to learn the "same ioas" restriction
in iommufd, and it should provide that directly based on dev_ids.

Otherwise if we really really need group_id then iommufd should
provide an ioctl to get it. Let's find a good reason first

> > We should also have an iommufd ioctl to report the "same ioas"
> > groupings of dev_ids to make it easy on userspace. I haven't checked
> > to see what the current qemu patches are doing with this..
> 
> Seems we're ignoring that no-iommu doesn't have a valid iommufd.

no-iommu doesn't and shouldn't have iommu_groups either. It also
doesn't have an IOAS so querying for same-IOAS is not necessary.

The simplest option for no-iommu is to require it to pass in every
device fd to the reset ioctl.

> > > How does a user determine when devices cannot be used independently
> > > in the cdev API?   
> > 
> > We have this problem right now. The only way to learn the reset group
> > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > under /sys/class/vfio/XX/ if something needs it earlier.
> 
> For all the complaints about complexity, now we're asking management
> tools to not only take into account IOMMU groups, but also reset
> groups, and some inferred knowledge about the application and devices
> to speculate whether reset group ownership is taken by a given
> userspace??

No, we are trying to keep things pretty much the same as today without
resorting to exposing a lot of group related concepts.

The reset group is a clear concept that already exists and isn't
exposed. If we really need to know about it then it should be exposed
on its own, as a separate discussion from this cdev stuff.

I want to re-focus on the basics of what cdev is supposed to be doing,
because several of the ideas you suggested seem against this direction:

 - cdev does not have, and cannot rely on vfio_groups. We enforce this
   by compiling all the vfio_group infrastructure out. iommu_groups
   continue to exist.
   
   So converting a cdev to a vfio_group is not an allowed operation.

 - no-iommu should not have iommu_groups. We enforce this by compiling
   out all the no-iommu vfio_group infrastructure.

 - cdev APIs should ideally not require the user to know the group_id,
   we should try hard to design APIs to avoid this.

We have solved every other problem but reset like this, I would like
to get past reset without compromising the above.

Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-11 18:40                                 ` Jason Gunthorpe
@ 2023-04-11 21:58                                   ` Alex Williamson
  2023-04-12  0:01                                     ` Jason Gunthorpe
  2023-04-12  7:14                                     ` Tian, Kevin
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-11 21:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, 11 Apr 2023 15:40:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> > [Appears the list got dropped, replying to my previous message to re-add]  
> 
> Wowo this got mesed up alot, mutt drops the cc when replying for some
> reason. I think it is fixed up now
> 
> > > Our cdev model says that opening a cdev locks out other cdevs from
> > > independent use, eg because of the group sharing. Extending this to
> > > include the reset group as well seems consistent.  
> > 
> > The DMA ownership model based on the IOMMU group is consistent with
> > legacy vfio, but now you're proposing a new ownership model that
> > optionally allows a user to extend their ownership, opportunistically
> > lock out other users, and wreaking havoc for management utilities that
> > also have no insight into dev_sets or userspace driver behavior.  
> 
> I suggested below that the owership require enough open devices - so
> it doesn't "extend ownership opportunistically", and there is no
> havoc.
> 
> Management tools already need to understand dev_set if they want to
> offer reliable reset support to the VMs. Same as today.

I don't think that's true.  Our primary hot-reset use case is GPUs and
subordinate functions, where the isolation and reset scope are often
sufficiently similar to make hot-reset possible, regardless of whether
all the functions are assigned to a VM.  I don't think you'll find any
management tools that take reset scope into account otherwise.

> > > There is some security concern here, but that goes both ways, a 3rd
> > > party should not be able to break an application that needs to use
> > > this RESET and had sufficient privileges to assert an ownership.  
> > 
> > There are clearly scenarios we have now that could break.  For example,
> > today if QEMU doesn't own all the IOMMU groups for a mult-function
> > device, it can't do a reset, the remaining functions are available for
> > other users.   
> 
> Sure, and we can keep that with this approach.
> 
> > As I understand the proposal, QEMU now gets to attempt to
> > claim ownership of the dev_set, so it opportunistically extends its
> > ownership and may block other users from the affected devices.  
> 
> We can decide the policy for the kernel to accept a claim. I suggested
> below "same as today" - it must hold all the groups within the
> iommufd_ctx.

It must hold all the groups [that the user doesn't know about because
it's not a formal part of the cdev API] within the iommufd_ctx?
 
> The main point is to make this claiming operation qemu needs to do
> clearer and more explicit. I view this as better than trying to guess
> if it successfully made the claim by inspecting the _INFO output.

There is no guessing in the current API.  Guessing is what happens
when hot-reset magically works because one of the devices wasn't opened
at the time, or the iommufd_ctx happens to hold all the affected groups
that the user doesn't have an API to understand.  The current API has a
very concise requirement: the user must own all of the groups affected
by the hot-reset in order to effect a hot-reset.
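
As a rough illustration of how small that rule is, it reduces to a
subset test. This is a sketch only, not the kernel implementation; the
function name and the integer group IDs below are made up for the
example.

```python
def hot_reset_allowed(affected_groups, owned_groups):
    """True only if every IOMMU group touched by the reset is owned by the caller."""
    return set(affected_groups) <= set(owned_groups)


# Hypothetical two-function device whose functions sit in IOMMU groups 10 and 11.
affected = {10, 11}
print(hot_reset_allowed(affected, {10}))      # caller owns only one of the groups
print(hot_reset_allowed(affected, {10, 11}))  # caller owns every affected group
```

Nothing in this check depends on which devices happen to be open at the
time or on what other users are doing.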

> > > I'd say anyone should be able to assert RESET ownership if, like
> > > today, the iommufd_ctx has all the groups of the dev_set inside
> > > it. Once asserted it becomes safe against all forms of hotplug, and
> > > continues to be safe even if some of the devices are closed. eg hot
> > > unplugging from the VM doesn't change the availability of RESET.
> > > 
> > > This comes from your ask that qemu know clearly if RESET works, and it
> > > doesn't change while qemu is running. This seems stronger and clearer
> > > than the current implicit scheme. It also doesn't require usespace to
> > > do any calculations with groups or BDFs to figure out of RESET is
> > > available, kernel confirms it directly.  
> > 
> > As above, clarity and predictability seem lacking in this proposal.
> > With the current scheme, the ownership of the affected devices is
> > implied if they exist within an owned group, but the strength of that
> > ownership is clear.    
> 
> Same logic holds here
> 
> Ownership is claimed same as today by having all groups representated
> in the iommufd_ctx. This seems just as clear as today.

I don't know if anyone else is having this trouble, but I'm seeing
conflicting requirements.  The cdev API is not to expose groups unless
a requirement is found to need them, of which this is apparently not
one, but all the groups need to be represented in the iommufd_ctx in
order to make use of this interface.  How is that clear?

> > > > seems this proposal essentially extends the ownership model to the
> > > > greater of the dev_set or iommu group, apparently neither of which
> > > > are explicitly exposed to the user in the cdev API.    
> > > 
> > > IIRC the group id can be learned from sysfs before opening the cdev
> > > file. Something like /sys/class/vfio/XX/../../iommu_group  
> > 
> > And in the passed cdev fd model... ?  
> 
> IMHO we should try to avoid needing to expose group_id specifically to
> userspace. We are missing a way to learn the "same ioas" restriction
> in iommufd, and it should provide that directly based on dev_ids.

Is this yet another "we need to expose groups to understand the ioas
restriction but we're not going to because reasons" argument?

> Otherwise if we really really need group_id then iommufd should
> provide an ioctl to get it. Let's find a good reason first

If needing to have all of the groups represented in an iommufd_ctx in
order to effect a reset without allowing the user to know the set of
affected groups and device to group relationship isn't a reason... well
I'm just lost.

> > > We should also have an iommufd ioctl to report the "same ioas"
> > > groupings of dev_ids to make it easy on userspace. I haven't checked
> > > to see what the current qemu patches are doing with this..  
> > 
> > Seems we're ignoring that no-iommu doesn't have a valid iommufd.  
> 
> no-iommu doesn't and shouldn't have iommu_groups either. It also
> doesn't have an IOAS so querying for same-IOAS is not necessary.
> 
> The simplest option for no-iommu is to require it to pass in every
> device fd to the reset ioctl.

Which ironically is exactly how it ends up working today, each no-iommu
device has a fake IOMMU group, so every affected device (group) needs
to be provided.

> > > > How does a user determine when devices cannot be used independently
> > > > in the cdev API?     
> > > 
> > > We have this problem right now. The only way to learn the reset group
> > > is to call the _INFO ioctl. We could add a sysfs "pci_reset_group"
> > > under /sys/class/vfio/XX/ if something needs it earlier.  
> > 
> > For all the complaints about complexity, now we're asking management
> > tools to not only take into account IOMMU groups, but also reset
> > groups, and some inferred knowledge about the application and devices
> > to speculate whether reset group ownership is taken by a given
> > userspace??  
> 
> No, we are trying to keep things pretty much the same as today without
> resorting to exposing a lot of group related concepts.
> 
> The reset group is a clear concept that already exists and isn't
> exposed. If we really need to know about it then it should be exposed
> on its own, as a seperate discussion from this cdev stuff.

"[A]nd isn't exposed"... what exactly is the hot-reset INFO ioctl
exposing if not that?

> I want to re-focus on the basics of what cdev is supposed to be doing,
> because several of the idea you suggested seem against this direction:
> 
>  - cdev does not have, and cannot rely on vfio_groups. We enforce this
>    by compiling all the vfio_group infrastructure out. iommu_groups
>    continue to exist.
>    
>    So converting a cdev to a vfio_group is not an allowed operation.

My only statements in this respect were towards the notion that IOMMU
groups continue to exist.  I'm well aware of the desire to deprecate
and remove vfio groups.
 
>  - no-iommu should not have iommu_groups. We enforce this by compiling
>    out all the no-iommu vfio_group infrastructure.

This is not logically inferred from the above if IOMMU groups continue
to exist and continue to be a basis for describing DMA ownership as
well as "reset groups".

>  - cdev APIs should ideally not require the user to know the group_id,
>    we should try hard to design APIs to avoid this.

This is a nuance, group_id vs group, where it's been previously
discussed that users will need to continue to know the boundaries of a
group for the purpose of DMA isolation and potentially IOAS
independence should cdev/iommufd choose to tackle those topics.
 
> We have solved every other problem but reset like this, I would like
> to get past reset without compromising the above.

"These aren't the droids we're looking for."

What is the actual proposal here?  You've said that hot-reset works if
the iommufd_ctx has representation from each affected group, the INFO
ioctl remains as it is, which suggests that it's reporting group ID and
BDF, yet only sysfs tells the user the relation between a vfio cdev and
a group and we're trying to enable a pass-by-fd model for cdev where
the user has no reference to a sysfs node for the device.  Show me how
these pieces fit together.

OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
device has an IOMMU group, and there's an API to learn the group ID, the
solution becomes much more clear and no-iommu devices require no
special cases or restrictions.  Not only does the INFO ioctl remain the
same, but the hot-reset ioctl itself remains effectively the same,
accepting either vfio cdevs or groups.  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-11 21:58                                   ` Alex Williamson
@ 2023-04-12  0:01                                     ` Jason Gunthorpe
  2023-04-12  7:27                                       ` Tian, Kevin
                                                         ` (2 more replies)
  2023-04-12  7:14                                     ` Tian, Kevin
  1 sibling, 3 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-12  0:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:

> > Management tools already need to understand dev_set if they want to
> > offer reliable reset support to the VMs. Same as today.
> 
> I don't think that's true. Our primary hot-reset use case is GPUs and
> subordinate functions, where the isolation and reset scope are often
> sufficiently similar to make hot-reset possible, regardless whether
> all the functions are assigned to a VM.  I don't think you'll find any
> management tools that takes reset scope into account otherwise.

When I think of "reliable reset support" I think of the management
tool offering a checkbox that says "ensure PCI function reset
availability" and if checked it will not launch the VM without a
working reset.

If the user configures a set of VFIO devices and then hopes they get
working reset, that is fine, but doesn't require any reporting of
reset groups, or iommu groups to the management layer to work.

> > > As I understand the proposal, QEMU now gets to attempt to
> > > claim ownership of the dev_set, so it opportunistically extends its
> > > ownership and may block other users from the affected devices.  
> > 
> > We can decide the policy for the kernel to accept a claim. I suggested
> > below "same as today" - it must hold all the groups within the
> > iommufd_ctx.
> 
> It must hold all the groups [that the user doesn't know about because
> it's not a formal part of the cdev API] within the iommufd_ctx?

You keep going back to this, but I maintain userspace doesn't
care. qemu is given a list of VFIO devices to use, all it wants to
know is if it is allowed to use reset or not. Why should it need to
know groups and group_ids to get that binary signal out of the kernel?

> > The simplest option for no-iommu is to require it to pass in every
> > device fd to the reset ioctl.
> 
> Which ironically is exactly how it ends up working today, each no-iommu
> device has a fake IOMMU group, so every affected device (group) needs
> to be provided.

Sure, that is probably the way forward for no-iommu. Not that anyone
uses it..

The kicker is we don't force the user to generate a de-duplicated list
of device FDs, one per group, just because.

> > I want to re-focus on the basics of what cdev is supposed to be doing,
> > because several of the idea you suggested seem against this direction:
> > 
> >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> >    by compiling all the vfio_group infrastructure out. iommu_groups
> >    continue to exist.
> >    
> >    So converting a cdev to a vfio_group is not an allowed operation.
> 
> My only statements in this respect were towards the notion that IOMMU
> groups continue to exist.  I'm well aware of the desire to deprecate
> and remove vfio groups.

Yes

> >  - no-iommu should not have iommu_groups. We enforce this by compiling
> >    out all the no-iommu vfio_group infrastructure.
> 
> This is not logically inferred from the above if IOMMU groups continue
> to exist and continue to be a basis for describing DMA ownership as
> well as "reset groups"

It is not meant to flow out of the above; it is a separate statement. I
want the iommu_group mechanism to stop being abused outside the iommu
core code. The only thing that should be creating groups is an
attached iommu driver operating under ops->device_group().

VFIO needed this to support mdev and no-iommu. We already have mdev
free of iommu_groups, and I would like no-iommu to be free of them too;
we are very close.

That would leave POWER as the only abuser of the
iommu_group_add_device() API, and it is only doing it because it
hasn't got a proper iommu driver implementation yet. It turns out
their abuse is mislocked and maybe racy to boot :(

> >  - cdev APIs should ideally not require the user to know the group_id,
> >    we should try hard to design APIs to avoid this.
> 
> This is a nuance, group_id vs group, where it's been previously
> discussed that users will need to continue to know the boundaries of a
> group for the purpose of DMA isolation and potentially IOAS
> independence should cdev/iommufd choose to tackle those topics.

Yes, group_id is a value we have no specific use for and would require
userspace to keep separate track of. I'd prefer to rely on dev_id as
much as possible instead.

> What is the actual proposal here?

I don't know anymore, you don't seem to like this direction either...

> You've said that hot-reset works if the iommufd_ctx has
> representation from each affected group, the INFO ioctl remains as
> it is, which suggests that it's reporting group ID and BDF, yet only
> sysfs tells the user the relation between a vfio cdev and a group
> and we're trying to enable a pass-by-fd model for cdev where the
> user has no reference to a sysfs node for the device.  Show me how
> these pieces fit together.

I prefer the version where INFO2 returns the dev_id, but INFO can work
if we do the BDF cap like you suggested to Yi.

> OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> device has an IOMMU group

I don't desire every VFIO device to have an iommu_group. I want VFIO
devices with real IOMMU drivers to have an iommu_group. mdev and
no-iommu should not. I don't want to add them back into the design
just so INFO has a value to return.

I'd rather give no-iommu a dummy dev_id in iommufdctx than give it an
iommu_group...

I see this problem as a few basic requirements from a qemu-like
application:

 1) Does the configuration I was given support reset right now?
 2) Will the configuration I was given support reset for the duration
    of my execution?
 3) What groups of the devices I already have open does the reset
    affect?
 4) For debugging, report to the user the full list of devices in the
    reset group, in a way that relates back to sysfs.
 5) A way to trigger a reset on a group of devices

#1/#2 is the API I suggested here. Ask the kernel if the current
configuration works, and ask it to keep it working.

#3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

#4 is either INFO and print the BDFs or INFO2 reporting the struct
vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

#5 is adjusting the FD list in the existing RESET ioctl. Removing the
need for userspace to specify a minimal exact list of FDs means
userspace doesn't need the information to figure out what that list
actually is. Pass a 0 length list and use iommufdctx.

None of these requirements suggests to me that qemu needs to know the
group_id, or that it needs to have enough information to know how to
fix an unavailable reset.
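
The check behind #5 can be sketched roughly as follows. This models the
rules from the cover letter (unopened devices need no proof, a
zero-length list falls back to the iommufd context, a device fd list is
the fallback for no-iommu); it is not the actual ioctl code. The Device
type, its fields, and reset_permitted() are hypothetical names, and
devices are matched by name rather than by fd for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    name: str
    opened: bool = False
    ctx: Optional[str] = None  # iommufd context the device is bound to, if any


def reset_permitted(dev_set, caller_ctx, passed_fds=()):
    """Decide whether the caller has proven ownership of the reset group."""
    in_use = [d for d in dev_set if d.opened]  # unopened devices need no proof
    if passed_fds:
        # Explicit list: every in-use device must be covered by a passed fd.
        return all(d.name in passed_fds for d in in_use)
    # Zero-length list: every in-use device must be bound to the caller's ctx.
    return all(d.ctx is not None and d.ctx == caller_ctx for d in in_use)


group = [Device("fn0", True, "ctxA"), Device("fn1", True, "ctxA"), Device("fn2")]
print(reset_permitted(group, "ctxA"))  # all in-use devices share the caller's ctx
print(reset_permitted(group, "ctxB"))  # a different ctx proves nothing
```

In this model the caller never has to compute a minimal list of fds or
learn a group_id; it only gets a yes or no answer.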

Did I miss a requirement here?

Regards,
Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-11 21:58                                   ` Alex Williamson
  2023-04-12  0:01                                     ` Jason Gunthorpe
@ 2023-04-12  7:14                                     ` Tian, Kevin
  1 sibling, 0 replies; 145+ messages in thread
From: Tian, Kevin @ 2023-04-12  7:14 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 12, 2023 5:58 AM
> 
> On Tue, 11 Apr 2023 15:40:07 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Apr 11, 2023 at 11:11:17AM -0600, Alex Williamson wrote:
> > > [Appears the list got dropped, replying to my previous message to re-add]
> >
> > Wowo this got mesed up alot, mutt drops the cc when replying for some
> > reason. I think it is fixed up now
> >
> > > > Our cdev model says that opening a cdev locks out other cdevs from
> > > > independent use, eg because of the group sharing. Extending this to
> > > > include the reset group as well seems consistent.
> > >
> > > The DMA ownership model based on the IOMMU group is consistent with
> > > legacy vfio, but now you're proposing a new ownership model that
> > > optionally allows a user to extend their ownership, opportunistically
> > > lock out other users, and wreaking havoc for management utilities that
> > > also have no insight into dev_sets or userspace driver behavior.
> >
> > I suggested below that the owership require enough open devices - so
> > it doesn't "extend ownership opportunistically", and there is no
> > havoc.
> >
> > Management tools already need to understand dev_set if they want to
> > offer reliable reset support to the VMs. Same as today.
> 
> I don't think that's true.  Our primary hot-reset use case is GPUs and
> subordinate functions, where the isolation and reset scope are often
> sufficiently similar to make hot-reset possible, regardless whether
> all the functions are assigned to a VM.  I don't think you'll find any
> management tools that takes reset scope into account otherwise.

If we only care about the primary case where the iommu group and reset
scope match, then why would the new claim model in Jason's proposal
urge the management tools to understand the reset scope now?

btw in your earlier replies you pointed out the issue of unpredictable
ordering on a multi-function device, e.g. whichever of dpdk or qemu
runs first will block the other. But I wonder what the actual use is of
allowing both to run when neither can do a reset due to the affected
reset scope in the current model.

If a vfio user cannot do a reset, doesn't that imply it hasn't acquired
full permission on the device? If so, isn't Jason's proposal of
explicitly failing it actually a cleaner model?

Thanks
Kevin


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12  0:01                                     ` Jason Gunthorpe
@ 2023-04-12  7:27                                       ` Tian, Kevin
  2023-04-12 15:05                                         ` Jason Gunthorpe
  2023-04-12 10:09                                       ` Liu, Yi L
  2023-04-12 16:50                                       ` Alex Williamson
  2 siblings, 1 reply; 145+ messages in thread
From: Tian, Kevin @ 2023-04-12  7:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: kvm, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, mjrosato, lulu,
	Jiang, Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe
> Sent: Wednesday, April 12, 2023 8:01 AM
> 
> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     effect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) Away to trigger a reset on a group of devices
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.
> 
> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

mdev doesn't have a BDF. Of course it doesn't support hot_reset either.

But it's presented to userspace as a PCI device. Is it weird for a PCI
device not to provide a BDF cap?

From this point of view the vfio_device IDR# sounds more generic.

> 
> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.
> 
> None of these requirements suggests to me that qemu needs to know the
> group_id, or that it needs to have enough information to know how to
> fix an unavailable reset.
> 


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12  0:01                                     ` Jason Gunthorpe
  2023-04-12  7:27                                       ` Tian, Kevin
@ 2023-04-12 10:09                                       ` Liu, Yi L
  2023-04-12 16:54                                         ` Alex Williamson
  2023-04-12 16:50                                       ` Alex Williamson
  2 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-12 10:09 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 12, 2023 8:01 AM
> 
> On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> 
> > > Management tools already need to understand dev_set if they want to
> > > offer reliable reset support to the VMs. Same as today.
> >
> > I don't think that's true. Our primary hot-reset use case is GPUs and
> > subordinate functions, where the isolation and reset scope are often
> > sufficiently similar to make hot-reset possible, regardless whether
> > all the functions are assigned to a VM.  I don't think you'll find any
> > management tools that takes reset scope into account otherwise.
> 
> When I think of "reliable reset support" I think of the management
> tool offering a checkbox that says "ensure PCI function reset
> availability" and if checked it will not launch the VM without a
> working reset.
> 
> If the user configures a set of VFIO devices and then hopes they get
> working reset, that is fine, but doesn't require any reporting of
> reset groups, or iommu groups to the management layer to work.
> 
> > > > As I understand the proposal, QEMU now gets to attempt to
> > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > ownership and may block other users from the affected devices.
> > >
> > > We can decide the policy for the kernel to accept a claim. I suggested
> > > below "same as today" - it must hold all the groups within the
> > > iommufd_ctx.
> >
> > It must hold all the groups [that the user doesn't know about because
> > it's not a formal part of the cdev API] within the iommufd_ctx?
> 
> You keep going back to this, but I maintain userspace doesn't
> care. qemu is given a list of VFIO devices to use, all it wants to
> know is if it is allowed to use reset or not. Why should it need to
> know groups and group_ids to get that binary signal out of the kernel?
> 
> > > The simplest option for no-iommu is to require it to pass in every
> > > device fd to the reset ioctl.
> >
> > Which ironically is exactly how it ends up working today, each no-iommu
> > device has a fake IOMMU group, so every affected device (group) needs
> > to be provided.
> 
> Sure, that is probably the way forward for no-iommu. Not that anyone
> uses it..
> 
> The kicker is we don't force the user to generate a de-duplicated list
> of devices FDs, one per group, just because.
> 
> > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > because several of the idea you suggested seem against this direction:
> > >
> > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > >    continue to exist.
> > >
> > >    So converting a cdev to a vfio_group is not an allowed operation.
> >
> > My only statements in this respect were towards the notion that IOMMU
> > groups continue to exist.  I'm well aware of the desire to deprecate
> > and remove vfio groups.
> 
> Yes
> 
> > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > >    out all the no-iommu vfio_group infrastructure.
> >
> > This is not logically inferred from the above if IOMMU groups continue
> > to exist and continue to be a basis for describing DMA ownership as
> > well as "reset groups"
> 
> It is not ment to flow out of the above, it is a seperate statement. I
> want the iommu_group mechanism to stop being abused outside the iommu
> core code. The only thing that should be creating groups is an
> attached iommu driver operating under ops->device_group().
> 
> VFIO needed this to support mdev and no-iommu. We already have mdev
> free of iommu_groups, I would like no-iommu to also be free of it too,
> we are very close.
> 
> That would leave POWER as the only abuser of the
> iommu_group_add_device() API, and it is only doing it because it
> hasn't got a proper iommu driver implementation yet. It turns out
> their abuse is mislocked and maybe racy to boot :(
> 
> > >  - cdev APIs should ideally not require the user to know the group_id,
> > >    we should try hard to design APIs to avoid this.
> >
> > This is a nuance, group_id vs group, where it's been previously
> > discussed that users will need to continue to know the boundaries of a
> > group for the purpose of DMA isolation and potentially IOAS
> > independence should cdev/iommufd choose to tackle those topics.
> 
> Yes, group_id is a value we have no specific use for and would require
> userspace to keep seperate track of. I'd prefer to rely on dev_id as
> much as possible instead.
> 
> > What is the actual proposal here?
> 
> I don't know anymore, you don't seem to like this direction either...
> 
> > You've said that hot-reset works if the iommufd_ctx has
> > representation from each affected group, the INFO ioctl remains as
> > it is, which suggests that it's reporting group ID and BDF, yet only
> > sysfs tells the user the relation between a vfio cdev and a group
> > and we're trying to enable a pass-by-fd model for cdev where the
> > user has no reference to a sysfs node for the device.  Show me how
> > these pieces fit together.
> 
> I prefer the version where INFO2 returns the dev_id, but info can work
> if we do the BDF cap like you suggested to Yi
> 
> > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > device has an IOMMU group
> 
> I don't desire every VFIO device to have an iommu_group. I want VFIO
> devices with real IOMMU drivers to have an iommu_group. mdev and
> no-iommu should not. I don't want to add them back into the design
> just so INFO has a value to return.
> 
> I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an
> iommu_group...
> 
> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     effect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) Away to trigger a reset on a group of devices
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.
> 
> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

I hope we can have a clear statement on the _INFO or INFO2 usage.
Today, per QEMU's implementation, the output of _INFO is used to:

1) do a self-check to see if all the affected groups are opened by the
    current user before it can invoke hot-reset.
2) figure out the devices that are already opened by the user. Such a
    device may already be in use, so QEMU needs to save its state
    before the hot-reset and restore it afterwards.

Seems like we are relaxing the self-check as it may be done by locking
the reset group. Is that right?
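
For illustration only, the two uses of the _INFO output listed above
could be sketched roughly as below. This is not QEMU's code: struct
dep_dev is a local mirror of struct vfio_pci_dependent_device from
<linux/vfio.h>, while struct opened_dev, first_unowned() and
mark_opened() are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct vfio_pci_dependent_device in <linux/vfio.h>. */
struct dep_dev {
    uint32_t group_id;
    uint16_t segment;
    uint8_t  bus;
    uint8_t  devfn;
};

/* Hypothetical record of a device this user already has open. */
struct opened_dev {
    uint16_t segment;
    uint8_t  bus;
    uint8_t  devfn;
};

/*
 * Use 1, the self-check: return -1 if every affected group is found in
 * owned_groups[], else the index of the first entry whose group is not.
 */
static int first_unowned(const struct dep_dev *devs, size_t n_devs,
                         const uint32_t *owned_groups, size_t n_owned)
{
    assert(devs != NULL || n_devs == 0);
    for (size_t i = 0; i < n_devs; i++) {
        bool found = false;

        for (size_t j = 0; j < n_owned; j++)
            if (owned_groups[j] == devs[i].group_id)
                found = true;
        if (!found)
            return (int)i;
    }
    return -1;
}

/*
 * Use 2: set needs_save[i] for each affected entry that matches a device
 * the user already opened, and return how many entries matched.
 */
static size_t mark_opened(const struct dep_dev *devs, size_t n_devs,
                          const struct opened_dev *opened, size_t n_opened,
                          bool *needs_save)
{
    size_t hits = 0;

    assert(needs_save != NULL || n_devs == 0);
    for (size_t i = 0; i < n_devs; i++) {
        needs_save[i] = false;
        for (size_t j = 0; j < n_opened; j++) {
            if (opened[j].segment == devs[i].segment &&
                opened[j].bus == devs[i].bus &&
                opened[j].devfn == devs[i].devfn) {
                needs_save[i] = true;
                hits++;
                break;
            }
        }
    }
    return hits;
}
```

The index returned by first_unowned() is what lets the caller name the
offending group in an error message; a plain yes/no answer from the
kernel would not carry that.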

> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.

If the reset group is locked, there seems to be no need to check iommufdctx.

Thanks,
Yi Liu



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12  7:27                                       ` Tian, Kevin
@ 2023-04-12 15:05                                         ` Jason Gunthorpe
  2023-04-12 17:01                                           ` Alex Williamson
  2023-04-13  2:57                                           ` Tian, Kevin
  0 siblings, 2 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-12 15:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, mjrosato, lulu,
	Jiang, Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Wednesday, April 12, 2023 8:01 AM
> > 
> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> >     of my execution?
> >  3) What groups of the devices I already have open does the reset
> >     effect?
> >  4) For debugging, report to the user the full list of devices in the
> >     reset group, in a way that relates back to sysfs.
> >  5) Away to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> > 
> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > 
> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> 
> mdev doesn't have BDF. Of course it doesn't support hot_reset either.

It should support a reset.. Maybe idxd doesn't, but it should be part
of the SIOV model. Our SIOV devices would need it for instance.

> but it's presented to userspace as a pci device. Is it weird for a pci
> device which doesn't provide a BDF cap?

It is weird for a PCI device, but it is not weird for a VFIO
device. Leaking the physical labels out of the uAPI is not clean,
IMHO.

> from this point the vfio_device IDR# sounds more generic.

Yes, I was thinking about this for the SIOV model.

Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12  0:01                                     ` Jason Gunthorpe
  2023-04-12  7:27                                       ` Tian, Kevin
  2023-04-12 10:09                                       ` Liu, Yi L
@ 2023-04-12 16:50                                       ` Alex Williamson
  2023-04-12 20:06                                         ` Jason Gunthorpe
  2 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-12 16:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, 11 Apr 2023 21:01:06 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> 
> > > Management tools already need to understand dev_set if they want to
> > > offer reliable reset support to the VMs. Same as today.  
> > 
> > I don't think that's true. Our primary hot-reset use case is GPUs and
> > subordinate functions, where the isolation and reset scope are often
> > sufficiently similar to make hot-reset possible, regardless whether
> > all the functions are assigned to a VM.  I don't think you'll find any
> > management tools that takes reset scope into account otherwise.  
> 
> When I think of "reliable reset support" I think of the management
> tool offering a checkbox that says "ensure PCI function reset
> availability" and if checked it will not launch the VM without a
> working reset.

This doesn't exist.

> If the user configures a set of VFIO devices and then hopes they get
> working reset, that is fine, but doesn't require any reporting of
> reset groups, or iommu groups to the management layer to work.

I think there's more than hope involved here, there are recipes to
create working hot-reset configurations because it is well specified
and predictable currently.  QEMU can indicate whether hot-reset is
available thanks to the information provided in the INFO ioctl and a VM
that owns the necessary set of groups may consistently and repeatedly
perform hot-resets.

> > > > As I understand the proposal, QEMU now gets to attempt to
> > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > ownership and may block other users from the affected devices.    
> > > 
> > > We can decide the policy for the kernel to accept a claim. I suggested
> > > below "same as today" - it must hold all the groups within the
> > > iommufd_ctx.  
> > 
> > It must hold all the groups [that the user doesn't know about because
> > it's not a formal part of the cdev API] within the iommufd_ctx?  
> 
> You keep going back to this, but I maintain userspace doesn't
> care. qemu is given a list of VFIO devices to use, all it wants to
> know is if it is allowed to use reset or not. Why should it need to
> know groups and group_ids to get that binary signal out of the kernel?

hw/vfio/pci.c:2320
        error_report("vfio: Cannot reset device %s, "
                     "depends on group %d which is not owned.",
                     vdev->vbasedev.name, devices[i].group_id);

That creates a feedback loop where a user can take corrective action
with actual information in hand to resolve the issue.

> > > The simplest option for no-iommu is to require it to pass in every
> > > device fd to the reset ioctl.  
> > 
> > Which ironically is exactly how it ends up working today, each no-iommu
> > device has a fake IOMMU group, so every affected device (group) needs
> > to be provided.  
> 
> Sure, that is probably the way forward for no-iommu. Not that anyone
> uses it..
> 
> The kicker is we don't force the user to generate a de-duplicated list
> of devices FDs, one per group, just because.

So on one hand you're asking for simplicity, but on the other you're
criticizing a trivial simplification: we chose to allow the user to
pass a number of group fds equal to the number of devices affected so
that the user doesn't need to take that step to de-duplicate the list.
We can't win.
 
> > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > because several of the idea you suggested seem against this direction:
> > > 
> > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > >    continue to exist.
> > >    
> > >    So converting a cdev to a vfio_group is not an allowed operation.  
> > 
> > My only statements in this respect were towards the notion that IOMMU
> > groups continue to exist.  I'm well aware of the desire to deprecate
> > and remove vfio groups.  
> 
> Yes
> 
> > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > >    out all the no-iommu vfio_group infrastructure.  
> > 
> > This is not logically inferred from the above if IOMMU groups continue
> > to exist and continue to be a basis for describing DMA ownership as
> > well as "reset groups"  
> 
> It is not ment to flow out of the above, it is a seperate statement. I
> want the iommu_group mechanism to stop being abused outside the iommu
> core code. The only thing that should be creating groups is an
> attached iommu driver operating under ops->device_group().
> 
> VFIO needed this to support mdev and no-iommu. We already have mdev
> free of iommu_groups, I would like no-iommu to also be free of it too,
> we are very close.
> 
> That would leave POWER as the only abuser of the
> iommu_group_add_device() API, and it is only doing it because it
> hasn't got a proper iommu driver implementation yet. It turns out
> their abuse is mislocked and maybe racy to boot :(
> 
> > >  - cdev APIs should ideally not require the user to know the group_id,
> > >    we should try hard to design APIs to avoid this.  
> > 
> > This is a nuance, group_id vs group, where it's been previously
> > discussed that users will need to continue to know the boundaries of a
> > group for the purpose of DMA isolation and potentially IOAS
> > independence should cdev/iommufd choose to tackle those topics.  
> 
> Yes, group_id is a value we have no specific use for and would require
> userspace to keep seperate track of. I'd prefer to rely on dev_id as
> much as possible instead.

But dev-id only has meaning in relation to an iommufd_ctx, so it fails
to be useful in the context of implied ownership.

> > What is the actual proposal here?  
> 
> I don't know anymore, you don't seem to like this direction either...
> 
> > You've said that hot-reset works if the iommufd_ctx has
> > representation from each affected group, the INFO ioctl remains as
> > it is, which suggests that it's reporting group ID and BDF, yet only
> > sysfs tells the user the relation between a vfio cdev and a group
> > and we're trying to enable a pass-by-fd model for cdev where the
> > user has no reference to a sysfs node for the device.  Show me how
> > these pieces fit together.  
> 
> I prefer the version where INFO2 returns the dev_id, but info can work
> if we do the BDF cap like you suggested to Yi

As discussed ad nauseam, dev-id is useless if an affected device is not
already within the iommufd ctx.  BDF provides a mapping to specific
affected devices, but can't express implied ownership.  Group id
provides the implied ownership, but can't express specific devices.  As
Yi has pointed out, QEMU needs to know both if it has ownership of all
the affected devices, both direct and implied, and which specific
devices that it owns are affected.

> > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > device has an IOMMU group  
> 
> I don't desire every VFIO device to have an iommu_group. I want VFIO
> devices with real IOMMU drivers to have an iommu_group. mdev and
> no-iommu should not. I don't want to add them back into the design
> just so INFO has a value to return.
> 
> I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an
> iommu_group...

It's not been shown to me that dev-id is a useful replacement for
anything here.

> I see this problem as a few basic requirements from a qemu-like
> application:
> 
>  1) Does the configuration I was given support reset right now?
>  2) Will the configuration I was given support reset for the duration
>     of my execution?
>  3) What groups of the devices I already have open does the reset
>     effect?
>  4) For debugging, report to the user the full list of devices in the
>     reset group, in a way that relates back to sysfs.
>  5) Away to trigger a reset on a group of devices
> 
> #1/#2 is the API I suggested here. Ask the kernel if the current
> configuration works, and ask it to keep it working.

That is super sketchy because you're also advocating for
opportunistically supporting reset if the instantaneous conditions
allow it (ex. unopened devices), and going back and forth on whether "ask
it to keep working" suggests that a user is able to extend their
granted ownership themselves.  I think both need to be based on some
form of granted, not requested, ownership and not opportunism.

> #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id

Where dev-id is useful for... ?  I think there's a misuse of "groups"
in 3) above, userspace needs to know specific devices affected, thus
BDF.
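
As a rough sketch of why BDF relates back to sysfs: the segment, bus
and devfn fields that INFO reports today can be turned into the name
used under /sys/bus/pci/devices/. The helper names format_bdf() and
bdf_matches() are made up for this example; the devfn packing (slot in
bits 7:3, function in bits 2:0) is the standard PCI encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format segment/bus/devfn as the name used under /sys/bus/pci/devices/.
 * devfn packs the slot in bits 7:3 and the function in bits 2:0.
 * Returns 0 on success, -1 if buf is too small.
 */
static int format_bdf(char *buf, size_t len,
                      uint16_t segment, uint8_t bus, uint8_t devfn)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "%04x:%02x:%02x.%x",
                 (unsigned)segment, (unsigned)bus,
                 (unsigned)(devfn >> 3), (unsigned)(devfn & 0x7));
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: does an affected entry refer to the named device? */
static int bdf_matches(const char *sysfs_name,
                       uint16_t segment, uint8_t bus, uint8_t devfn)
{
    char buf[16];

    if (format_bdf(buf, sizeof(buf), segment, bus, devfn) != 0)
        return 0;
    return strcmp(buf, sysfs_name) == 0;
}
```

This works whether or not the affected device is bound to vfio, since
the name depends only on where the device sits on the bus.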

> #4 is either INFO and print the BDFs or INFO2 reporting the struct
> vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).

We can't assume that all the affected devices are bound to vfio,
therefore we cannot assume a vfio_device IDR exists.

> #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> for userspace to specify a minimal exact list of FDs means userspace
> doesn't need the information to figure out what that list actually
> is. Pass a 0 length list and use iommufdctx.

"...doesn't need the information to figure out what the list actually
is."  That's false, userspace needs the information whether it uses it
to make a list or not, ex. pre- and post-reset processing of specific
affected devices.  Furthermore, supporting a zero length array removes
context from the existing ioctl, which has been shown to make it prone
to creating gaps in legacy group use cases, so I don't understand why
this optimization is so pervasive or important.
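
For reference, a sketch of what the two request shapes look like from
userspace. struct hot_reset_req is a local mirror of struct
vfio_pci_hot_reset from <linux/vfio.h>; build_hot_reset() is a
hypothetical helper, and the count == 0 case is only the form proposed
in this thread, not established behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of struct vfio_pci_hot_reset in <linux/vfio.h>. */
struct hot_reset_req {
    uint32_t argsz;
    uint32_t flags;
    uint32_t count;
    int32_t  group_fds[];
};

/*
 * Hypothetical helper: allocate a request carrying n_fds fds.
 * n_fds == 0 is the zero-length form discussed in this thread.
 * The caller frees the result; returns NULL on allocation failure.
 */
static struct hot_reset_req *build_hot_reset(const int32_t *fds,
                                             uint32_t n_fds)
{
    size_t sz = sizeof(struct hot_reset_req) +
                (size_t)n_fds * sizeof(int32_t);
    struct hot_reset_req *req;

    assert(fds != NULL || n_fds == 0);
    req = calloc(1, sz);
    if (!req)
        return NULL;
    req->argsz = (uint32_t)sz;   /* header plus the fd array, if any */
    req->count = n_fds;
    if (n_fds)
        memcpy(req->group_fds, fds, (size_t)n_fds * sizeof(int32_t));
    return req;
}
```

The resulting buffer would be handed to the VFIO_DEVICE_PCI_HOT_RESET
ioctl on the device fd. Note that the zero-length request carries no
per-device information at all, which is the loss of context objected
to above.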
 
> None of these requirements suggests to me that qemu needs to know the
> group_id, or that it needs to have enough information to know how to
> fix an unavailable reset.
> 
> Did I miss a requirement here?

So what is the exact proposal?  We can't have an INFO ioctl that simply
returns error if the ownership requirements are not met as that doesn't
support 4).  So we need one or more ioctls that a) indicates whether
the ownership requirements are met and b) indicates the set of affected
devices.  Is b) only the set of affected devices within the calling
device's iommufd_ctx (ie. dev-ids), in which case we need c) a way to
report the overall set of affected devices regardless of ownership in
support of 4), BDF?

Are we back to replacing group-ids with dev-ids in the INFO structure,
where an invalid dev-id either indicates an affected device with
implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
is meant to indicate the overall disposition based on the availability
of reset?  I'm not sure how that fully supports 4) since the user can't
determine if a given invalid dev-id is in fact a blocker, so do we end
up with multiple invalid IDs, perhaps one to indicate unknown but ok
and another to indicate an ownership gap?  Are devices outside of the
iommufd_ctx, but with implied ownership via group omitted entirely from
the lists?  I think we need an actual proposal here.  Thanks,

Alex
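
To make the question in the last paragraph concrete, here is a purely
hypothetical sketch of the "two invalid dev-ids" variant. Nothing in
this thread defines these values: DEVID_UNKNOWN_OK, DEVID_OWNERSHIP_GAP,
classify() and reset_available() are invented for illustration, and
the sketch assumes that neither sentinel can collide with a real
dev-id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinels; assumed never to be handed out as real dev-ids. */
#define DEVID_UNKNOWN_OK    ((uint32_t)0)          /* outside ctx, ownership implied */
#define DEVID_OWNERSHIP_GAP ((uint32_t)UINT32_MAX) /* outside ctx, not owned */

enum dev_state { DEV_IN_CTX, DEV_IMPLIED, DEV_BLOCKER };

/* Map one reported dev-id onto the three cases described above. */
static enum dev_state classify(uint32_t dev_id)
{
    if (dev_id == DEVID_OWNERSHIP_GAP)
        return DEV_BLOCKER;
    if (dev_id == DEVID_UNKNOWN_OK)
        return DEV_IMPLIED;
    return DEV_IN_CTX;
}

/*
 * Reset is available only if no entry is a blocker. On failure the
 * index of the first blocker is stored in *blocker; on success
 * *blocker is left untouched.
 */
static bool reset_available(const uint32_t *dev_ids, size_t n,
                            size_t *blocker)
{
    assert(dev_ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (classify(dev_ids[i]) == DEV_BLOCKER) {
            if (blocker)
                *blocker = i;
            return false;
        }
    }
    return true;
}
```

Even in this form the caller only learns the position of a blocker in
the list, not which physical device it is, so some other identifier
such as BDF would still be needed for requirement 4).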



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12 10:09                                       ` Liu, Yi L
@ 2023-04-12 16:54                                         ` Alex Williamson
  0 siblings, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-12 16:54 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, 12 Apr 2023 10:09:32 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, April 12, 2023 8:01 AM
> > 
> > On Tue, Apr 11, 2023 at 03:58:27PM -0600, Alex Williamson wrote:
> >   
> > > > Management tools already need to understand dev_set if they want to
> > > > offer reliable reset support to the VMs. Same as today.  
> > >
> > > I don't think that's true. Our primary hot-reset use case is GPUs and
> > > subordinate functions, where the isolation and reset scope are often
> > > sufficiently similar to make hot-reset possible, regardless whether
> > > all the functions are assigned to a VM.  I don't think you'll find any
> > > management tools that takes reset scope into account otherwise.  
> > 
> > When I think of "reliable reset support" I think of the management
> > tool offering a checkbox that says "ensure PCI function reset
> > availability" and if checked it will not launch the VM without a
> > working reset.
> > 
> > If the user configures a set of VFIO devices and then hopes they get
> > working reset, that is fine, but doesn't require any reporting of
> > reset groups, or iommu groups to the management layer to work.
> >   
> > > > > As I understand the proposal, QEMU now gets to attempt to
> > > > > claim ownership of the dev_set, so it opportunistically extends its
> > > > > ownership and may block other users from the affected devices.  
> > > >
> > > > We can decide the policy for the kernel to accept a claim. I suggested
> > > > below "same as today" - it must hold all the groups within the
> > > > iommufd_ctx.  
> > >
> > > It must hold all the groups [that the user doesn't know about because
> > > it's not a formal part of the cdev API] within the iommufd_ctx?  
> > 
> > You keep going back to this, but I maintain userspace doesn't
> > care. qemu is given a list of VFIO devices to use, all it wants to
> > know is if it is allowed to use reset or not. Why should it need to
> > know groups and group_ids to get that binary signal out of the kernel?
> >   
> > > > The simplest option for no-iommu is to require it to pass in every
> > > > device fd to the reset ioctl.  
> > >
> > > Which ironically is exactly how it ends up working today, each no-iommu
> > > device has a fake IOMMU group, so every affected device (group) needs
> > > to be provided.  
> > 
> > Sure, that is probably the way forward for no-iommu. Not that anyone
> > uses it..
> > 
> > The kicker is we don't force the user to generate a de-duplicated list
> > of devices FDs, one per group, just because.
> >   
> > > > I want to re-focus on the basics of what cdev is supposed to be doing,
> > > > because several of the idea you suggested seem against this direction:
> > > >
> > > >  - cdev does not have, and cannot rely on vfio_groups. We enforce this
> > > >    by compiling all the vfio_group infrastructure out. iommu_groups
> > > >    continue to exist.
> > > >
> > > >    So converting a cdev to a vfio_group is not an allowed operation.  
> > >
> > > My only statements in this respect were towards the notion that IOMMU
> > > groups continue to exist.  I'm well aware of the desire to deprecate
> > > and remove vfio groups.  
> > 
> > Yes
> >   
> > > >  - no-iommu should not have iommu_groups. We enforce this by compiling
> > > >    out all the no-iommu vfio_group infrastructure.  
> > >
> > > This is not logically inferred from the above if IOMMU groups continue
> > > to exist and continue to be a basis for describing DMA ownership as
> > > well as "reset groups"  
> > 
> > It is not ment to flow out of the above, it is a seperate statement. I
> > want the iommu_group mechanism to stop being abused outside the iommu
> > core code. The only thing that should be creating groups is an
> > attached iommu driver operating under ops->device_group().
> > 
> > VFIO needed this to support mdev and no-iommu. We already have mdev
> > free of iommu_groups, I would like no-iommu to also be free of it too,
> > we are very close.
> > 
> > That would leave POWER as the only abuser of the
> > iommu_group_add_device() API, and it is only doing it because it
> > hasn't got a proper iommu driver implementation yet. It turns out
> > their abuse is mislocked and maybe racy to boot :(
> >   
> > > >  - cdev APIs should ideally not require the user to know the group_id,
> > > >    we should try hard to design APIs to avoid this.  
> > >
> > > This is a nuance, group_id vs group, where it's been previously
> > > discussed that users will need to continue to know the boundaries of a
> > > group for the purpose of DMA isolation and potentially IOAS
> > > independence should cdev/iommufd choose to tackle those topics.  
> > 
> > Yes, group_id is a value we have no specific use for and would require
> > userspace to keep seperate track of. I'd prefer to rely on dev_id as
> > much as possible instead.
> >   
> > > What is the actual proposal here?  
> > 
> > I don't know anymore, you don't seem to like this direction either...
> >   
> > > You've said that hot-reset works if the iommufd_ctx has
> > > representation from each affected group, the INFO ioctl remains as
> > > it is, which suggests that it's reporting group ID and BDF, yet only
> > > sysfs tells the user the relation between a vfio cdev and a group
> > > and we're trying to enable a pass-by-fd model for cdev where the
> > > user has no reference to a sysfs node for the device.  Show me how
> > > these pieces fit together.  
> > 
> > I prefer the version where INFO2 returns the dev_id, but info can work
> > if we do the BDF cap like you suggested to Yi
> >   
> > > OTOH, if we say IOMMU groups continue to exist [agreed], every vfio
> > > device has an IOMMU group  
> > 
> > I don't desire every VFIO device to have an iommu_group. I want VFIO
> > devices with real IOMMU drivers to have an iommu_group. mdev and
> > no-iommu should not. I don't want to add them back into the design
> > just so INFO has a value to return.
> > 
> > I'd rather give no-iommu a dummy dev_id in iommufdctx then give it an
> > iommu_group...
> > 
> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> >     of my execution?
> >  3) What groups of the devices I already have open does the reset
> >     effect?
> >  4) For debugging, report to the user the full list of devices in the
> >     reset group, in a way that relates back to sysfs.
> >  5) Away to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> > 
> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > 
> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).  
> 
> I hope we can have a clear statement on the _INFO or INFO2 usage.
> Today, per QEMU's implementation, the output of _INFO is used to:
> 
> 1) do a self-check to see if all the affected groups are opened by the
>     current user before it can invoke hot-reset.
> 2) figure out the devices that are already opened by the user. QEMU
>     needs to save the state of such devices as the device may already
>     been in use. If so, its state should be saved and restored prior/post
>     the hot-reset.
> 
> Seems like we are relaxing the self-check as it may be done by locking
> the reset group. is it?

I hope not.  Locking the reset group suggests the user is able to
extend their ownership.  IMO we should not allow that.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12 15:05                                         ` Jason Gunthorpe
@ 2023-04-12 17:01                                           ` Alex Williamson
  2023-04-13  2:57                                           ` Tian, Kevin
  1 sibling, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-12 17:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	 Terrence, chao.p.peng, linux-s390, Liu, Yi L, mjrosato, lulu,
	Jiang, Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, 12 Apr 2023 12:05:50 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, April 12, 2023 8:01 AM
> > > 
> > > I see this problem as a few basic requirements from a qemu-like
> > > application:
> > > 
> > >  1) Does the configuration I was given support reset right now?
> > >  2) Will the configuration I was given support reset for the duration
> > >     of my execution?
> > >  3) What groups of the devices I already have open does the reset
> > >     effect?
> > >  4) For debugging, report to the user the full list of devices in the
> > >     reset group, in a way that relates back to sysfs.
> > >  5) Away to trigger a reset on a group of devices
> > > 
> > > #1/#2 is the API I suggested here. Ask the kernel if the current
> > > configuration works, and ask it to keep it working.
> > > 
> > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > > 
> > > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).  
> > 
> > mdev doesn't have BDF. Of course it doesn't support hot_reset either.  
> 
> It should support a reset.. Maybe idxd doesn't, but it should be part
> of the SIOV model. Our SIOV devices would need it for instance.

IIRC we require mdev devices to support VFIO_DEVICE_RESET, hot-reset is
a different beast.  I assume SIOV device support would also require
VFIO_DEVICE_RESET support and hot-reset would also be irrelevant to
them.

> > but it's presented to userspace as a pci device. Is it weird for a pci
> > device which doesn't provide a BDF cap?  
> 
> It is weird for a PCI device, but it is not weird for a VFIO
> device. Leaking the physical labels out of the uAPI is not clean,
> IMHO.
> 
> > from this point the vfio_device IDR# sounds more generic.  
> 
> Yes, I was thinking about this for the SIOV model.

Seems like we're off on a tangent; the hot-reset ioctl is not relevant
to devices simply because they expose a vfio-pci API. There is an
underlying hardware aspect that anything that is only virtualizing a
vfio-pci API shouldn't be concerned with.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12 16:50                                       ` Alex Williamson
@ 2023-04-12 20:06                                         ` Jason Gunthorpe
  2023-04-13  8:25                                           ` Tian, Kevin
  0 siblings, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-12 20:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, Apr 12, 2023 at 10:50:45AM -0600, Alex Williamson wrote:

> > You keep going back to this, but I maintain userspace doesn't
> > care. qemu is given a list of VFIO devices to use, all it wants to
> > know is if it is allowed to use reset or not. Why should it need to
> > know groups and group_ids to get that binary signal out of the kernel?
> 
> hw/vfio/pci.c:2320
>         error_report("vfio: Cannot reset device %s, "
>                      "depends on group %d which is not owned.",
>                      vdev->vbasedev.name, devices[i].group_id);
> 
> That creates a feedback loop where a user can take corrective action
> with actual information in hand to resolve the issue.

Which is why I listed debugging as requirement #4, and solve
requirement #4 by using the existing INFO and printing the BDF list it
returns.

> > The kicker is we don't force the user to generate a de-duplicated list
> > of devices FDs, one per group, just because.
> 
> So on one hand you're asking for simplicity, but on the other you're
> criticizing a trivial simplification that we chose to allow the user to
> pass number of group fds equal to number of devices affected so that
> the user doesn't need to take that step to de-duplicate the list.  We
> can't win.

It is not a simplification because the kernel is wired to accept only
a list of exactly that group length, no more no less. It turns into a
pointless puzzle that userspace has to solve, and it can only solve it
by knowing about groups.

If we get rid of groups we have to do something about this so
userspace doesn't need to do the calculation. That is the point of
this change.

> > > You've said that hot-reset works if the iommufd_ctx has
> > > representation from each affected group, the INFO ioctl remains as
> > > it is, which suggests that it's reporting group ID and BDF, yet only
> > > sysfs tells the user the relation between a vfio cdev and a group
> > > and we're trying to enable a pass-by-fd model for cdev where the
> > > user has no reference to a sysfs node for the device.  Show me how
> > > these pieces fit together.  
> > 
> > I prefer the version where INFO2 returns the dev_id, but info can work
> > if we do the BDF cap like you suggested to Yi
> 
> As discussed ad nauseam, dev-id is useless if an affected device is not
> already within the iommufd ctx.  

The purpose of INFO2 is to satisfy requirement #3 - which is to report
the affected devices *that are already opened*. For this dev_id is
fine. There is nothing qemu can do with devices that are outside its
iommufdctx, so it is pointless to tell it about them. It will generate
the debug print of #4 using INFO. I don't think we need one API here.

> > I see this problem as a few basic requirements from a qemu-like
> > application:
> > 
> >  1) Does the configuration I was given support reset right now?
> >  2) Will the configuration I was given support reset for the duration
> >     of my execution?
> >  3) What groups of the devices I already have open does the reset
> >     effect?
> >  4) For debugging, report to the user the full list of devices in the
> >     reset group, in a way that relates back to sysfs.
> >  5) Away to trigger a reset on a group of devices
> > 
> > #1/#2 is the API I suggested here. Ask the kernel if the current
> > configuration works, and ask it to keep it working.
> 
> That is super sketchy because you're also advocating for
> opportunistically supporting reset if the instantaneous conditions
> allow is (ex. unopened devices), and going back and forth whether "ask
> it to keep working" suggests that a user is able to extend their
> granted ownership themselves.  I think both needs to be based on some
> form of granted, not requested, ownership and not opportunism.

Ok, let's give up on ownership then

> > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> 
> Where dev-id is useful for... ?  I think there's a misuse of "groups"
> in 3) above, userspace needs to know specific devices affected, thus
> BDF.

I did not mean "group of devices" to mean iommu_group, I mean "the set of
devices affected by the reset"

> > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> 
> We can't assume that all the affected devices are bound to vfio,
> therefore we cannot assume a vfio_device IDR exists.

So BDF is better for the debugging print.

> > #5 is adjusting the FD list in existing RESET ioctl. Remove the need
> > for userspace to specify a minimal exact list of FDs means userspace
> > doesn't need the information to figure out what that list actually
> > is. Pass a 0 length list and use iommufdctx.
> 
> "...doesn't need the information to figure out what the list actually
> is."  That's false, userspace needs the information whether it uses it
> to make a list or not,

#3 is the need of affected devices, it is already covered.

I mean that #5 should not need this, #5 is only about triggering the
reset.

What I want is a #5 action that does not require doing a calculation on
group IDs.

At the core, without any notion of groups, #5 requires userspace to
pass in every opened device FD and kernel checks that every opened
device is in the passed FD list. Closed devices are ignored. Devices
with unattached drivers are ignored.
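
The check described in the paragraph above could be sketched roughly as
follows. This is a plain C model, not kernel code: struct reset_dev,
fd_in_list() and reset_allowed() are hypothetical names, and comparing
raw FD numbers stands in for whatever device identity the kernel would
really compare.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in the reset set (not a kernel struct). */
struct reset_dev {
	int  fd;            /* FD held by the opener, -1 if not opened */
	bool opened;        /* some user currently has the device open */
	bool driver_bound;  /* a driver is attached to the device */
};

/* True if fd appears in the caller-supplied FD list. */
static bool fd_in_list(int fd, const int *fds, size_t nr_fds)
{
	for (size_t i = 0; i < nr_fds; i++)
		if (fds[i] == fd)
			return true;
	return false;
}

/*
 * Allow the reset only if every opened device in the reset set is
 * covered by the passed FD list. Closed devices and devices without
 * an attached driver are skipped.
 */
static bool reset_allowed(const struct reset_dev *devs, size_t nr_devs,
			  const int *fds, size_t nr_fds)
{
	for (size_t i = 0; i < nr_devs; i++) {
		if (!devs[i].driver_bound || !devs[i].opened)
			continue;
		if (!fd_in_list(devs[i].fd, fds, nr_fds))
			return false;	/* opened, but not proven by the caller */
	}
	return true;
}
```

Note that nothing in this sketch needs a group ID; the only input is
the FD list and the open/bound state of each affected device.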

#5 does not need the answer to requirement #2.

> So we need one or more ioctls that a) indicates whether
> the ownership requirements are met 

If we reject the ownership direction, then I go back to suggesting
that INFO2 should do this.

> b) indicates the set of affected
> devices.  

INFO2 will return the dev_id which is sufficient to satisfy
requirement #3

> Is b) only the set of affected devices within the calling
> devices iommufd_ctx (ie. dev-ids),

I vote yes

> in which case we need c) a way to
> report the overall set of affected devices regardless of ownership in
> support of 4), BDF?

Yes, continue to use INFO unmodified.
 
> Are we back to replacing group-ids with dev-ids in the INFO structure,
> where an invalid dev-id either indicates an affected device with
> implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> is meant to indicate the overall disposition based on the availability
> of reset?  

As you explore in the following this gets ugly. I prefer to keep INFO
unchanged and add INFO2.

So maybe we should make patches that look something like this, try to
come up with a workable INFO2 and squeeze no-iommu into it somehow.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12 15:05                                         ` Jason Gunthorpe
  2023-04-12 17:01                                           ` Alex Williamson
@ 2023-04-13  2:57                                           ` Tian, Kevin
  1 sibling, 0 replies; 145+ messages in thread
From: Tian, Kevin @ 2023-04-13  2:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, Hao,  Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, mjrosato, lulu,
	Jiang, Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, April 12, 2023 11:06 PM
> 
> On Wed, Apr 12, 2023 at 07:27:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Wednesday, April 12, 2023 8:01 AM
> > >
> > > I see this problem as a few basic requirements from a qemu-like
> > > application:
> > >
> > >  1) Does the configuration I was given support reset right now?
> > >  2) Will the configuration I was given support reset for the duration
> > >     of my execution?
> > >  3) What groups of the devices I already have open does the reset
> > >     effect?
> > >  4) For debugging, report to the user the full list of devices in the
> > >     reset group, in a way that relates back to sysfs.
> > >  5) Away to trigger a reset on a group of devices
> > >
> > > #1/#2 is the API I suggested here. Ask the kernel if the current
> > > configuration works, and ask it to keep it working.
> > >
> > > #3 is either INFO and a CAP for BDF or INFO2 reporting dev_id
> > >
> > > #4 is either INFO and print the BDFs or INFO2 reporting the struct
> > > vfio_device IDR # (eg /sys/class/vfio/vfioXXX/).
> >
> > mdev doesn't have BDF. Of course it doesn't support hot_reset either.
> 
> It should support a reset.. Maybe idxd doesn't, but it should be part
> of the SIOV model. Our SIOV devices would need it for instance.

yes, supporting VFIO_DEVICE_RESET is assumed. That is required by
the siov spec.

Then no need to support hot_reset.

> 
> > but it's presented to userspace as a pci device. Is it weird for a pci
> > device which doesn't provide a BDF cap?
> 
> It is weird for a PCI device, but it is not weird for a VFIO
> device. Leaking the physical labels out of the uAPI is not clean,
> IMHO.

yes. Reporting pasid is also incorrect since it's invisible to user.

> 
> > from this point the vfio_device IDR# sounds more generic.
> 
> Yes, I was thinking about this for the SIOV model.
> 
> Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-12 20:06                                         ` Jason Gunthorpe
@ 2023-04-13  8:25                                           ` Tian, Kevin
  2023-04-13 11:50                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Tian, Kevin @ 2023-04-13  8:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 13, 2023 4:07 AM
> 
> 
> > in which case we need c) a way to
> > report the overall set of affected devices regardless of ownership in
> > support of 4), BDF?
> 
> Yes, continue to use INFO unmodified.
> 
> > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > where an invalid dev-id either indicates an affected device with
> > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > is meant to indicate the overall disposition based on the availability
> > of reset?
> 
> As you explore in the following this gets ugly. I prefer to keep INFO
> unchanged and add INFO2.
> 

INFO needs a change when VFIO_GROUP is disabled. Now it assumes
a valid iommu group always exists:

vfio_pci_fill_devs()
{
	...
	iommu_group = iommu_group_get(&pdev->dev);
	if (!iommu_group)
		return -EPERM; /* Cannot reset non-isolated devices */
	...
}

Probably we need a special value e.g. -1 to represent noiommu case
given valid group ids are positive.

with that plus BDF cap, I'm curious what is the actual purpose of
INFO2 or why cannot requirement#3 reuse the information collected
via existing INFO?

For each opened device Qemu can find the related group id via
sysfs (if group exists) or an optional GROUP cap and use that id to
match the group id in INFO.

With VFIO_GROUP=y a noiommu device has a group id, so it is the same case.

With VFIO_GROUP=n a noiommu device just does an exact match based on BDF.

Either way the information returned by INFO is a superset of what is
needed to know the reset scope between opened devices.
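
A rough userspace sketch of the matching described above, assuming the
proposed -1 value marks a missing group. The struct and function names
here are illustrative stand-ins, not the uAPI definitions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_GROUP (-1)	/* assumed marker for "no iommu group" */

/* Local stand-in for one affected device reported by INFO. */
struct info_entry {
	int32_t  group_id;	/* NO_GROUP for noiommu with VFIO_GROUP=n */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

/* What userspace knows about a device it has opened. */
struct opened_dev {
	int32_t  group_id;	/* from sysfs or a GROUP cap, else NO_GROUP */
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

/*
 * True if the opened device falls within the reset scope reported by
 * INFO: match on group id when both sides have one, otherwise fall
 * back to an exact BDF match.
 */
static bool dev_in_reset_scope(const struct opened_dev *dev,
			       const struct info_entry *info, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (dev->group_id != NO_GROUP &&
		    info[i].group_id != NO_GROUP) {
			if (dev->group_id == info[i].group_id)
				return true;
		} else if (dev->segment == info[i].segment &&
			   dev->bus == info[i].bus &&
			   dev->devfn == info[i].devfn) {
			return true;
		}
	}
	return false;
}
```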

Thanks
Kevin

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-13  8:25                                           ` Tian, Kevin
@ 2023-04-13 11:50                                             ` Jason Gunthorpe
  2023-04-13 14:35                                               ` Liu, Yi L
  2023-04-13 18:07                                               ` Alex Williamson
  0 siblings, 2 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-13 11:50 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, April 13, 2023 4:07 AM
> > 
> > 
> > > in which case we need c) a way to
> > > report the overall set of affected devices regardless of ownership in
> > > support of 4), BDF?
> > 
> > Yes, continue to use INFO unmodified.
> > 
> > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > where an invalid dev-id either indicates an affected device with
> > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > is meant to indicate the overall disposition based on the availability
> > > of reset?
> > 
> > As you explore in the following this gets ugly. I prefer to keep INFO
> > unchanged and add INFO2.
> > 
> 
> INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> a valid iommu group always exists:
> 
> vfio_pci_fill_devs()
> {
> 	...
> 	iommu_group = iommu_group_get(&pdev->dev);
> 	if (!iommu_group)
> 		return -EPERM; /* Cannot reset non-isolated devices */
> 	...
> }

This can still work in an ugly way. With an INFO2 the only purpose of
INFO would be debugging, so if someone uses no-iommu with hot-reset
and misconfigures it, then the only downside is they don't get the
debugging print. But we know of nothing that uses this combination
anyhow.

> with that plus BDF cap, I'm curious what is the actual purpose of
> INFO2 or why cannot requirement#3 reuse the information collected
> via existing INFO?

It can - it is just more complicated for userspace to do it, it has to
extract and match the BDFs and then run some algorithm to determine if
the opened devices cover the right set of devices in the reset group,
and it has to have some special code for no-iommu.

Whereas INFO2 would return the dev_ids and a single yes/no if the right
set is present. The kernel runs the algorithm instead of userspace; it
seems more abstract this way.

Also, if we make iommufd return an 'ioas dev_id group' as well it
composes nicely that userspace just needs one translation from dev_id.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-13 11:50                                             ` Jason Gunthorpe
@ 2023-04-13 14:35                                               ` Liu, Yi L
  2023-04-13 14:41                                                 ` Jason Gunthorpe
  2023-04-13 18:07                                               ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-13 14:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, April 13, 2023 7:51 PM
> 
> On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 13, 2023 4:07 AM
> > >
> > >
> > > > in which case we need c) a way to
> > > > report the overall set of affected devices regardless of ownership in
> > > > support of 4), BDF?
> > >
> > > Yes, continue to use INFO unmodified.
> > >
> > > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > > where an invalid dev-id either indicates an affected device with
> > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > > is meant to indicate the overall disposition based on the availability
> > > > of reset?
> > >
> > > As you explore in the following this gets ugly. I prefer to keep INFO
> > > unchanged and add INFO2.
> > >
> >
> > INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> > a valid iommu group always exists:
> >
> > vfio_pci_fill_devs()
> > {
> > 	...
> > 	iommu_group = iommu_group_get(&pdev->dev);
> > 	if (!iommu_group)
> > 		return -EPERM; /* Cannot reset non-isolated devices */
> > 	...
> > }
> 
> This can still work in a ugly way. With a INFO2 the only purpose of
> INFO would be debugging, so if someone uses no-iommu, with hotreset
> and misconfigures it then the only downside is they don't get the
> debugging print. But we know of nothing that uses this combination
> anyhow..

Today, at least QEMU will not attempt a hot-reset if _INFO fails. I think
this check may need to be relaxed if we want _INFO to work when there is
no VFIO_GROUP (also no fake iommu_group).

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-13 14:35                                               ` Liu, Yi L
@ 2023-04-13 14:41                                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-13 14:41 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, Apr 13, 2023 at 02:35:57PM +0000, Liu, Yi L wrote:

> Today, at least QEMU will not go to do hot-reset if _INFO fails. I think
> this check may need to be relaxed if want _INFO work when there is
> no VFIO_GROUP (also no fake iommu_group).

Current qemu does not work if there is no VFIO_GROUP, so it doesn't
matter.

In cdev mode qemu should work differently, we can make the kernel
return -1 for group_id and qemu can ignore group_id for the debug
print, or we can just make it fail.

Given qemu doesn't, and can't, support no-iommu this is pretty fringe
stuff.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-13 11:50                                             ` Jason Gunthorpe
  2023-04-13 14:35                                               ` Liu, Yi L
@ 2023-04-13 18:07                                               ` Alex Williamson
  2023-04-14  9:11                                                 ` Tian, Kevin
  2023-04-17 14:05                                                 ` Jason Gunthorpe
  1 sibling, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-13 18:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, 13 Apr 2023 08:50:45 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Apr 13, 2023 at 08:25:52AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, April 13, 2023 4:07 AM
> > > 
> > >   
> > > > in which case we need c) a way to
> > > > report the overall set of affected devices regardless of ownership in
> > > > support of 4), BDF?  
> > > 
> > > Yes, continue to use INFO unmodified.
> > >   
> > > > Are we back to replacing group-ids with dev-ids in the INFO structure,
> > > > where an invalid dev-id either indicates an affected device with
> > > > implied ownership (ok) or a gap in ownership (bad) and a flag somewhere
> > > > is meant to indicate the overall disposition based on the availability
> > > > of reset?  
> > > 
> > > As you explore in the following this gets ugly. I prefer to keep INFO
> > > unchanged and add INFO2.
> > >   
> > 
> > INFO needs a change when VFIO_GROUP is disabled. Now it assumes
> > a valid iommu group always exists:
> > 
> > vfio_pci_fill_devs()
> > {
> > 	...
> > 	iommu_group = iommu_group_get(&pdev->dev);
> > 	if (!iommu_group)
> > 		return -EPERM; /* Cannot reset non-isolated devices */
> > 	...
> > }  
> 
> This can still work in a ugly way. With a INFO2 the only purpose of
> INFO would be debugging, so if someone uses no-iommu, with hotreset
> and misconfigures it then the only downside is they don't get the
> debugging print. But we know of nothing that uses this combination
> anyhow..
> 
> > with that plus BDF cap, I'm curious what is the actual purpose of
> > INFO2 or why cannot requirement#3 reuse the information collected
> > via existing INFO?  
> 
> It can - it is just more complicated for userspace to do it, it has to
> extract and match the BDFs and then run some algorithm to determine if
> the opened devices cover the right set of devices in the reset group,
> and it has to have some special code for no-iommu.
> 
> VS info2 would return the dev_id's and a single yes/no if the right
> set is present. Kernel runs the algorithm instead of userspace, it
> seems more abstract this way.
> 
> Also, if we make iommufd return a 'ioas dev_id group' as well it
> composes nicely that userspace just needs one translation from dev_id.


IIUC, the semantics we're proposing is that an INFO2 ioctl would return
success or failure indicating whether the user has sufficient ownership
of the affected devices, and in the success case returns an array of
affected dev-ids within the user's iommufd_ctx.  Unopened, affected
devices, are not reported via INFO2, and unopened, affected devices
outside the user's scope of ownership (ie. outside the owned IOMMU
group) will generate a failure condition.

As for the INFO ioctl, it's described as unchanged, which does raise
the question of what is reported for IOMMU groups and how does the
value there coherently relate to anything else in the cdev-exclusive
vfio API...

We had already iterated a proposal where the group-id is replaced with
a dev-id in the existing ioctl and a flag indicates when the return
value is a dev-id vs group-id.  This had a gap that userspace cannot
determine if a reset is available given this information since un-owned
devices report an invalid dev-id and userspace can't know if it has
implicit ownership.

It seems cleaner to me though that we could still re-use INFO in
a similar way, simply defining a new flag bit which is valid only in
the case of returning dev-ids and indicates if the reset is available.
Therefore in one ioctl, userspace knows if hot-reset is available
(based on a kernel determination) and can pull valid dev-ids from the
array to associate affected, owned devices, and still has the
equivalent information to know that one or more of the devices listed
with an invalid dev-id are preventing the hot-reset from being
available.
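
To make that concrete, userspace handling of such a reply might look
something like the sketch below. The flag names, the invalid dev-id
value and the structs are all placeholders for illustration; nothing
here is settled uAPI.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder flag bits and sentinel; not defined anywhere yet. */
#define HOT_RESET_FLAG_DEV_ID      (1u << 0)	/* entries carry dev-ids */
#define HOT_RESET_FLAG_RESETTABLE  (1u << 1)	/* kernel: reset available */
#define INVALID_DEV_ID             0u		/* not in caller's iommufd_ctx */

/* Stand-in for one affected device returned by INFO. */
struct info_dev {
	uint32_t dev_id;
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

struct info_result {
	bool   resettable;		/* kernel determination */
	size_t nr_with_dev_id;		/* affected and owned by the caller */
	size_t nr_without_dev_id;	/* implicitly owned, or blocking */
};

/* Digest one INFO reply; the resettable bit only counts with dev-ids. */
static struct info_result parse_info(uint32_t flags,
				     const struct info_dev *devs, size_t nr)
{
	struct info_result r = { false, 0, 0 };

	if (!(flags & HOT_RESET_FLAG_DEV_ID))
		return r;	/* group-id reply: flag has no meaning */

	r.resettable = (flags & HOT_RESET_FLAG_RESETTABLE) != 0;
	for (size_t i = 0; i < nr; i++) {
		if (devs[i].dev_id != INVALID_DEV_ID)
			r.nr_with_dev_id++;
		else
			r.nr_without_dev_id++;
	}
	return r;
}
```

When resettable is false and nr_without_dev_id is non-zero, the BDFs of
those entries are what userspace would print for the user.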

Is that an option?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-13 18:07                                               ` Alex Williamson
@ 2023-04-14  9:11                                                 ` Tian, Kevin
  2023-04-14 11:38                                                   ` Liu, Yi L
                                                                     ` (2 more replies)
  2023-04-17 14:05                                                 ` Jason Gunthorpe
  1 sibling, 3 replies; 145+ messages in thread
From: Tian, Kevin @ 2023-04-14  9:11 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, April 14, 2023 2:07 AM
> 
> We had already iterated a proposal where the group-id is replaced with
> a dev-id in the existing ioctl and a flag indicates when the return
> value is a dev-id vs group-id.  This had a gap that userspace cannot
> determine if a reset is available given this information since un-owned
> devices report an invalid dev-id and userspace can't know if it has
> implicit ownership.
> 
> It seems cleaner to me though that we would could still re-use INFO in
> a similar way, simply defining a new flag bit which is valid only in
> the case of returning dev-ids and indicates if the reset is available.
> Therefore in one ioctl, userspace knows if hot-reset is available
> (based on a kernel determination) and can pull valid dev-ids from the

So the kernel needs to compare the group id between devices with
valid dev-ids and devices with invalid dev-ids to decide the implicit
ownership. A noiommu device, which has no group_id when
VFIO_GROUP is off, is then resettable only if it has a valid dev_id.

The only corner case with this option is when a user mixes group
and cdev usages. iirc you mentioned it's a valid usage to be supported.
In that case the kernel doesn't have sufficient knowledge to judge
'resettable' as it doesn't know which groups are opened by this user.

Not sure whether we can leave it in an ugly way so INFO may not tell
'resettable' accurately in that weird scenario.

> array to associate affected, owned devices, and still has the
> equivalent information to know that one or more of the devices listed
> with an invalid dev-id are preventing the hot-reset from being
> available.
> 
> Is that an option?  Thanks,
> 

This works for me if the above corner case can be waived.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-14  9:11                                                 ` Tian, Kevin
@ 2023-04-14 11:38                                                   ` Liu, Yi L
  2023-04-14 17:10                                                     ` Alex Williamson
  2023-04-14 16:34                                                   ` Alex Williamson
  2023-04-17 13:39                                                   ` Jason Gunthorpe
  2 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-14 11:38 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson, Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Duan,  Zhenzhong, joro,
	nicolinc, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev,
	yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, Jiang, Yanting, robin.murphy

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, April 14, 2023 5:12 PM
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 14, 2023 2:07 AM
> >
> > We had already iterated a proposal where the group-id is replaced with
> > a dev-id in the existing ioctl and a flag indicates when the return
> > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > determine if a reset is available given this information since un-owned
> > devices report an invalid dev-id and userspace can't know if it has
> > implicit ownership.
>
> >
> > It seems cleaner to me though that we would could still re-use INFO in
> > a similar way, simply defining a new flag bit which is valid only in
> > the case of returning dev-ids and indicates if the reset is available.
> > Therefore in one ioctl, userspace knows if hot-reset is available
> > (based on a kernel determination) and can pull valid dev-ids from the

We need to confirm the meaning of the hot-reset available flag. I think
at least the two conditions below must be met to set this flag, although
setting it does not guarantee that hot-reset will succeed (it should
just be highly likely).

1) the dev_set is resettable (all affected devices are in the dev_set)
2) the affected devices are owned by the current user

Also, we need to assume that the two cases below are rare; if a user
encounters one, it is just bad luck for them. I think the existing
_INFO and hot-reset already make such an assumption, so cdev mode
can adopt it as well.

a) physical topology change (e.g. new devices plugged to affected slot)
b) an affected device is unbound from vfio

> So the kernel needs to compare the group id between devices with
> valid dev-ids and devices with invalid dev-ids to decide the implicit
> ownership. For noiommu device which has no group_id when
> VFIO_GROUP is off then it's resettable only if having a valid dev_id.

In cdev mode, a noiommu device doesn't have a dev_id as it is not
bound to a valid iommufd. So if VFIO_GROUP is off, we may never
allow hot-reset for noiommu devices. But we don't want to regress
noiommu devices. Perhaps we may define the usage of the resettable
flag like this:
1) if it is set, the user does not need to own all the affected devices,
    as some of them may be owned implicitly. The kernel should have
    checked it.
2) if the flag is not set, the user needs to check ownership
    by itself. It needs to own all the affected devices. If not, don't
    do hot-reset.

This way noiommu devices can still support hot-reset just as when
VFIO_GROUP is on. Because noiommu devices have fake groups, and such
groups are all singletons, checking that all affected devices are
opened by the user is the same as checking all affected groups.

> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.
>
> Not sure whether we can leave it in a ugly way so INFO may not tell
> 'resettable' accurately in that weird scenario.

This does not seem easy to support. If the above scenario is allowed,
there can be three cases that return an invalid dev_id:
1) devices not opened by the user but owned implicitly
2) devices not owned by the user
3) devices opened via group but owned by the user

The user would require more info to tell the above cases apart.

> > array to associate affected, owned devices, and still has the
> > equivalent information to know that one or more of the devices listed
> > with an invalid dev-id are preventing the hot-reset from being
> > available.
> >
> > Is that an option?  Thanks,
> >
> 
> This works for me if above corner case can be waived.

One side check, perhaps already confirmed in a prior email. @Alex, so
the reason for predicting hot-reset is to avoid the possible
vfio_pci_pre_reset(), which does heavy operations like stopping DMA and
copying config space. Is that right? Is there any other special reason?
Anyhow, this reason is enough for this prediction per my understanding.

Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-14  9:11                                                 ` Tian, Kevin
  2023-04-14 11:38                                                   ` Liu, Yi L
@ 2023-04-14 16:34                                                   ` Alex Williamson
  2023-04-17 13:39                                                   ` Jason Gunthorpe
  2 siblings, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-14 16:34 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Fri, 14 Apr 2023 09:11:30 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Friday, April 14, 2023 2:07 AM
> > 
> > We had already iterated a proposal where the group-id is replaced with
> > a dev-id in the existing ioctl and a flag indicates when the return
> > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > determine if a reset is available given this information since un-owned
> > devices report an invalid dev-id and userspace can't know if it has
> > implicit ownership.
> > 
> > It seems cleaner to me though that we would could still re-use INFO in
> > a similar way, simply defining a new flag bit which is valid only in
> > the case of returning dev-ids and indicates if the reset is available.
> > Therefore in one ioctl, userspace knows if hot-reset is available
> > (based on a kernel determination) and can pull valid dev-ids from the  
> 
> So the kernel needs to compare the group id between devices with
> valid dev-ids and devices with invalid dev-ids to decide the implicit
> ownership. For noiommu device which has no group_id when
> VFIO_GROUP is off then it's resettable only if having a valid dev_id.

With no-iommu and VFIO_GROUP on, each no-iommu device gets its own
group and the user must have ownership of each affected group, so
there's really no difference here.  Every affected no-iommu device must
be owned in either case.
 
> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.

So for example we might have a 2-function device, fn0 is opened via
cdev and part of an iommufd ctx and fn1 is opened via the group
interface and potentially bound to a type1 container context.

In the INFO/INFO2 proposal, the INFO ioctl would return an array
reporting the group and BDF for each function.  The INFO ioctl is
callable from either device (aiui).  The INFO2 ioctl would fail on the
group opened device because it doesn't have an iommufd_ctx.  When
called on the cdev opened device, INFO2 would fail because the dev-set
is not represented within the iommufd_ctx.  Is this right?

In my proposal, the INFO ioctl can also be called on either device.
When called on the cdev opened device, the return structure provides
dev-ids with a flag indicating such in the return structure.  The cdev
device has a valid dev-id, the group device invalid.  The
reset-available flag is clear because the kernel cannot infer ownership
of the group opened device.  When called on the group opened device, the
IOMMU group and BDF are returned for each device.

So both approaches have similar issues here, but I think there's an
advantage to the approach of extending INFO.  In that case, the user
still gets the dev-id of the affected cdev device and therefore could
build a hot-reset ioctl call using a combination of groupfds and
devicefds, even if the cdev opened devices are passed by fd.  Perhaps
it's obvious that the hot-reset device is itself affected by the reset,
but I think the example scenario could be extended to one where there
are multiple cdev opened devices and one or more group opened devices.
AIUI, the INFO2 proposal essentially only returns success if the
null-array approach is supported, ie. the kernel can infer the full
ownership of the dev-set.  However, I think we could still support a
proof-of-ownership based hot-reset with devicefds and groupfds provided
by the user.

I think what this means is that the flag we're exposing is not
"hot-reset available", but really whether the kernel can infer
ownership and the ownership conditions are satisfied.  Therefore it
essentially only flags the availability of the null-array interface
while the proof-of-ownership approach is always available.
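
As a rough illustration of the distinction above, the following is a
minimal C sketch of how the "ownership can be inferred" determination
might be modeled. It is not kernel code: struct model_dev,
group_owned_by_ctx() and null_array_available() are hypothetical names
invented for this example, and the rule for unopened devices (they must
share an IOMMU group with a device bound to the caller's iommufd_ctx) is
an assumption taken from the discussion in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in a dev-set; not a kernel structure. */
struct model_dev {
	bool opened;      /* device is currently opened by some user */
	int iommufd_ctx;  /* id of the bound iommufd ctx, -1 if none */
	int iommu_group;  /* IOMMU group id */
};

/* Is some opened device in IOMMU group @group bound to iommufd ctx @ctx? */
static bool group_owned_by_ctx(const struct model_dev *devs, size_t n,
			       int group, int ctx)
{
	for (size_t i = 0; i < n; i++)
		if (devs[i].opened && devs[i].iommufd_ctx == ctx &&
		    devs[i].iommu_group == group)
			return true;
	return false;
}

/*
 * The flag would be set only if every affected device is either bound to
 * the caller's ctx, or is unopened and implicitly owned through a shared
 * IOMMU group.  A group-opened device (iommufd_ctx == -1) clears it.
 */
bool null_array_available(const struct model_dev *devs, size_t n,
			  int caller_ctx)
{
	assert(devs != NULL || n == 0);

	if (caller_ctx < 0)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (devs[i].opened) {
			if (devs[i].iommufd_ctx != caller_ctx)
				return false;
		} else if (!group_owned_by_ctx(devs, n, devs[i].iommu_group,
					       caller_ctx)) {
			return false;
		}
	}
	return true;
}
```

In the 2-function example, fn1 is opened via the group interface and so
has no iommufd ctx in this model; the function returns false, matching
the "reset-available flag is clear" outcome, while a proof-of-ownership
reset would remain possible.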

> Not sure whether we can leave it in a ugly way so INFO may not tell
> 'resettable' accurately in that weird scenario.

Is it still ugly with the above design?  Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-14 11:38                                                   ` Liu, Yi L
@ 2023-04-14 17:10                                                     ` Alex Williamson
  2023-04-17  4:20                                                       ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-14 17:10 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Fri, 14 Apr 2023 11:38:24 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, April 14, 2023 5:12 PM
> >   
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, April 14, 2023 2:07 AM
> > >
> > > We had already iterated a proposal where the group-id is replaced with
> > > a dev-id in the existing ioctl and a flag indicates when the return
> > > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > > determine if a reset is available given this information since un-owned
> > > devices report an invalid dev-id and userspace can't know if it has
> > > implicit ownership.  
> >  
> > >
> > > It seems cleaner to me though that we would could still re-use INFO in
> > > a similar way, simply defining a new flag bit which is valid only in
> > > the case of returning dev-ids and indicates if the reset is available.
> > > Therefore in one ioctl, userspace knows if hot-reset is available
> > > (based on a kernel determination) and can pull valid dev-ids from the  
> 
> Need to confirm the meaning of hot-reset available flag. I think it
> should at least meet below two conditions to set this flag. Although
> it may not mean hot-reset is for sure to succeed. (but should be
> a high chance).
> 
> 1) dev_set is resettable (all affected device are in dev_set)
> 2) affected device are owned by the current user

Per thread with Kevin, ownership can't always be known by the kernel.
Beyond the group vs cdev discussion there, isn't it also possible
(though perhaps not recommended) that a user can have multiple iommufd
ctxs?  So I think 2) becomes "ownership of the affected dev-set can be
inferred from the iommufd_ctx of the calling device", iow, the
null-array calling model is available and the flag is redefined to
match.  Reset may still be available via the proof-of-ownership model.
 
> Also, we need to has assumption that below two cases are rare
> if user encounters it, it just bad luck for them. I think the existing
> _INFO and hot-reset already has such assumption. So cdev mode
> can adopt it as well.
> 
> a) physical topology change (e.g. new devices plugged to affected slot)
> b) an affected device is unbound from vfio

Yes, these are sufficiently rare that we can't do much about them.

> > So the kernel needs to compare the group id between devices with
> > valid dev-ids and devices with invalid dev-ids to decide the implicit
> > ownership. For noiommu device which has no group_id when
> > VFIO_GROUP is off then it's resettable only if having a valid dev_id.  
> 
> In cdev mode, noiommu device doesn't have dev_id as it is not
> bound to valid iommufd. So if VFIO_GROUP is off, we may never
> allow hot-reset for noiommu devices. But we don't want to have
> regression with noiommu devices. Perhaps we may define the usage
> of the resettable flag like this:
> 1) if it is set, user does not need to own all the affected devices as
>     some of them may have been owned implicitly. Kernel should have
>     checked it.
> 2) if the flag is not set, that means user needs to check ownership
>     by itself. It needs to own all the affected devices. If not, don't
>    do hot-reset.

Exactly, the flag essentially indicates that the null-array approach is
available, lack of the flag indicates proof-of-ownership is required.
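
A minimal userspace-side sketch of that rule, assuming two hypothetical
flag bits (HYP_INFO_FLAG_DEV_ID and HYP_INFO_FLAG_NULL_ARRAY are names
and values made up for illustration; nothing in this thread has settled
the actual uAPI encoding):

```c
#include <stdint.h>

/* Hypothetical flag bits; names and values are illustrative only. */
#define HYP_INFO_FLAG_DEV_ID     (1u << 0) /* entries carry dev-ids */
#define HYP_INFO_FLAG_NULL_ARRAY (1u << 1) /* kernel inferred ownership */

enum reset_method {
	RESET_NULL_ARRAY,          /* call hot-reset with a zero-length array */
	RESET_PROOF_OF_OWNERSHIP,  /* pass device/group fds to prove ownership */
};

/*
 * The inferred-ownership bit is only meaningful when dev-ids are being
 * returned, so both bits must be set to pick the null-array model.
 * Anything else falls back to proof-of-ownership.
 */
enum reset_method choose_reset_method(uint32_t info_flags)
{
	if ((info_flags & HYP_INFO_FLAG_DEV_ID) &&
	    (info_flags & HYP_INFO_FLAG_NULL_ARRAY))
		return RESET_NULL_ARRAY;
	return RESET_PROOF_OF_OWNERSHIP;
}
```

Falling back rather than failing reflects the point that reset may
still be available through the proof-of-ownership model when the flag
is clear.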
 
> This way we can still make noiommu devices support hot-reset
> just like VFIO_GROUP is on. Because noiommu devices have fake
> groups, such groups are all singleton. So checking all affected
> devices are opened by user is just same as check all affected
> groups.

Yep.

> > The only corner case with this option is when a user mixes group
> > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > In that case the kernel doesn't have sufficient knowledge to judge
> > 'resettable' as it doesn't know which groups are opened by this user.
> >
> > Not sure whether we can leave it in a ugly way so INFO may not tell
> > 'resettable' accurately in that weird scenario.  
> 
> This seems not easy to support. If above scenario is allowed there can be
> three cases that returns invalid dev_id.
> 1) devices not opened by user but owned implicitly

The cdev approach has a hard time with this in general; it has no way to
represent unopened devices, so any case where the nature of an unopened
device blocks reset on the dev-set is rather opaque to the user.

> 2) devices not owned by user

(and presumably not owned)  We still provide BDF.  Not much difference
from the group case here, being able to point to a BDF or group is
about all we can do.

> 3) devices opened via group but owned by user

I think this still works in the proof-of-ownership, passing fds to
hot-reset model.

> User would require more info to tell the above cases from each other.

Obviously we could be equivalent to the group model if IOMMU groups
were exposed for a device and all devices had IOMMU groups, but
reasons...

> > > array to associate affected, owned devices, and still has the
> > > equivalent information to know that one or more of the devices listed
> > > with an invalid dev-id are preventing the hot-reset from being
> > > available.
> > >
> > > Is that an option?  Thanks,
> > >  
> > 
> > This works for me if above corner case can be waived.  
> 
> One side check, perhaps already confirmed in prior email. @Alex, So
> the reason for the prediction of hot-reset is to avoid the possible
> vfio_pci_pre_reset() which does heavy operations like stop DMA and
> copy config space. Is it? Any other special reason? Anyhow, this reason
> is enough for this prediction per my understanding.

It's not clear to me what "prediction" is referring to.  As above, I
think we can redefine the reset-available flag I proposed to more
restrictively indicate that the null-array approach is available based
on the dev-set group in relation to the iommufd_ctx of the calling
device.  Prediction of the affected devices seems like basic
functionality to me, we can't assume the user's usage model, they must
be able to make a well informed decision regarding affected devices.
Thanks,

Alex



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-14 17:10                                                     ` Alex Williamson
@ 2023-04-17  4:20                                                       ` Liu, Yi L
  2023-04-17 19:01                                                         ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-17  4:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, April 15, 2023 1:11 AM
> 
> On Fri, 14 Apr 2023 11:38:24 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Friday, April 14, 2023 5:12 PM
> > >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Friday, April 14, 2023 2:07 AM
> > > >
> > > > We had already iterated a proposal where the group-id is replaced with
> > > > a dev-id in the existing ioctl and a flag indicates when the return
> > > > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > > > determine if a reset is available given this information since un-owned
> > > > devices report an invalid dev-id and userspace can't know if it has
> > > > implicit ownership.
> > >
> > > >
> > > > It seems cleaner to me though that we would could still re-use INFO in
> > > > a similar way, simply defining a new flag bit which is valid only in
> > > > the case of returning dev-ids and indicates if the reset is available.
> > > > Therefore in one ioctl, userspace knows if hot-reset is available
> > > > (based on a kernel determination) and can pull valid dev-ids from the
> >
> > Need to confirm the meaning of hot-reset available flag. I think it
> > should at least meet below two conditions to set this flag. Although
> > it may not mean hot-reset is for sure to succeed. (but should be
> > a high chance).
> >
> > 1) dev_set is resettable (all affected device are in dev_set)
> > 2) affected device are owned by the current user
> 
> Per thread with Kevin, ownership can't always be known by the kernel.
> Beyond the group vs cdev discussion there, isn't it also possible
> (though perhaps not recommended) that a user can have multiple iommufd
> ctxs?  So I think 2) becomes "ownership of the affected dev-set can be
> inferred from the iommufd_ctx of the calling device", iow, the
> null-array calling model is available and the flag is redefined to
> match.  Reset may still be available via the proof-of-ownership model.

Yes, if there are multiple iommufd ctxs, this shall fall back to using
the proof-of-ownership model.

> 
> > Also, we need to has assumption that below two cases are rare
> > if user encounters it, it just bad luck for them. I think the existing
> > _INFO and hot-reset already has such assumption. So cdev mode
> > can adopt it as well.
> >
> > a) physical topology change (e.g. new devices plugged to affected slot)
> > b) an affected device is unbound from vfio
> 
> Yes, these are sufficiently rare that we can't do much about them.
> 
> > > So the kernel needs to compare the group id between devices with
> > > valid dev-ids and devices with invalid dev-ids to decide the implicit
> > > ownership. For noiommu device which has no group_id when
> > > VFIO_GROUP is off then it's resettable only if having a valid dev_id.
> >
> > In cdev mode, noiommu device doesn't have dev_id as it is not
> > bound to valid iommufd. So if VFIO_GROUP is off, we may never
> > allow hot-reset for noiommu devices. But we don't want to have
> > regression with noiommu devices. Perhaps we may define the usage
> > of the resettable flag like this:
> > 1) if it is set, user does not need to own all the affected devices as
> >     some of them may have been owned implicitly. Kernel should have
> >     checked it.
> > 2) if the flag is not set, that means user needs to check ownership
> >     by itself. It needs to own all the affected devices. If not, don't
> >    do hot-reset.
> 
> Exactly, the flag essentially indicates that the null-array approach is
> available, lack of the flag indicates proof-of-ownership is required.
> 
> > This way we can still make noiommu devices support hot-reset
> > just like VFIO_GROUP is on. Because noiommu devices have fake
> > groups, such groups are all singleton. So checking all affected
> > devices are opened by user is just same as check all affected
> > groups.
> 
> Yep.
> 
> > > The only corner case with this option is when a user mixes group
> > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > In that case the kernel doesn't have sufficient knowledge to judge
> > > 'resettable' as it doesn't know which groups are opened by this user.
> > >
> > > Not sure whether we can leave it in a ugly way so INFO may not tell
> > > 'resettable' accurately in that weird scenario.
> >
> > This seems not easy to support. If above scenario is allowed there can be
> > three cases that returns invalid dev_id.
> > 1) devices not opened by user but owned implicitly
> 
> The cdev approach has a hard time with this in general, it has no way to
> represent unopened devices. so any case where the nature of an unopened
> device block reset on the dev-set is rather opaque to the user.
> 
> > 2) devices not owned by user
> 
> (and presumable not owned)  We still provide BDF.  Not much difference
> from the group case here, being able to point to a BDF or group is
> about all we can do.
> 
> > 3) devices opened via group but owned by user
> 
> I think this still works in the proof-of-ownership, passing fds to
> hot-reset model.

OK. Let's see whether the scenario below and the user's handling of it
make sense.

Say there are five devices (devA, devB, devC, devD, devE) in the same reset
group. devA and devB are in the same iommu group. devC, devD, and devE each
have a separate iommu group. Say devA is opened via cdev, devB is not opened,
devC is opened via group, devD is opened via cdev but bound to an iommufd_ctx
that is different from devA's, and devE is not opened by any user.

If this INFO is called on devA, the user should get a valid dev_id for devA
but four invalid dev_ids. The resettable flag should be clear. Below is how
the user handles the info returned.

- For devB, the user shall get the group_id for devA and also the group_id
  for devB, hence is able to check ownership of devB by checking the group
- For devC, the user can check ownership by the group_id and bdf returned
- For devD, if it is opened by the user, the user should be able to find it
  by bdf
- For devE, the user shall fail to find it, hence considers it not owned.

To finish the above check, the user needs to get the group_id via dev_id and
also needs to get the group_id via device fd. Is that right?

The above example may be the trickiest scenario. Is that right? The user
shall not do hot-reset as not all affected devices are owned by the user.
But if devE is also opened by the user, it could do hot-reset.
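
The five-device scenario can be written down as a small C model to check
the reasoning. This is only a sketch: struct scen_dev and the helper
names are hypothetical, the group numbers are arbitrary, and the model
only tracks how the current user opened each device (other users are
not represented).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum open_mode { NOT_OPENED, OPENED_CDEV, OPENED_GROUP };

/* Hypothetical description of one affected device, from the user's view. */
struct scen_dev {
	const char *name;
	int iommu_group;
	enum open_mode mode;  /* how the current user opened it */
	int ctx;              /* iommufd ctx id if OPENED_CDEV, else -1 */
};

/* Devices that would report a valid dev_id when INFO is called via @caller_ctx. */
size_t count_valid_dev_ids(const struct scen_dev *devs, size_t n,
			   int caller_ctx)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (devs[i].mode == OPENED_CDEV && devs[i].ctx == caller_ctx)
			count++;
	return count;
}

/* Owned if opened by the user, or unopened but sharing a group with one that is. */
bool user_owns(const struct scen_dev *devs, size_t n, size_t idx)
{
	assert(idx < n);

	if (devs[idx].mode != NOT_OPENED)
		return true;
	for (size_t i = 0; i < n; i++)
		if (devs[i].mode != NOT_OPENED &&
		    devs[i].iommu_group == devs[idx].iommu_group)
			return true;
	return false;
}

/* Proof-of-ownership is only possible if every affected device is owned. */
bool user_may_hot_reset(const struct scen_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (!user_owns(devs, n, i))
			return false;
	return true;
}
```

With devA (group 1, cdev, ctx 1), devB (group 1, unopened), devC (group 2,
group-opened), devD (group 3, cdev, ctx 2) and devE (group 4, unopened),
only devA has a valid dev_id from ctx 1, devB is owned implicitly through
devA's group, and devE is what blocks the reset in this model.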

> > User would require more info to tell the above cases from each other.
> 
> Obviously we could be equivalent to the group model if IOMMU groups
> were exposed for a device and all devices had IOMMU groups, but
> reasons...
> 
> > > > array to associate affected, owned devices, and still has the
> > > > equivalent information to know that one or more of the devices listed
> > > > with an invalid dev-id are preventing the hot-reset from being
> > > > available.
> > > >
> > > > Is that an option?  Thanks,
> > > >
> > >
> > > This works for me if above corner case can be waived.
> >
> > One side check, perhaps already confirmed in prior email. @Alex, So
> > the reason for the prediction of hot-reset is to avoid the possible
> > vfio_pci_pre_reset() which does heavy operations like stop DMA and
> > copy config space. Is it? Any other special reason? Anyhow, this reason
> > is enough for this prediction per my understanding.
> 
> It's not clear to me what "prediction" is referring to.

It refers to predicting whether the hot-reset ioctl will work, as you
mentioned in a prior discussion [1]:

"I disagree, as I've argued before, the info ioctl becomes so weak and
effectively arbitrary from a user perspective at being able to predict
whether the hot-reset ioctl works that it becomes useless, diminishing
the entire hot-reset info/execute API."

[1] https://lore.kernel.org/kvm/20230405134945.29e967be.alex.williamson@redhat.com/

> As above, I
> think we can redefine the reset-available flag I proposed to more
> restrictively indicate that the null-array approach is available based
> on the dev-set group in relation to the iommufd_ctx of the calling
> device.  Prediction of the affected devices seems like basic
> functionality to me, we can't assume the user's usage model, they must
> be able to make a well informed decision regarding affected devices.
> Thanks,

As in my reply above with the five-device scenario, the user still needs
to get the group_id to check implicit ownership in the case of devices
sharing the same iommu_group.

Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-14  9:11                                                 ` Tian, Kevin
  2023-04-14 11:38                                                   ` Liu, Yi L
  2023-04-14 16:34                                                   ` Alex Williamson
@ 2023-04-17 13:39                                                   ` Jason Gunthorpe
  2023-04-18  1:28                                                     ` Tian, Kevin
  2023-04-18 10:23                                                     ` Liu, Yi L
  2 siblings, 2 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-17 13:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:

> The only corner case with this option is when a user mixes group
> and cdev usages. iirc you mentioned it's a valid usage to be supported.
> In that case the kernel doesn't have sufficient knowledge to judge
> 'resettable' as it doesn't know which groups are opened by this user.

IMHO we don't need to support this combination.

We can say that to use the hot reset API the user must put all their
devices into the same iommufd_ctx and cover 100% of the known use
cases for this.

There are already other situations, like nesting, that do force users
to put everything into one iommufd_ctx.

No reason to make things harder and more complicated.

I'm coming to the feeling that we should put no-iommu devices in
iommufd_ctx's as well. They would be an iommufd_access like
mdevs. That would clean up the complications they cause here.

I suppose we should have done that from the beginning - no-iommu is an
IOMMUFD access, it just uses a crazy /proc based way to learn the
PFNs. Making it a proper access and making a real VFIO ioctl that
calls iommufd_access_pin_pages() and returns the DMA mapped addresses
to userspace would go a long way to making no-iommu work in a logical,
usable, way.

Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-13 18:07                                               ` Alex Williamson
  2023-04-14  9:11                                                 ` Tian, Kevin
@ 2023-04-17 14:05                                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-17 14:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, Apr 13, 2023 at 12:07:12PM -0600, Alex Williamson wrote:

> IIUC, the semantics we're proposing is that an INFO2 ioctl would return
> success or failure indicating whether the user has sufficient ownership
> of the affected devices, 

Or a flag, but yes

> and in the success case returns an array of
> affected dev-ids within the user's iommufd_ctx.  Unopened, affected
> devices, are not reported via INFO2, and unopened, affected devices
> outside the user's scope of ownership (ie. outside the owned IOMMU
> group) will generate a failure condition.

Yes

> As for the INFO ioctl, it's described as unchanged, which does raise
> the question of what is reported for IOMMU groups and how does the
> value there coherently relate to anything else in the cdev-exclusive
> vfio API...

For cdev mode the value of the group_id has no functional
purpose. INFO has no functional purpose beyond debugging. The cdev
enabled userspace should print the BDFs from the INFO in a debug
message and ignore the group_id.

Kernel will still fill the group_id using the iommu_get_group() stuff,
and set -1 for no-iommu.
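
For the "print the BDFs in a debug message" part, a userspace sketch
could look like the following. struct dep_dev here is a local stand-in
laid out like the uAPI's struct vfio_pci_dependent_device (real code
would include <linux/vfio.h> instead), and format_bdf() is a
hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Local stand-in modeled on struct vfio_pci_dependent_device. */
struct dep_dev {
	uint32_t group_id;  /* ignored by a cdev-mode user in this sketch */
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;      /* slot in bits 7:3, function in bits 2:0 */
};

/*
 * Format "ssss:bb:dd.f" into @buf.  Returns what snprintf() returns,
 * i.e. the number of characters the full string needs.
 */
int format_bdf(char *buf, size_t len, const struct dep_dev *dep)
{
	assert(buf != NULL && dep != NULL);

	return snprintf(buf, len, "%04x:%02x:%02x.%x",
			(unsigned int)dep->segment,
			(unsigned int)dep->bus,
			(unsigned int)((dep->devfn >> 3) & 0x1f),
			(unsigned int)(dep->devfn & 0x7));
}
```

The group_id field is deliberately left out of the output, since in cdev
mode it is described above as having no functional purpose.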

> We had already iterated a proposal where the group-id is replaced with
> a dev-id in the existing ioctl and a flag indicates when the return
> value is a dev-id vs group-id.  This had a gap that userspace cannot
> determine if a reset is available given this information since un-owned
> devices report an invalid dev-id and userspace can't know if it has
> implicit ownership.

IIRC, yes.

> It seems cleaner to me though that we would could still re-use INFO in
> a similar way, simply defining a new flag bit which is valid only in
> the case of returning dev-ids and indicates if the reset is
> available.

Yes, it could be done like this as well. INFO2 is more a discussion
object, how we encode it in the uAPI matters a lot less. The point is
that INFO2, as an idea, returns information that no other existing API
returns: the "ownership passed flag" and "dev_id list"

Then as I said in the other mail we roll no-iommu into an iommufd_ctx
object and just follow the design that userspace must have a single
iommufd_ctx containing all the devices to use the hot reset feature.

Jason


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17  4:20                                                       ` Liu, Yi L
@ 2023-04-17 19:01                                                         ` Alex Williamson
  2023-04-17 19:31                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-17 19:01 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Mon, 17 Apr 2023 04:20:27 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Saturday, April 15, 2023 1:11 AM
> > 
> > On Fri, 14 Apr 2023 11:38:24 +0000
> > "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> >   
> > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Friday, April 14, 2023 5:12 PM
> > > >  
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Friday, April 14, 2023 2:07 AM
> > > > >
> > > > > We had already iterated a proposal where the group-id is replaced with
> > > > > a dev-id in the existing ioctl and a flag indicates when the return
> > > > > value is a dev-id vs group-id.  This had a gap that userspace cannot
> > > > > determine if a reset is available given this information since un-owned
> > > > > devices report an invalid dev-id and userspace can't know if it has
> > > > > implicit ownership.  
> > > >  
> > > > >
> > > > > > It seems cleaner to me though that we could still re-use INFO in
> > > > > a similar way, simply defining a new flag bit which is valid only in
> > > > > the case of returning dev-ids and indicates if the reset is available.
> > > > > Therefore in one ioctl, userspace knows if hot-reset is available
> > > > > (based on a kernel determination) and can pull valid dev-ids from the  
> > >
> > > Need to confirm the meaning of hot-reset available flag. I think it
> > > should at least meet below two conditions to set this flag. Although
> > > it may not mean hot-reset is for sure to succeed. (but should be
> > > a high chance).
> > >
> > > 1) dev_set is resettable (all affected device are in dev_set)
> > > 2) affected device are owned by the current user  
> > 
> > Per thread with Kevin, ownership can't always be known by the kernel.
> > Beyond the group vs cdev discussion there, isn't it also possible
> > (though perhaps not recommended) that a user can have multiple iommufd
> > ctxs?  So I think 2) becomes "ownership of the affected dev-set can be
> > inferred from the iommufd_ctx of the calling device", iow, the
> > null-array calling model is available and the flag is redefined to
> > match.  Reset may still be available via the proof-of-ownership model.  
> 
> Yes, if there are multiple iommufd ctxs, this shall fall back to use
> the proof-of-ownership model.
> 
> >   
> > > > Also, we need to assume that the two cases below are rare; if a user
> > > > encounters one, it is just bad luck for them. I think the existing
> > > > _INFO and hot-reset already make this assumption, so cdev mode
> > > > can adopt it as well.
> > >
> > > a) physical topology change (e.g. new devices plugged to affected slot)
> > > b) an affected device is unbound from vfio  
> > 
> > Yes, these are sufficiently rare that we can't do much about them.
> >   
> > > > So the kernel needs to compare the group id between devices with
> > > > valid dev-ids and devices with invalid dev-ids to decide the implicit
> > > > ownership. For noiommu device which has no group_id when
> > > > VFIO_GROUP is off then it's resettable only if having a valid dev_id.  
> > >
> > > In cdev mode, noiommu device doesn't have dev_id as it is not
> > > bound to valid iommufd. So if VFIO_GROUP is off, we may never
> > > allow hot-reset for noiommu devices. But we don't want to have
> > > regression with noiommu devices. Perhaps we may define the usage
> > > of the resettable flag like this:
> > > 1) if it is set, user does not need to own all the affected devices as
> > >     some of them may have been owned implicitly. Kernel should have
> > >     checked it.
> > > 2) if the flag is not set, that means user needs to check ownership
> > >     by itself. It needs to own all the affected devices. If not, don't
> > >    do hot-reset.  
> > 
> > Exactly, the flag essentially indicates that the null-array approach is
> > available, lack of the flag indicates proof-of-ownership is required.
> >   
> > > This way we can still make noiommu devices support hot-reset
> > > just like VFIO_GROUP is on. Because noiommu devices have fake
> > > groups, such groups are all singleton. So checking all affected
> > > devices are opened by user is just same as check all affected
> > > groups.  
> > 
> > Yep.
> >   
> > > > The only corner case with this option is when a user mixes group
> > > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > > In that case the kernel doesn't have sufficient knowledge to judge
> > > > 'resettable' as it doesn't know which groups are opened by this user.
> > > >
> > > > Not sure whether we can leave it in a ugly way so INFO may not tell
> > > > 'resettable' accurately in that weird scenario.  
> > >
> > > This seems not easy to support. If above scenario is allowed there can be
> > > three cases that returns invalid dev_id.
> > > 1) devices not opened by user but owned implicitly  
> > 
> > The cdev approach has a hard time with this in general, it has no way to
> > > represent unopened devices, so any case where the nature of an unopened
> > > device blocks reset on the dev-set is rather opaque to the user.
> >   
> > > 2) devices not owned by user  
> > 
> > > (and presumably not owned)  We still provide BDF.  Not much difference
> > from the group case here, being able to point to a BDF or group is
> > about all we can do.
> >   
> > > 3) devices opened via group but owned by user  
> > 
> > I think this still works in the proof-of-ownership, passing fds to
> > hot-reset model.  
> 
> Ok. Let's see whether the scenario below and the user's handling of it
> make sense.
> 
> Say there are five devices (devA, devB, devC, devD, devE) in the same reset
> group. devA and devB are in the same iommu group. devC, devD, and devE have
> separate iommu groups. Say devA is opened via cdev, devB is not opened, devC
> is opened via group, devD is opened via cdev but bound to a different
> iommufd_ctx than devA. devE is not opened by any user.
> 
> If this INFO is called on devA, user should get a valid dev_id for devA, but
> four invalid dev_ids. The resettable flag should be clear. Below is how the
> user would handle the info returned.

INFO from devA returns:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { invalid dev-id, devB-BDF },
  { invalid dev-id, devC-BDF },
  { invalid dev-id, devD-BDF },
  { invalid dev-id, devE-BDF },
}

User knows devA-id, learns devA-BDF

from devC:
{
  { devA/B-group-id, devA-BDF },
  { devA/B-group-id, devB-BDF },
  { devC-group-id,   devC-BDF },
  { devD-group-id,   devD-BDF },
  { devE-group-id,   devE-BDF },
}

User is assumed to know devC group-id + BDF given group semantics,
knows devA ownership, infers devB ownership.

from devD:
flags: NOT_RESETABLE | DEV_ID
{
  { invalid dev-id, devA-BDF },
  { invalid dev-id, devB-BDF },
  { invalid dev-id, devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

User knows devD-id, learns devD-bdf, knows devA and devC ownership, and
inferred devB ownership

> - For devB, user shall get the group_id for devA, and also get group_id for
>   devB, hence able to check ownership of devB by checking the group

Per above, groups are only available through the group devices,
therefore inferred ownership of devB can only be learned from devC.

> - For devC, user can check ownership by the group_id and bdf returned

Yes, the INFO ioctl on devC can confirm devC is affected, but more
importantly this is the bridge to learn BDF of other affected devices
and their groups.

> - For devD, if it is opened by the user, should be able to find it by bdf

I think the reverse, the user presumably already knows the dev-id for
devD and knows that a hot-reset of the calling device necessarily
affects the device, but it learns the BDF, which helps it connect 4 of
the 5 devices affected by the reset.

> - For devE, user shall fail to find it hence consider no ownership on it.

Yes, which is correct.

> To finish the above check, user needs to get group_id via devid and also needs
> to get group_id via device fd. Is it?

Not absolutely required, but the user needs to do a lot of inferring via
BDF.
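
The BDF-based inference walked through above can be sketched roughly as
follows. This is only an illustrative userspace model, not an existing
interface: the function name `owned_bdfs`, the tuple layouts, and the use of
`None` for an invalid dev-id are all assumptions made for the example.

```python
def owned_bdfs(cdev_infos, group_info, opened_group_ids):
    """Return the set of BDFs the user can show it owns.

    cdev_infos:       INFO results from cdev-opened devices, each a list of
                      (dev_id or None, bdf) tuples (None = invalid dev-id)
    group_info:       INFO result from a group-opened device, a list of
                      (group_id, bdf) tuples
    opened_group_ids: IOMMU groups the user opened through the group API
    """
    # A valid dev-id ties one of the user's cdev devices to a BDF.
    owned = {bdf for entries in cdev_infos
             for dev_id, bdf in entries if dev_id is not None}
    # The group-opened device is the bridge from BDF to IOMMU group.
    group_of = {bdf: gid for gid, bdf in group_info}
    owned_groups = set(opened_group_ids)
    owned_groups |= {group_of[bdf] for bdf in owned if bdf in group_of}
    # Owning one device in a group implies ownership of the whole group,
    # which is how an unopened device such as devB is picked up.
    return owned | {bdf for gid, bdf in group_info if gid in owned_groups}
```

Fed with the three INFO results from the five-device scenario, this yields
devA, devB, devC and devD but not devE, so the user concludes that hot-reset
must not be attempted.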

> The above example may be the most tricky scenario. Is it? user shall not do
> hot-reset as not all affected devices are owned by user. But if devE is also
> opened by user, it could do hot-reset.

Yes, it's not trivial, but Jason is now proposing that we consider
mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
this means that regardless of which device calls INFO, there's only one
answer (assuming same set of devices opened, all cdev, all within same
iommufd_ctx).  Based on what I explained about my understanding of INFO2
and Jason agreed to, I think the output would be:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { valid devC-id,  devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

Here devB gets dropped because the kernel understands that devB is
unopened, affected, and owned.  It's therefore not a blocker for
hot-reset.  OTOH, devE is unopened, affected, and un-owned, and we
previously agreed against the opportunistic un-opened/un-owned loophole.

If devA and devD were separate iommufd_ctxs, with devC in the same
ctx as devA, I think this becomes:

INFO on devA:
flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { valid devC-id,  devC-BDF },
  { invalid dev-id, devD-BDF },
  { invalid dev-id, devE-BDF },
}

INFO on devD:
flags: NOT_RESETABLE | DEV_ID
{
  { invalid dev-id, devA-BDF },
  { invalid dev-id, devB-BDF },
  { invalid dev-id, devC-BDF },
  { valid devD-id, devD-BDF },
  { invalid dev-id, devE-BDF },
}

I think this illustrates that it makes sense for unopened affected
devices with implicit ownership to always be hidden, while all other
affected devices are fully enumerated.
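
A minimal sketch of the enumeration rule illustrated by these examples,
written as a Python model rather than kernel code. The `Dev` record, the
`info()` helper, and the use of `None` as the invalid dev-id are assumptions
for illustration only; the rule itself (hide unopened devices whose IOMMU
group is already owned through the caller's iommufd_ctx, list everything else)
is taken from the examples above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

INVALID = None  # hypothetical stand-in for an invalid dev-id


@dataclass(frozen=True)
class Dev:
    name: str            # e.g. "devA"
    group: int           # IOMMU group number
    ctx: Optional[str]   # iommufd_ctx the device is opened in; None if unopened


def info(devs: List[Dev], caller_ctx: str) -> Tuple[bool, list]:
    """Model the proposed INFO output for a caller bound to caller_ctx."""
    # Groups with at least one device opened in the caller's iommufd_ctx.
    owned_groups = {d.group for d in devs if d.ctx == caller_ctx}
    entries = []
    for d in devs:
        if d.ctx == caller_ctx:
            # Opened in the same ctx: report a valid dev-id.
            entries.append((d.name + "-id", d.name + "-BDF"))
        elif d.ctx is None and d.group in owned_groups:
            # Unopened but implicitly owned: not a blocker, so hidden.
            continue
        else:
            # Un-owned, or opened in another ctx: reported as invalid.
            entries.append((INVALID, d.name + "-BDF"))
    not_resettable = any(dev_id is INVALID for dev_id, _ in entries)
    return not_resettable, entries
```

With devA, devC and devD in one ctx, the model lists devA, devC, devD and
devE and omits devB. With devD moved to a second ctx, INFO on devD lists all
five devices, since devB is no longer implicitly owned by the caller.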

> > > User would require more info to tell the above cases from each other.  
> > 
> > Obviously we could be equivalent to the group model if IOMMU groups
> > were exposed for a device and all devices had IOMMU groups, but
> > reasons...
> >   
> > > > > array to associate affected, owned devices, and still has the
> > > > > equivalent information to know that one or more of the devices listed
> > > > > with an invalid dev-id are preventing the hot-reset from being
> > > > > available.
> > > > >
> > > > > Is that an option?  Thanks,
> > > > >  
> > > >
> > > > This works for me if above corner case can be waived.  
> > >
> > > One side check, perhaps already confirmed in prior email. @Alex, So
> > > the reason for the prediction of hot-reset is to avoid the possible
> > > vfio_pci_pre_reset() which does heavy operations like stop DMA and
> > > copy config space. Is it? Any other special reason? Anyhow, this reason
> > > is enough for this prediction per my understanding.  
> > 
> > It's not clear to me what "prediction" is referring to.  
> 
> It is predicting whether hot-reset ioctl can work or not as you mentioned
> in prior discussion.[1].
> 
> "I disagree, as I've argued before, the info ioctl becomes so weak and
> effectively arbitrary from a user perspective at being able to predict
> whether the hot-reset ioctl works that it becomes useless, diminishing
> the entire hot-reset info/execute API."
> 
> [1] https://lore.kernel.org/kvm/20230405134945.29e967be.alex.williamson@redhat.com/

I think we're narrowing in on an interface that isn't as arbitrary.  If
we assume the restrictions that Jason proposes, then cdev is exclusively
a kernel determined reset availability model, where I'd agree that
passing device-fds as a proof of ownership is pointless.  The group
interface would therefore remain exclusively a proof-of-ownership
model since we have no incentive to extend it to kernel-determined
given the limited use case of all affected devices managed by the same
vfio container.

> > As above, I
> > think we can redefine the reset-available flag I proposed to more
> > restrictively indicate that the null-array approach is available based
> > on the dev-set group in relation to the iommufd_ctx of the calling
> > device.  Prediction of the affected devices seems like basic
> > functionality to me, we can't assume the user's usage model, they must
> > be able to make a well informed decision regarding affected devices.
> > Thanks,  
> 
> As my above reply with the five-device scenario. It still needs to get
> group_id to check implicit ownership in the case of sharing the same
> iommu_group.

Moot, but there's actually enough information there to infer IOMMU
groups for each device, but we probably can't prove that would always
be the case.  If we adopt Jason's proposal though, I don't see that we
need either a group-id or BDF capability, the BDF is only for debug
reporting.  However, there is a new burden on the kernel to identify
the affected, un-owned devices for that report.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17 19:01                                                         ` Alex Williamson
@ 2023-04-17 19:31                                                           ` Jason Gunthorpe
  2023-04-17 20:06                                                             ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-17 19:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> Yes, it's not trivial, but Jason is now proposing that we consider
> mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> this means that regardless of which device calls INFO, there's only one
> answer (assuming same set of devices opened, all cdev, all within same
> iommufd_ctx).  Based on what I explained about my understanding of INFO2
> and Jason agreed to, I think the output would be:
> 
> flags: NOT_RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { invalid dev-id, devE-BDF },
> }
> 
> Here devB gets dropped because the kernel understands that devB is
> unopened, affected, and owned.  It's therefore not a blocker for
> hot-reset.

I don't think we want to drop anything because it makes the API
ill suited for the debugging purpose.

devB should be returned with an invalid dev_id if I understand your
example. Maybe it should return with -1 as the dev_id instead of 0, to
make the debugging a bit better.

Userspace should look at only NOT_RESETTABLE to determine if it
proceeds or not, and it should use the valid dev_id list to iterate
over the devices it has open to do the config stuff.
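
As a rough illustration of that userspace flow, assuming hypothetical flag
bit values and -1 as the invalid dev_id (neither is a settled ABI, and
`devices_to_prepare` is a made-up helper name):

```python
NOT_RESETTABLE = 1 << 0   # hypothetical flag bit, not a real uapi value
DEV_ID = 1 << 1           # hypothetical flag bit, not a real uapi value
INVALID_DEV_ID = -1       # placeholder, following the -1 suggestion above


def devices_to_prepare(flags, entries, open_devices):
    """Decide whether to proceed and which open devices need config handling.

    entries:      list of (dev_id, bdf) tuples as returned by INFO
    open_devices: dict mapping dev_id to the user's own device handle
    Returns None when reset must not be attempted.
    """
    # The flag alone decides whether hot-reset may proceed.
    if flags & NOT_RESETTABLE:
        return None
    # Invalid dev_ids are skipped; they are only there for debugging.
    return [open_devices[dev_id]
            for dev_id, _bdf in entries
            if dev_id != INVALID_DEV_ID]
```

The BDF column is deliberately ignored here; under this model it is only
useful for reporting which device is involved when something goes wrong.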

> OTOH, devE is unopened, affected, and un-owned, and we
> previously agreed against the opportunistic un-opened/un-owned loophole.

NOT_RESETABLE should be returned in this case, yes.

If we want to enable userspace to use the loophole it should be an
additional flag. RESETABLE_FOR_NOW or something

> I think we're narrowing in on an interface that isn't as arbitrary.  If
> we assume the restrictions that Jason proposes, then cdev is exclusively
> a kernel determined reset availability model

Yes, I think this is probably best looking forward.

> where I'd agree that
> passing device-fds as a proof of ownership is pointless.  The group
> interface would therefore remain exclusively a proof-of-ownership
> model since we have no incentive to extend it to kernel-determined
> given the limited use case of all affected devices managed by the same
> vfio container.

Yes

> Moot, but there's actually enough information there to infer IOMMU
> groups for each device, but we probably can't prove that would always
> be the case.  If we adopt Jason's proposal though, I don't see that we
> need either a group-id or BDF capability, the BDF is only for debug
> reporting.  However, there is a new burden on the kernel to identify
> the affected, un-owned devices for that report.  

Yes and yes

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17 19:31                                                           ` Jason Gunthorpe
@ 2023-04-17 20:06                                                             ` Alex Williamson
  2023-04-18  3:24                                                               ` Tian, Kevin
  2023-04-18 12:57                                                               ` Jason Gunthorpe
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-17 20:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, 17 Apr 2023 16:31:56 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > Yes, it's not trivial, but Jason is now proposing that we consider
> > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > this means that regardless of which device calls INFO, there's only one
> > answer (assuming same set of devices opened, all cdev, all within same
> > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > and Jason agreed to, I think the output would be:
> > 
> > flags: NOT_RESETABLE | DEV_ID
> > {
> >   { valid devA-id,  devA-BDF },
> >   { valid devC-id,  devC-BDF },
> >   { valid devD-id,  devD-BDF },
> >   { invalid dev-id, devE-BDF },
> > }
> > 
> > Here devB gets dropped because the kernel understands that devB is
> > unopened, affected, and owned.  It's therefore not a blocker for
> > hot-reset.  
> 
> I don't think we want to drop anything because it makes the API
> ill suited for the debugging purpose.
> 
> devb should be returned with an invalid dev_id if I understand your
> example. Maybe it should return with -1 as the dev_id instead of 0, to
> make the debugging a bit better.
> 
> Userspace should look at only NOT_RESETTABLE to determine if it
> proceeds or not, and it should use the valid dev_id list to iterate
> over the devices it has open to do the config stuff.

If an affected device is owned, not opened, and not interfering with
the reset, what is it adding to the API to report it for debugging
purposes?  I'm afraid this leads into expanding "invalid dev-id" into an
errno or bitmap of error conditions that the user needs to parse.

> > OTOH, devE is unopened, affected, and un-owned, and we
> > previously agreed against the opportunistic un-opened/un-owned loophole.  
> 
> NOT_RESETABLE should be returned in this case, yes.
> 
> If we want to enable userspace to use the loophole it should be an
> additional flag. RESETABLE_FOR_NOW or something

Ugh, please no.  It's always a volatile result, but a volatile result
that relies on device state outside the scope or control of the user is
not even worthwhile imo.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17 13:39                                                   ` Jason Gunthorpe
@ 2023-04-18  1:28                                                     ` Tian, Kevin
  2023-04-18 10:23                                                     ` Liu, Yi L
  1 sibling, 0 replies; 145+ messages in thread
From: Tian, Kevin @ 2023-04-18  1:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kvm, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, mjrosato, lulu,
	Jiang, Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe
> Sent: Monday, April 17, 2023 9:39 PM
> 
> On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> 
> > The only corner case with this option is when a user mixes group
> > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > In that case the kernel doesn't have sufficient knowledge to judge
> > 'resettable' as it doesn't know which groups are opened by this user.
> 
> IMHO we don't need to support this combination.
> 
> We can say that to use the hot reset API the user must put all their
> devices into the same iommufd_ctx and cover 100% of the known use
> cases for this.

Makes sense.

> 
> There are already other situations, like nesting, that do force users
> to put everything into one iommufd_ctx.
> 
> No reason to make things harder and more complicated.
> 
> I'm coming to the feeling that we should put no-iommu devices in
> iommufd_ctx's as well. They would be an iommufd_access like
> mdevs. That would clean up the complications they cause here.

This certainly simplifies the matter a lot!

> 
> I suppose we should have done that from the beginning - no-iommu is an
> IOMMUFD access, it just uses a crazy /proc based way to learn the
> PFNs. Making it a proper access and making a real VFIO ioctl that
> calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> to userspace would go a long way to making no-iommu work in a logical,
> usable, way.
> 

Yes. This would provide a more reliable and cleaner way to learn PFNs for
the no-iommu case.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17 20:06                                                             ` Alex Williamson
@ 2023-04-18  3:24                                                               ` Tian, Kevin
  2023-04-18  4:10                                                                 ` Alex Williamson
  2023-04-18 12:57                                                               ` Jason Gunthorpe
  1 sibling, 1 reply; 145+ messages in thread
From: Tian, Kevin @ 2023-04-18  3:24 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 18, 2023 4:07 AM
> 
> On Mon, 17 Apr 2023 16:31:56 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > this means that regardless of which device calls INFO, there's only one
> > > answer (assuming same set of devices opened, all cdev, all within same
> > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > and Jason agreed to, I think the output would be:
> > >
> > > flags: NOT_RESETABLE | DEV_ID
> > > {
> > >   { valid devA-id,  devA-BDF },
> > >   { valid devC-id,  devC-BDF },
> > >   { valid devD-id,  devD-BDF },
> > >   { invalid dev-id, devE-BDF },
> > > }
> > >
> > > Here devB gets dropped because the kernel understands that devB is
> > > unopened, affected, and owned.  It's therefore not a blocker for
> > > hot-reset.
> >
> > I don't think we want to drop anything because it makes the API
> > ill suited for the debugging purpose.
> >
> > devb should be returned with an invalid dev_id if I understand your
> > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > make the debugging a bit better.
> >
> > Userspace should look at only NOT_RESETTABLE to determine if it
> > proceeds or not, and it should use the valid dev_id list to iterate
> > over the devices it has open to do the config stuff.
> 
> If an affected device is owned, not opened, and not interfering with
> the reset, what is it adding to the API to report it for debugging
> purposes?  I'm afraid this leads into expanding "invalid dev-id" into an

consistent output before and after devB is opened.

> errno or bitmap of error conditions that the user needs to parse.
> 

Not exactly.

If RESETABLE, an invalid dev_id doesn't matter. The user only uses the
valid dev_id list to iterate, as Jason pointed out.

If NOT_RESETTABLE is due to devE not being assigned to the VM, one can
easily figure that out by simply looking at the list of affected BDFs
and the configuration of assigned devices of the VM. Then the invalid
dev_id also doesn't matter.

If NOT_RESETTABLE while devE is already assigned to the VM, then it's an
indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
people should debug with other means/hints to dig out the exact
culprit.

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18  3:24                                                               ` Tian, Kevin
@ 2023-04-18  4:10                                                                 ` Alex Williamson
  2023-04-18  5:02                                                                   ` Tian, Kevin
  2023-04-18 10:34                                                                   ` Liu, Yi L
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-18  4:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, 18 Apr 2023 03:24:46 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 18, 2023 4:07 AM
> > 
> > On Mon, 17 Apr 2023 16:31:56 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > this means that regardless of which device calls INFO, there's only one
> > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > > and Jason agreed to, I think the output would be:
> > > >
> > > > flags: NOT_RESETABLE | DEV_ID
> > > > {
> > > >   { valid devA-id,  devA-BDF },
> > > >   { valid devC-id,  devC-BDF },
> > > >   { valid devD-id,  devD-BDF },
> > > >   { invalid dev-id, devE-BDF },
> > > > }
> > > >
> > > > Here devB gets dropped because the kernel understands that devB is
> > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > hot-reset.  
> > >
> > > I don't think we want to drop anything because it makes the API
> > > ill suited for the debugging purpose.
> > >
> > > devb should be returned with an invalid dev_id if I understand your
> > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > make the debugging a bit better.
> > >
> > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > proceeds or not, and it should use the valid dev_id list to iterate
> > > over the devices it has open to do the config stuff.  
> > 
> > If an affected device is owned, not opened, and not interfering with
> > the reset, what is it adding to the API to report it for debugging
> > purposes?  I'm afraid this leads into expanding "invalid dev-id" into an  
> 
> consistent output before and after devB is opened.

In the case where devB is not opened, including it only provides
useless information.  In the case where devB is opened, it must be
reported as an opened, affected device.

> > errno or bitmap of error conditions that the user needs to parse.
> >   
> 
> Not exactly.
> 
> If RESETABLE, an invalid dev_id doesn't matter. The user only uses the
> valid dev_id list to iterate, as Jason pointed out.

Yes, but...

> If NOT_RESETTABLE is due to devE not being assigned to the VM, one can
> easily figure that out by simply looking at the list of affected BDFs
> and the configuration of assigned devices of the VM. Then the invalid
> dev_id also doesn't matter.

Huh?

Given:

flags: NOT_RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { invalid dev-id, devB-BDF },
  { valid devC-id,  devC-BDF },
  { valid devD-id,  devD-BDF },
  { invalid dev-id, devE-BDF },
}

How does the user determine that devE is to blame and not devB based on
BDF?  The user cannot rely on sysfs for help, they don't know the IOMMU
grouping, nor do they know the BDF except as inferred by matching valid
dev-ids in the above output.
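
The ambiguity can be made concrete with a small sketch. It assumes the user
has nothing to go on except the flag and the entry list, with `None` standing
in for an invalid dev-id; `blocking_candidates` is a hypothetical helper, not
part of any proposed interface.

```python
def blocking_candidates(not_resettable, entries):
    """Return the BDFs that could be blocking hot-reset.

    entries is a list of (dev_id or None, bdf) tuples; None marks an
    invalid dev-id.  Without IOMMU group information, every invalid
    entry is an equally plausible culprit.
    """
    if not not_resettable:
        return []
    return [bdf for dev_id, bdf in entries if dev_id is None]
```

Applied to the list above, this returns both devB-BDF and devE-BDF. If devB
is left out of the list because it is implicitly owned, the same helper
returns only devE-BDF, which is the actual blocker.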
 
> If NOT_RESETTABLE while devE is already assigned to the VM, then it's an
> indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> people should debug with other means/hints to dig out the exact
> culprit.

I don't know what situation you're trying to explain here.  If devE
were opened within the same iommufd_ctx, this becomes:

flags: RESETABLE | DEV_ID
{
  { valid devA-id,  devA-BDF },
  { invalid dev-id, devB-BDF },
  { valid devC-id,  devC-BDF },
  { valid devD-id,  devD-BDF },
  { valid devE-id,  devE-BDF },
}

Yes, the user should only be looking at the flag to determine the
availability of hot-reset, (here's the but) but how is it consistent to
indicate both that hot-reset is available and include an invalid
dev-id?  The consistency as I propose is that an invalid dev-id is only
presented with NOT_RESETTABLE for the device blocking hot-reset.  In
the previous case, devB is not blocking reset and reporting an invalid
dev-id only serves to obfuscate determining the blocking device.

For the cases of affected group-opened devices or separate
iommufd_ctxs, the user gets invalid dev-ids for anything outside of
the calling device's iommufd_ctx.

We haven't discussed how it fails when called on a group-opened device
in a mixed environment.  I'd propose that the INFO ioctl behaves
exactly as it does today, reporting group-id and BDF for each affected
device.  However, the hot-reset ioctl itself is not extended to accept
devicefd because there is no proof-of-ownership model for cdevs.
Therefore even if the user could map group-id to devicefd, they get
-EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
a group-opened device.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18  4:10                                                                 ` Alex Williamson
@ 2023-04-18  5:02                                                                   ` Tian, Kevin
  2023-04-18 12:59                                                                     ` Jason Gunthorpe
  2023-04-18 16:44                                                                     ` Alex Williamson
  2023-04-18 10:34                                                                   ` Liu, Yi L
  1 sibling, 2 replies; 145+ messages in thread
From: Tian, Kevin @ 2023-04-18  5:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 18, 2023 12:11 PM
> 
> On Tue, 18 Apr 2023 03:24:46 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Tuesday, April 18, 2023 4:07 AM
> > >
> > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > this means that regardless of which device calls INFO, there's only one
> > > > > answer (assuming same set of devices opened, all cdev, all within
> same
> > > > > iommufd_ctx).  Based on what I explained about my understanding of
> > > INFO2
> > > > > and Jason agreed to, I think the output would be:
> > > > >
> > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > {
> > > > >   { valid devA-id,  devA-BDF },
> > > > >   { valid devC-id,  devC-BDF },
> > > > >   { valid devD-id,  devD-BDF },
> > > > >   { invalid dev-id, devE-BDF },
> > > > > }
> > > > >
> > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > hot-reset.
> > > >
> > > > I don't think we want to drop anything because it makes the API
> > > > ill suited for the debugging purpose.
> > > >
> > > > devb should be returned with an invalid dev_id if I understand your
> > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > make the debugging a bit better.
> > > >
> > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > over the devices it has open to do the config stuff.
> > >
> > > If an affected device is owned, not opened, and not interfering with
> > > the reset, what is it adding to the API to report it for debugging
> > > purposes?  I'm afraid this leads into expanding "invalid dev-id" into an
> >
> > consistent output before and after devB is opened.
> 
> In the case where devB is not opened including it only provides
> useless information.  In the case where devB is opened it's necessary
> to be reported as an opened, affected device.
> 
> > > errno or bitmap of error conditions that the user needs to parse.
> > >
> >
> > Not exactly.
> >
> > If RESETABLE invalid dev_id doesn't matter. The user only use the
> > valid dev_id list to iterate as Jason pointed out.
> 
> Yes, but...
> 
> > If NOT_RESETTABLE due to devE not assigned to the VM one can
> > easily figure out the fact by simply looking at the list of affected BDFs
> > and the configuration of assigned devices of the VM. Then invalid
> > dev_id also doesn't matter.
> 
> Huh?
> 
> Given:
> 
> flags: NOT_RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { invalid dev-id, devB-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { invalid dev-id, devE-BDF },
> }
> 
> How does the user determine that devE is to blame and not devB based on
> BDF?  The user cannot rely on sysfs for help, they don't know the IOMMU
> grouping, nor do they know the BDF except as inferred by matching valid
> dev-ids in the above output.

Hmm, aren't we talking about the person who does the diagnosis? That
person will look at the VM configuration file and see that devA/B/C/D
have been assigned to the VM but devE has not.

> 
> > If NOT_RESETTABLE while devE is already assigned to the VM then it's
> > indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> > people should debug with other means/hints to dig out the exact
> > culprit.
> 
> I don't know what situation you're trying to explain here.  If devE
> were opened within the same iommufd_ctx, this becomes:

It's about a scenario where the mgmt stack has assigned all affected
devices to QEMU, but QEMU itself messed it up with mixed group/cdev
or multiple iommufd_ctxs and so hits the NOT_RESETTABLE situation.

> 
> flags: RESETABLE | DEV_ID
> {
>   { valid devA-id,  devA-BDF },
>   { invalid dev-id, devB-BDF },
>   { valid devC-id,  devC-BDF },
>   { valid devD-id,  devD-BDF },
>   { valid devE-id,  devE-BDF },
> }
> 
> Yes, the user should only be looking at the flag to determine the
> availability of hot-reset, (here's the but) but how is it consistent to
> indicate both that hot-reset is available and include an invalid
> dev-id?  The consistency as I propose is that an invalid dev-id is only
> presented with NOT_RESETTABLE for the device blocking hot-reset.  In
> the previous case, devB is not blocking reset and reporting an invalid
> dev-id only serves to obfuscate determining the blocking device.
> 
> For the cases of affected group-opened devices or separate
> iommufd_ctxs, the user gets invalid dev-ids for anything outside of
> the calling device's iommufd_ctx.
> 
> We haven't discussed how it fails when called on a group-opened device
> in a mixed environment.  I'd propose that the INFO ioctl behaves
> exactly as it does today, reporting group-id and BDF for each affected
> device.  However, the hot-reset ioctl itself is not extended to accept
> devicefd because there is no proof-of-ownership model for cdevs.
> Therefore even if the user could map group-id to devicefd, they get
> -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> a group-opened device.  Thanks,
> 

Yes I chatted with Yi about it.

If the calling device of the INFO ioctl is opened by group then behave
as it does today.

If the calling device is opened via cdev then use dev_id scheme as
discussed above.

In the hot_reset ioctl, the fd array only accepts group fds.

A cdev can be reset only via a null fd array.

One small open question remains: a null fd array could potentially work
for a group-opened device too if vfio-compat is used. In that case the
devices are in the same iommufd ctx with valid dev_ids even though they
are opened via group. But it's probably not worth blocking it?
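
The rules above can be sketched as a small decision function. This is
only an illustration of the proposal, not kernel code: every hyp_* name
is hypothetical, and the real checks (group fd ownership, iommufd_ctx
match across the dev_set) are reduced to an enum naming which check
would apply.

```c
/* All names here are hypothetical; this only models the rules above. */

/* How the calling device was opened. */
enum hyp_open_path {
	HYP_OPENED_BY_GROUP,
	HYP_OPENED_BY_CDEV,
};

/* Which proof-of-ownership check a HOT_RESET call would go through. */
enum hyp_reset_check {
	HYP_CHECK_GROUP_FDS,   /* legacy: user passes an array of group fds */
	HYP_CHECK_IOMMUFD_CTX, /* null fd array: all devices in one iommufd_ctx */
	HYP_REJECT,            /* combination not accepted */
};

static enum hyp_reset_check hyp_pick_check(enum hyp_open_path path,
					   unsigned int fd_count)
{
	/*
	 * Null fd array: ownership is judged by the iommufd_ctx. This also
	 * covers the group-opened + vfio-compat case, which is the open
	 * question above (assumed here: not blocked).
	 */
	if (fd_count == 0)
		return HYP_CHECK_IOMMUFD_CTX;

	/* A non-empty fd array only ever carries group fds. */
	if (path == HYP_OPENED_BY_GROUP)
		return HYP_CHECK_GROUP_FDS;

	/* A cdev-opened device cannot prove ownership with an fd array. */
	return HYP_REJECT;
}
```

Whether the iommufd_ctx check then passes for a group-opened device is
a separate matter; without vfio-compat there is no iommufd_ctx to match.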

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17 13:39                                                   ` Jason Gunthorpe
  2023-04-18  1:28                                                     ` Tian, Kevin
@ 2023-04-18 10:23                                                     ` Liu, Yi L
  2023-04-18 13:02                                                       ` Jason Gunthorpe
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-18 10:23 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, April 17, 2023 9:39 PM
> 
> On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> 
> > The only corner case with this option is when a user mixes group
> > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > In that case the kernel doesn't have sufficient knowledge to judge
> > 'resettable' as it doesn't know which groups are opened by this user.
> 
> IMHO we don't need to support this combination.

Do you mean we don't support hot-reset for this combination, or that we
don't support users using this combination at all? I guess the former,
right?

> 
> We can say that to use the hot reset API the user must put all their
> devices into the same iommufd_ctx and cover 100% of the known use
> cases for this.
> 
> There are already other situations, like nesting, that do force users
> to put everything into one iommufd_ctx.
> 
> No reason to make things harder and more complicated.

Ditto. We just fail hot-reset in the multiple-iommufd case, right?
Otherwise, we would need to prevent users from using multiple iommufds.

> I'm coming to the feeling that we should put no-iommu devices in
> iommufd_ctx's as well. They would be an iommufd_access like
> mdevs. That would clean up the complications they cause here.

OK, luckily you have already merged the patch series that creates an
iommufd_access for emulated devices at bind time. So the cdev series
needs to handle the noiommu case by creating an iommufd_access.

> 
> I suppose we should have done that from the beginning - no-iommu is an
> IOMMUFD access, it just uses a crazy /proc based way to learn the
> PFNs. Making it a proper access and making a real VFIO ioctl that
> calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> to userspace would go a long way to making no-iommu work in a logical,
> usable, way.

This seems to be an improvement for noiommu mode. It can be done later.
For now, generating an access_id and binding noiommu devices to the
iommufd_ctx is enough to support noiommu hot-reset.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18  4:10                                                                 ` Alex Williamson
  2023-04-18  5:02                                                                   ` Tian, Kevin
@ 2023-04-18 10:34                                                                   ` Liu, Yi L
  2023-04-18 16:49                                                                     ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-18 10:34 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Tuesday, April 18, 2023 12:11 PM
> 
[...]
>
> We haven't discussed how it fails when called on a group-opened device
> in a mixed environment.  I'd propose that the INFO ioctl behaves
> exactly as it does today, reporting group-id and BDF for each affected
> device.  However, the hot-reset ioctl itself is not extended to accept
> devicefd because there is no proof-of-ownership model for cdevs.
> Therefore even if the user could map group-id to devicefd, they get
> -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> a group-opened device.  Thanks,

Would it be better to let userspace know that hot reset will fail for
lack of proof-of-ownership when it also has cdev devices? Maybe the
RESETTABLE flag should always be meaningful, even if the calling device
of _INFO is a group-opened device. Old user applications do not need to
check it, as they will never have such a mixed environment. But new
applications, or applications that have been updated to the latest vfio
uapi, should strictly check this flag before going ahead with hot-reset.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-17 20:06                                                             ` Alex Williamson
  2023-04-18  3:24                                                               ` Tian, Kevin
@ 2023-04-18 12:57                                                               ` Jason Gunthorpe
  2023-04-18 18:39                                                                 ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-18 12:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> On Mon, 17 Apr 2023 16:31:56 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > this means that regardless of which device calls INFO, there's only one
> > > answer (assuming same set of devices opened, all cdev, all within same
> > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > and Jason agreed to, I think the output would be:
> > > 
> > > flags: NOT_RESETABLE | DEV_ID
> > > {
> > >   { valid devA-id,  devA-BDF },
> > >   { valid devC-id,  devC-BDF },
> > >   { valid devD-id,  devD-BDF },
> > >   { invalid dev-id, devE-BDF },
> > > }
> > > 
> > > Here devB gets dropped because the kernel understands that devB is
> > > unopened, affected, and owned.  It's therefore not a blocker for
> > > hot-reset.  
> > 
> > I don't think we want to drop anything because it makes the API
> > ill suited for the debugging purpose.
> > 
> > devb should be returned with an invalid dev_id if I understand your
> > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > make the debugging a bit better.
> > 
> > Userspace should look at only NOT_RESETTABLE to determine if it
> > proceeds or not, and it should use the valid dev_id list to iterate
> > over the devices it has open to do the config stuff.
> 
> If an affected device is owned, not opened, and not interfering with
> the reset, what is it adding to the API to report it for debugging
> purposes?

It lets userspace print the entire group of devices; this is the only
way something can learn the actual list of all BDFs affected.

dev_id can just return 0, we don't need a complex bitmap. Userspace
looks at the flag, if !NOT_RESETTABLE then it ignores dev_id=0.
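
As a rough userspace sketch of that rule (hypothetical names and flag
bits throughout, since the uAPI layout is still being discussed here):
the flag alone decides whether to proceed, and entries with dev_id == 0
are skipped when iterating over the opened devices.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only (not the vfio uAPI). */
#define HYP_FLAG_DEV_ID         (1u << 0)
#define HYP_FLAG_NOT_RESETTABLE (1u << 1)

/* Hypothetical per-device entry: dev_id == 0 means "no valid dev_id". */
struct hyp_dep_dev {
	uint32_t dev_id;
	uint16_t segment;
	uint8_t bus;
	uint8_t devfn;
};

/* The flag alone decides whether hot-reset may proceed. */
static bool hyp_can_hot_reset(uint32_t flags)
{
	return !(flags & HYP_FLAG_NOT_RESETTABLE);
}

/*
 * Copy the valid dev_ids into out[] (capacity at least n), skipping
 * entries with dev_id == 0. Returns the number of ids copied.
 */
static size_t hyp_collect_valid_ids(const struct hyp_dep_dev *devs, size_t n,
				    uint32_t *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (devs[i].dev_id != 0)
			out[count++] = devs[i].dev_id;
	return count;
}
```

The BDFs of the skipped entries are still there for anything that wants
to print the full affected set.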

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18  5:02                                                                   ` Tian, Kevin
@ 2023-04-18 12:59                                                                     ` Jason Gunthorpe
  2023-04-18 16:44                                                                     ` Alex Williamson
  1 sibling, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-18 12:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, Apr 18, 2023 at 05:02:44AM +0000, Tian, Kevin wrote:

> Yes I chatted with Yi about it.
> 
> If the calling device of the INFO ioctl is opened by group then behave
> as it does today.
> 
> If the calling device is opened via cdev then use dev_id scheme as
> discussed above.
> 
> in hot_reset ioctl the fd array only accepts group fd's.
> 
> cdev can be reset only via null fd array.

Agree
 
> It remains a small open that null fd array could potentially work for
> group-opened device too if vfio-compat is used. In that case devices
> are in same iommufd ctx with valid dev_id even though they are opened 
> via group. But probably it's not worthy blocking it?

IMHO not worth the complexity to block. Security is maintained if we
use an iommufd_ctx check.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18 10:23                                                     ` Liu, Yi L
@ 2023-04-18 13:02                                                       ` Jason Gunthorpe
  2023-04-23 10:28                                                         ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-18 13:02 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, Apr 18, 2023 at 10:23:55AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Monday, April 17, 2023 9:39 PM
> > 
> > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> > 
> > > The only corner case with this option is when a user mixes group
> > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > In that case the kernel doesn't have sufficient knowledge to judge
> > > 'resettable' as it doesn't know which groups are opened by this user.
> > 
> > IMHO we don't need to support this combination.
> 
> Do you mean we don't support hot-reset for this combination or we don't
> support user using this combination. I guess the prior one. Right?

Yes

> Ditto. We just fail hot-reset for the multiple iommufds case. Is it?

Yes

> > I suppose we should have done that from the beginning - no-iommu is an
> > IOMMUFD access, it just uses a crazy /proc based way to learn the
> > PFNs. Making it a proper access and making a real VFIO ioctl that
> > calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> > to userspace would go a long way to making no-iommu work in a logical,
> > usable, way.
> 
> This seems to be an improvement for noiommu mode. It can be done later.
> For now, generating access_id and binding noiommu devices with iommufdctx
> is enough for supporting noiommu hot-reset.

Yes, I'm not sure there is much value in improving no-iommu unless
someone also wants to go in and update dpdk.

At some point we will need to revise dpdk to use iommufd, maybe that
would be a good time to fix this too.

The point is that using an access is actually a logical and sensible
thing to do, not a hack to make hot reset work better.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18  5:02                                                                   ` Tian, Kevin
  2023-04-18 12:59                                                                     ` Jason Gunthorpe
@ 2023-04-18 16:44                                                                     ` Alex Williamson
  1 sibling, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-18 16:44 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, 18 Apr 2023 05:02:44 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 18, 2023 12:11 PM
> > 
> > On Tue, 18 Apr 2023 03:24:46 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Tuesday, April 18, 2023 4:07 AM
> > > >
> > > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >  
> > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > > this means that regardless of which device calls INFO, there's only one
> > > > > > answer (assuming same set of devices opened, all cdev, all within  
> > same  
> > > > > > iommufd_ctx).  Based on what I explained about my understanding of  
> > > > INFO2  
> > > > > > and Jason agreed to, I think the output would be:
> > > > > >
> > > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > > {
> > > > > >   { valid devA-id,  devA-BDF },
> > > > > >   { valid devC-id,  devC-BDF },
> > > > > >   { valid devD-id,  devD-BDF },
> > > > > >   { invalid dev-id, devE-BDF },
> > > > > > }
> > > > > >
> > > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > > hot-reset.  
> > > > >
> > > > > I don't think we want to drop anything because it makes the API
> > > > > ill suited for the debugging purpose.
> > > > >
> > > > > devb should be returned with an invalid dev_id if I understand your
> > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > > make the debugging a bit better.
> > > > >
> > > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > > over the devices it has open to do the config stuff.  
> > > >
> > > > If an affected device is owned, not opened, and not interfering with
> > > > the reset, what is it adding to the API to report it for debugging
> > > > purposes?  I'm afraid this leads into expanding "invalid dev-id" into an  
> > >
> > > consistent output before and after devB is opened.  
> > 
> > In the case where devB is not opened including it only provides
> > useless information.  In the case where devB is opened it's necessary
> > to be reported as an opened, affected device.
> >   
> > > > errno or bitmap of error conditions that the user needs to parse.
> > > >  
> > >
> > > Not exactly.
> > >
> > > If RESETABLE invalid dev_id doesn't matter. The user only use the
> > > valid dev_id list to iterate as Jason pointed out.  
> > 
> > Yes, but...
> >   
> > > If NOT_RESETTABLE due to devE not assigned to the VM one can
> > > easily figure out the fact by simply looking at the list of affected BDFs
> > > and the configuration of assigned devices of the VM. Then invalid
> > > dev_id also doesn't matter.  
> > 
> > Huh?
> > 
> > Given:
> > 
> > flags: NOT_RESETABLE | DEV_ID
> > {
> >   { valid devA-id,  devA-BDF },
> >   { invalid dev-id, devB-BDF },
> >   { valid devC-id,  devC-BDF },
> >   { valid devD-id,  devD-BDF },
> >   { invalid dev-id, devE-BDF },
> > }
> > 
> > How does the user determine that devE is to blame and not devB based on
> > BDF?  The user cannot rely on sysfs for help, they don't know the IOMMU
> > grouping, nor do they know the BDF except as inferred by matching valid
> > dev-ids in the above output.  
> 
> emmm aren't we talking about the 'person' who does diagnostic? This guy
> will look at the VM configuration file to know that devA/B/C/D have been
> assigned to the VM but not devE.

Actually the scenario is that devA/C/D are assigned, devB is implicitly
owned, and it's devE that blocks the reset.  If you've followed any of
the community forums for vfio over the years, it should be readily
apparent that placing the burden solely on the end user to perform such
a diagnosis is an unreasonable expectation.

> > > If NOT_RESETTABLE while devE is already assigned to the VM then it's
> > > indication of mixing groups, cdevs or multiple iommufd_ctxs. Then
> > > people should debug with other means/hints to dig out the exact
> > > culprit.  
> > 
> > I don't know what situation you're trying to explain here.  If devE
> > were opened within the same iommufd_ctx, this becomes:  
> 
> It's about a scenario where the mgmt.. stack has assigned all affected
> devices to Qemu but Qemu itself messed it up with mixed group/cdev
> or multiple iommufd_ctx so hitting the NON_RESETTABLE situation.

Is this a reasonable scenario?  I expect the QEMU support to favor cdev
access where available and fd passing methods will only use cdev, so
QEMU should never mess up to create such an environment.  There should
never be a case where a device is exclusively available via group
rather than cdev.

> > flags: RESETABLE | DEV_ID
> > {
> >   { valid devA-id,  devA-BDF },
> >   { invalid dev-id, devB-BDF },
> >   { valid devC-id,  devC-BDF },
> >   { valid devD-id,  devD-BDF },
> >   { valid devE-id,  devE-BDF },
> > }
> > 
> > Yes, the user should only be looking at the flag to determine the
> > availability of hot-reset, (here's the but) but how is it consistent to
> > indicate both that hot-reset is available and include an invalid
> > dev-id?  The consistency as I propose is that an invalid dev-id is only
> > presented with NOT_RESETTABLE for the device blocking hot-reset.  In
> > the previous case, devB is not blocking reset and reporting an invalid
> > dev-id only serves to obfuscate determining the blocking device.
> > 
> > For the cases of affected group-opened devices or separate
> > iommufd_ctxs, the user gets invalid dev-ids for anything outside of
> > the calling device's iommufd_ctx.
> > 
> > We haven't discussed how it fails when called on a group-opened device
> > in a mixed environment.  I'd propose that the INFO ioctl behaves
> > exactly as it does today, reporting group-id and BDF for each affected
> > device.  However, the hot-reset ioctl itself is not extended to accept
> > devicefd because there is no proof-of-ownership model for cdevs.
> > Therefore even if the user could map group-id to devicefd, they get
> > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> > a group-opened device.  Thanks,
> >   
> 
> Yes I chatted with Yi about it.
> 
> If the calling device of the INFO ioctl is opened by group then behave
> as it does today.
> 
> If the calling device is opened via cdev then use dev_id scheme as
> discussed above.
> 
> in hot_reset ioctl the fd array only accepts group fd's.
> 
> cdev can be reset only via null fd array.
> 
> It remains a small open that null fd array could potentially work for
> group-opened device too if vfio-compat is used. In that case devices
> are in same iommufd ctx with valid dev_id even though they are opened 
> via group. But probably it's not worthy blocking it?

Yes, let's not create new models for the compatibility interface, stick
with group-opened = group-id = proof-of-ownership.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18 10:34                                                                   ` Liu, Yi L
@ 2023-04-18 16:49                                                                     ` Alex Williamson
  0 siblings, 0 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-18 16:49 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, 18 Apr 2023 10:34:45 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Tuesday, April 18, 2023 12:11 PM
> >   
> [...]
> >
> > We haven't discussed how it fails when called on a group-opened device
> > in a mixed environment.  I'd propose that the INFO ioctl behaves
> > exactly as it does today, reporting group-id and BDF for each affected
> > device.  However, the hot-reset ioctl itself is not extended to accept
> > devicefd because there is no proof-of-ownership model for cdevs.
> > Therefore even if the user could map group-id to devicefd, they get
> > -EINVAL calling HOT_RESET with a devicefd when the ioctl is called from
> > a group-opened device.  Thanks,  
> 
> Will it be better to let userspace know it shall fail if invoking hot
> reset due to no proof-of-ownership as it also has cdev devices? Maybe
> the RESETTABLE flag should always be meaningful. Even if the calling
> device of _INFO is group-opened device. Old user applications does not
> need to check it as it will never have such mixed environment. But for
> new applications or the applications that have been updated per latest
> vfio uapi, it should strictly check this flag before going ahead to do
> hot-reset.

The group-opened model cannot consistently predict whether the user can
provide proof-of-ownership.  I don't think we should define a flag
simply because there's a case that we can predict; the definition of
that flag becomes problematic.  Let's not complicate the interface by
trying to optimize a case that will likely never exist in practice and
can be handled via the existing legacy API.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18 12:57                                                               ` Jason Gunthorpe
@ 2023-04-18 18:39                                                                 ` Alex Williamson
  2023-04-20 12:10                                                                   ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-18 18:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao,  Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Tue, 18 Apr 2023 09:57:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> > On Mon, 17 Apr 2023 16:31:56 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > this means that regardless of which device calls INFO, there's only one
> > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > and Jason agreed to, I think the output would be:
> > > > 
> > > > flags: NOT_RESETABLE | DEV_ID
> > > > {
> > > >   { valid devA-id,  devA-BDF },
> > > >   { valid devC-id,  devC-BDF },
> > > >   { valid devD-id,  devD-BDF },
> > > >   { invalid dev-id, devE-BDF },
> > > > }
> > > > 
> > > > Here devB gets dropped because the kernel understands that devB is
> > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > hot-reset.    
> > > 
> > > I don't think we want to drop anything because it makes the API
> > > ill suited for the debugging purpose.
> > > 
> > > devb should be returned with an invalid dev_id if I understand your
> > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > make the debugging a bit better.
> > > 
> > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > proceeds or not, and it should use the valid dev_id list to iterate
> > > over the devices it has open to do the config stuff.  
> > 
> > If an affected device is owned, not opened, and not interfering with
> > the reset, what is it adding to the API to report it for debugging
> > purposes?  
> 
> It lets it print the entire group of devices, this is the only way
> something can learn the actual list of all BDFs affected.

If we do so, userspace must be able to differentiate which devices are
blocking, which necessitates at least a bi-modal invalid dev-id.

> dev_id can just return 0, we don't need a complex bitmap. Userspace
> looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.

I'm having trouble with a succinct definition of dev-id == 0, is it "A
device affected by the hot-reset, which does not directly
contribute to the availability of the hot-reset, ex. an unopened device
within the same IOMMU group as an opened device (ie. this is not the
device responsible if hot-reset is unavailable).  Whereas dev-id < 0
(== -1) is an affected device which prevents hot-reset, ex. an un-owned
device, device configured within a different iommufd_ctx, or device
opened outside of the vfio cdev API."  Is that about right?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18 18:39                                                                 ` Alex Williamson
@ 2023-04-20 12:10                                                                   ` Liu, Yi L
  2023-04-20 14:08                                                                     ` Alex Williamson
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-20 12:10 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 19, 2023 2:39 AM
> 
> On Tue, 18 Apr 2023 09:57:32 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:
> > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:
> > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > this means that regardless of which device calls INFO, there's only one
> > > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > > and Jason agreed to, I think the output would be:
> > > > >
> > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > {
> > > > >   { valid devA-id,  devA-BDF },
> > > > >   { valid devC-id,  devC-BDF },
> > > > >   { valid devD-id,  devD-BDF },
> > > > >   { invalid dev-id, devE-BDF },
> > > > > }
> > > > >
> > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > hot-reset.
> > > >
> > > > I don't think we want to drop anything because it makes the API
> > > > ill suited for the debugging purpose.
> > > >
> > > > devb should be returned with an invalid dev_id if I understand your
> > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > make the debugging a bit better.
> > > >
> > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > over the devices it has open to do the config stuff.
> > >
> > > If an affected device is owned, not opened, and not interfering with
> > > the reset, what is it adding to the API to report it for debugging
> > > purposes?
> >
> > It lets it print the entire group of devices, this is the only way
> > something can learn the actual list of all BDFs affected.
> 
> If we do so, userspace must be able to differentiate which devices are
> blocking, which necessitates at least a bi-modal invalid dev-id.
> 
> > dev_id can just return 0, we don't need a complex bitmap. Userspace
> > looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.
> 
> I'm having trouble with a succinct definition of dev-id == 0, is it "A
> device affected by the hot-reset reset, which does not directly
> contribute to the availability of the hot-reset, ex. an unopened device
> within the same IOMMU group as an opened device (ie. this is not the
> device responsible if hot-reset is unavailable). 

Hiding this device from the list looks fine to me. But the calling user
should not open any new device before the hot-reset finishes; otherwise
it may miss a device that needs pre/post reset handling. I think this
requirement is acceptable. Is it?

> Whereas dev-id < 0
> (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> device, device configured within a different iommufd_ctx, or device
> opened outside of the vfio cdev API."  Is that about right?  Thanks,

Do you mean to have separate error codes for the three possibilities?
The devid is generated by iommufd and is a u32, so I'm not sure we can
define such error codes without reserving some IDs in iommufd.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-20 12:10                                                                   ` Liu, Yi L
@ 2023-04-20 14:08                                                                     ` Alex Williamson
  2023-04-21 22:35                                                                       ` Jason Gunthorpe
  2023-04-26  7:22                                                                       ` Liu, Yi L
  0 siblings, 2 replies; 145+ messages in thread
From: Alex Williamson @ 2023-04-20 14:08 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Thu, 20 Apr 2023 12:10:20 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, April 19, 2023 2:39 AM
> > 
> > On Tue, 18 Apr 2023 09:57:32 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Mon, Apr 17, 2023 at 02:06:42PM -0600, Alex Williamson wrote:  
> > > > On Mon, 17 Apr 2023 16:31:56 -0300
> > > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > > >  
> > > > > On Mon, Apr 17, 2023 at 01:01:40PM -0600, Alex Williamson wrote:  
> > > > > > Yes, it's not trivial, but Jason is now proposing that we consider
> > > > > > mixing groups, cdevs, and multiple iommufd_ctxs as invalid.  I think
> > > > > > this means that regardless of which device calls INFO, there's only one
> > > > > > answer (assuming same set of devices opened, all cdev, all within same
> > > > > > iommufd_ctx).  Based on what I explained about my understanding of INFO2
> > > > > > and Jason agreed to, I think the output would be:
> > > > > >
> > > > > > flags: NOT_RESETABLE | DEV_ID
> > > > > > {
> > > > > >   { valid devA-id,  devA-BDF },
> > > > > >   { valid devC-id,  devC-BDF },
> > > > > >   { valid devD-id,  devD-BDF },
> > > > > >   { invalid dev-id, devE-BDF },
> > > > > > }
> > > > > >
> > > > > > Here devB gets dropped because the kernel understands that devB is
> > > > > > unopened, affected, and owned.  It's therefore not a blocker for
> > > > > > hot-reset.  
> > > > >
> > > > > I don't think we want to drop anything because it makes the API
> > > > > ill suited for the debugging purpose.
> > > > >
> > > > > devb should be returned with an invalid dev_id if I understand your
> > > > > example. Maybe it should return with -1 as the dev_id instead of 0, to
> > > > > make the debugging a bit better.
> > > > >
> > > > > Userspace should look at only NOT_RESETTABLE to determine if it
> > > > > proceeds or not, and it should use the valid dev_id list to iterate
> > > > > over the devices it has open to do the config stuff.  
> > > >
> > > > If an affected device is owned, not opened, and not interfering with
> > > > the reset, what is it adding to the API to report it for debugging
> > > > purposes?  
> > >
> > > It lets it print the entire group of devices, this is the only way
> > > something can learn the actual list of all BDFs affected.  
> > 
> > If we do so, userspace must be able to differentiate which devices are
> > blocking, which necessitates at least a bi-modal invalid dev-id.
> >   
> > > dev_id can just return 0, we don't need a complex bitmap. Userspace
> > > looks at the flag, if !NOT_RESETABLE then it ignores dev_id=0.  
> > 
> > I'm having trouble with a succinct definition of dev-id == 0, is it "A
> > device affected by the hot-reset reset, which does not directly
> > contribute to the availability of the hot-reset, ex. an unopened device
> > within the same IOMMU group as an opened device (ie. this is not the
> > device responsible if hot-reset is unavailable).   
> 
> Hide this device in the list looks fine to me. But the calling user should
> not do any new device open before finishing hot-reset. Otherwise, user may
> miss a device that needs to do pre/post reset. I think this requirement is
> acceptable. Is it? 

I think Kevin and Jason are leaning towards reporting the entire
dev-set.  The INFO ioctl has always been a point-in-time reading, no
guarantees are made if the host or user configuration is changed.
Nothing changes in that respect.

> > Whereas dev-id < 0
> > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > device, device configured within a different iommufd_ctx, or device
> > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> 
> Do you mean to have separate err-code for the three possibilities? As
> the devid is generated by iommufd and it is u32. I'm not sure if we can
> have such err-code definition without reserving some ids in iommufd. 

Yes, if we're going to report the full dev-set, I think we need at
least two unique error codes or else the user has no way to determine
the subset of invalid dev-ids which block the reset.  I think Jason is
proposing the set of valid dev-ids are >0, a dev-id of zero indicates
some form of non-blocking, while <0 (or maybe specifically -1)
indicates a blocking device.  I was trying to get consensus on a formal
definition of each of those error codes in my previous reply.  Thanks,

Alex
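
For illustration only, a minimal userspace sketch of the dev-id
convention described above: valid IDs are > 0, 0 marks an affected but
non-blocking device, and -1 marks a blocking device. The macro, enum and
function names are hypothetical; the thread had not agreed on any uapi
naming at this point.

```c
#include <stdint.h>

/* Hypothetical names: no uapi constants had been agreed on in the thread. */
#define DEVID_NOT_BLOCKING 0u         /* affected, but not preventing hot-reset */
#define DEVID_BLOCKING     UINT32_MAX /* -1 when the u32 is read as signed */

enum devid_class {
	DEVID_CLASS_VALID,        /* a real dev-id the caller can act on */
	DEVID_CLASS_NOT_BLOCKING, /* dev-id == 0 */
	DEVID_CLASS_BLOCKING,     /* dev-id == -1 */
};

/* Map a reported dev-id onto the three cases discussed in the thread. */
static enum devid_class classify_devid(uint32_t devid)
{
	if (devid == DEVID_BLOCKING)
		return DEVID_CLASS_BLOCKING;
	if (devid == DEVID_NOT_BLOCKING)
		return DEVID_CLASS_NOT_BLOCKING;
	return DEVID_CLASS_VALID;
}
```

A caller would run classify_devid() over every entry returned by the
INFO ioctl and only do per-device pre/post reset work for the
DEVID_CLASS_VALID entries.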


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device
  2023-04-05 11:48   ` Eric Auger
@ 2023-04-21  7:06     ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-21  7:06 UTC (permalink / raw)
  To: eric.auger, alex.williamson, jgg, Tian, Kevin
  Cc: linux-s390, yi.y.sun, kvm, mjrosato, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, suravee.suthikulpanit, intel-gfx,
	chao.p.peng, lulu, robin.murphy, jasowang, Jiang, Yanting

> From: Eric Auger <eric.auger@redhat.com>
> Sent: Wednesday, April 5, 2023 7:48 PM
> 
> On 4/1/23 16:44, Yi Liu wrote:
> > There are users that need to check if vfio_device is opened as cdev.
> > e.g. vfio-pci. This adds a flag in vfio_device, it will be set in the
> > cdev path when device is opened. This is not used at this moment, but
> > a preparation for vfio device cdev support.
> 
> better to squash this patch with the patch setting cdev_opened then?

But that would be in the cdev series. Maybe this patch could add only
the helper, returning false, and the patch below could add cdev_opened.
Would that be better?

https://lore.kernel.org/kvm/20230401151833.124749-23-yi.l.liu@intel.com/

> Thanks
> 
> Eric
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > ---
> >  include/linux/vfio.h | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index f8fb9ab25188..d9a0770e5fc1 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -62,6 +62,7 @@ struct vfio_device {
> >  	struct iommufd_device *iommufd_device;
> >  	bool iommufd_attached;
> >  #endif
> > +	bool cdev_opened;
> >  };
> >
> >  /**
> > @@ -151,6 +152,12 @@ vfio_iommufd_physical_devid(struct vfio_device *vdev,
> u32 *id)
> >  	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
> >  #endif
> >
> > +static inline bool vfio_device_cdev_opened(struct vfio_device *device)
> > +{
> > +	lockdep_assert_held(&device->dev_set->lock);
> > +	return device->cdev_opened;
> > +}
> > +
> >  /**
> >   * @migration_set_state: Optional callback to change the migration state for
> >   *         devices that support migration. It's mandatory for


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device
  2023-04-04 21:48     ` Alex Williamson
@ 2023-04-21  7:11       ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-21  7:11 UTC (permalink / raw)
  To: Alex Williamson, Eric Auger
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting, joro,
	nicolinc, jgg, Zhao, Yan Y, intel-gfx, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 5, 2023 5:49 AM
> On Tue, 4 Apr 2023 17:28:40 +0200
> Eric Auger <eric.auger@redhat.com> wrote:
> 
> > Hi,
> >
> > On 4/1/23 16:44, Yi Liu wrote:
> > > This is needed by the vfio-pci driver to report affected devices in the
> > > hot reset for a given device.
> > >
> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Tested-by: Yanting Jiang <yanting.jiang@intel.com>
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > ---
> > >  drivers/iommu/iommufd/device.c | 12 ++++++++++++
> > >  drivers/vfio/iommufd.c         | 14 ++++++++++++++
> > >  include/linux/iommufd.h        |  3 +++
> > >  include/linux/vfio.h           | 13 +++++++++++++
> > >  4 files changed, 42 insertions(+)
> > >
> > > diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> > > index 25115d401d8f..04a57aa1ae2c 100644
> > > --- a/drivers/iommu/iommufd/device.c
> > > +++ b/drivers/iommu/iommufd/device.c
> > > @@ -131,6 +131,18 @@ void iommufd_device_unbind(struct iommufd_device
> *idev)
> > >  }
> > >  EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
> > >
> > > +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
> > > +{
> > > +	return idev->ictx;
> > > +}
> > > +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
> > > +
> > > +u32 iommufd_device_to_id(struct iommufd_device *idev)
> > > +{
> > > +	return idev->obj.id;
> > > +}
> > > +EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, IOMMUFD);
> > > +
> > >  static int iommufd_device_setup_msi(struct iommufd_device *idev,
> > >  				    struct iommufd_hw_pagetable *hwpt,
> > >  				    phys_addr_t sw_msi_start)
> > > diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> > > index 88b00c501015..809f2dd73b9e 100644
> > > --- a/drivers/vfio/iommufd.c
> > > +++ b/drivers/vfio/iommufd.c
> > > @@ -66,6 +66,20 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
> > >  		vdev->ops->unbind_iommufd(vdev);
> > >  }
> > >
> > > +struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev)
> > > +{
> > > +	if (!vdev->iommufd_device)
> > > +		return NULL;
> > > +	return iommufd_device_to_ictx(vdev->iommufd_device);
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ictx);
> > > +
> > > +void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
> > > +{
> > > +	if (vdev->iommufd_device)
> > > +		*id = iommufd_device_to_id(vdev->iommufd_device);
> > since there is no return value, may be worth to add at least a WARN_ON
> > in case of !vdev->iommufd_device

This may be a user-triggerable warning if the input device is not bound
to iommufd.

> Yeah, this is bizarre and makes the one caller of this interface very
> awkward.  We later go on to define IOMMUFD_INVALID_ID, so this should
> simply return that in the case of no iommufd_device and skip this
> unnecessary pointer passing.  Thanks,

OK. Then it can also return the invalid ID when !CONFIG_IOMMUFD. This
also needs to wait for the decision in the thread discussing the error
codes.

Regards,
Yi Liu
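
For illustration only, a sketch of the shape suggested above, where the
helper returns the ID directly instead of writing through a pointer.
The two structs are simplified userspace stand-ins for the kernel types
(the patch reads idev->obj.id), and the value given to
IOMMUFD_INVALID_ID here is a placeholder, since the error-code values
were still under discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types, so the sketch compiles alone. */
struct iommufd_device {
	uint32_t id;
};

struct vfio_device {
	struct iommufd_device *iommufd_device; /* NULL when not bound to iommufd */
};

/* Placeholder value: the real definition depends on the error-code thread. */
#define IOMMUFD_INVALID_ID 0u

/* Return the dev-id, or IOMMUFD_INVALID_ID when no iommufd_device exists. */
static uint32_t vfio_iommufd_physical_devid(const struct vfio_device *vdev)
{
	if (!vdev->iommufd_device)
		return IOMMUFD_INVALID_ID;
	return vdev->iommufd_device->id;
}
```

With this shape the caller needs no pre-initialised output variable, and
the !CONFIG_IOMMUFD stub can simply return IOMMUFD_INVALID_ID.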

> Alex
> 
> > > +}
> > > +EXPORT_SYMBOL_GPL(vfio_iommufd_physical_devid);
> > >  /*
> > >   * The physical standard ops mean that the iommufd_device is bound to the
> > >   * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
> > > diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
> > > index 1129a36a74c4..ac96df406833 100644
> > > --- a/include/linux/iommufd.h
> > > +++ b/include/linux/iommufd.h
> > > @@ -24,6 +24,9 @@ void iommufd_device_unbind(struct iommufd_device *idev);
> > >  int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
> > >  void iommufd_device_detach(struct iommufd_device *idev);
> > >
> > > +struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
> > > +u32 iommufd_device_to_id(struct iommufd_device *idev);
> > > +
> > >  struct iommufd_access_ops {
> > >  	u8 needs_pin_pages : 1;
> > >  	void (*unmap)(void *data, unsigned long iova, unsigned long length);
> > > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > > index 3188d8a374bd..97a1174b922f 100644
> > > --- a/include/linux/vfio.h
> > > +++ b/include/linux/vfio.h
> > > @@ -113,6 +113,8 @@ struct vfio_device_ops {
> > >  };
> > >
> > >  #if IS_ENABLED(CONFIG_IOMMUFD)
> > > +struct iommufd_ctx *vfio_iommufd_physical_ictx(struct vfio_device *vdev);
> > > +void vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id);
> > >  int vfio_iommufd_physical_bind(struct vfio_device *vdev,
> > >  			       struct iommufd_ctx *ictx, u32 *out_device_id);
> > >  void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
> > > @@ -122,6 +124,17 @@ int vfio_iommufd_emulated_bind(struct vfio_device
> *vdev,
> > >  void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
> > >  int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
> > >  #else
> > > +static inline struct iommufd_ctx *
> > > +vfio_iommufd_physical_ictx(struct vfio_device *vdev)
> > > +{
> > > +	return NULL;
> > > +}
> > > +
> > > +static inline void
> > > +vfio_iommufd_physical_devid(struct vfio_device *vdev, u32 *id)
> > > +{
> > > +}
> > > +
> > >  #define vfio_iommufd_physical_bind                                      \
> > >  	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
> > >  		  u32 *out_device_id)) NULL)
> > besides
> >
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> >
> > Eric
> >


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-20 14:08                                                                     ` Alex Williamson
@ 2023-04-21 22:35                                                                       ` Jason Gunthorpe
  2023-04-23 14:46                                                                         ` Liu, Yi L
  2023-04-26  7:22                                                                       ` Liu, Yi L
  1 sibling, 1 reply; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-21 22:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, Jiang,
	Yanting, joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote:

> > Hide this device in the list looks fine to me. But the calling user should
> > not do any new device open before finishing hot-reset. Otherwise, user may
> > miss a device that needs to do pre/post reset. I think this requirement is
> > acceptable. Is it? 
> 
> I think Kevin and Jason are leaning towards reporting the entire
> dev-set.  The INFO ioctl has always been a point-in-time reading, no
> guarantees are made if the host or user configuration is changed.
> Nothing changes in that respect.

Yeah, I think your point about the qemu community forums suggests we
should err toward having qemu provide some fully detailed debug report.
 
> > > Whereas dev-id < 0
> > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > device, device configured within a different iommufd_ctx, or device
> > > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> > 
> > Do you mean to have separate err-code for the three possibilities? As
> > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > have such err-code definition without reserving some ids in iommufd. 
> 
> Yes, if we're going to report the full dev-set, I think we need at
> least two unique error codes or else the user has no way to determine
> the subset of invalid dev-ids which block the reset.

If you think this is important to report, we should report 0 and -1,
and adjust the iommufd xarray allocator to reserve -1.

It depends on what you want to show for the debugging.

E.g. if we have debugging where qemu dumps this table:

   BDF   In VM   iommu_group   Has VFIO driver   Has Kernel Driver

By also doing various sysfs probes based on the BDF, then the admin
action to remedy the situation is:

Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in
the table to make the reset work.

And we don't need the distinction. Adding the 0/-1 lets you make a
useful table without doing any sysfs work.

> I think Jason is proposing the set of valid dev-ids are >0, a dev-id
> of zero indicates some form of non-blocking, while <0 (or maybe
> specifically -1) indicates a blocking device.

Yes, 0 and -1 would be fine with those definitions. The only use of
the data is to add a 'blocking use of reset' column to the table
above.

Thanks,
Jason
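
For illustration only, a sketch of how a userspace tool might fill the
'blocking use of reset' column from the reported dev-ids, assuming -1
(UINT32_MAX as a u32) marks a blocking device. struct reset_entry and
both function names are hypothetical and not part of any uapi.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace view of one affected device. */
struct reset_entry {
	const char *bdf; /* e.g. "0000:01:00.0" */
	uint32_t devid;  /* as reported by the INFO ioctl */
};

/* Only dev-id == -1 is treated as blocking; 0 and valid IDs are not. */
static const char *blocking_column(uint32_t devid)
{
	return devid == UINT32_MAX ? "yes" : "no";
}

/* Format one table row as "<BDF>  <blocking>"; returns snprintf()'s result. */
static int format_row(char *buf, size_t len, const struct reset_entry *e)
{
	return snprintf(buf, len, "%-12s  %s", e->bdf,
			blocking_column(e->devid));
}
```

The remaining columns of the table (In VM, iommu_group, driver state)
would still come from sysfs probes keyed on the BDF.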

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-18 13:02                                                       ` Jason Gunthorpe
@ 2023-04-23 10:28                                                         ` Liu, Yi L
  2023-04-24 17:38                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-23 10:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, April 18, 2023 9:02 PM
> 
> On Tue, Apr 18, 2023 at 10:23:55AM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Monday, April 17, 2023 9:39 PM
> > >
> > > On Fri, Apr 14, 2023 at 09:11:30AM +0000, Tian, Kevin wrote:
> > >
> > > > The only corner case with this option is when a user mixes group
> > > > and cdev usages. iirc you mentioned it's a valid usage to be supported.
> > > > In that case the kernel doesn't have sufficient knowledge to judge
> > > > 'resettable' as it doesn't know which groups are opened by this user.
> > >
> > > IMHO we don't need to support this combination.
> >
> > Do you mean we don't support hot-reset for this combination or we don't
> > support user using this combination. I guess the prior one. Right?
> 
> Yes
> 
> > Ditto. We just fail hot-reset for the multiple iommufds case. Is it?
> 
> Yes
> 
> > > I suppose we should have done that from the beginning - no-iommu is an
> > > IOMMUFD access, it just uses a crazy /proc based way to learn the
> > > PFNs. Making it a proper access and making a real VFIO ioctl that
> > > calls iommufd_access_pin_pages() and returns the DMA mapped addresses
> > > to userspace would go a long way to making no-iommu work in a logical,
> > > usable, way.
> >
> > This seems to be an improvement for noiommu mode. It can be done later.
> > For now, generating access_id and binding noiommu devices with iommufdctx
> > is enough for supporting noiommu hot-reset.
> 
> Yes, I'm not sure there is much value in improving no-iommu unless
> someone also wants to go in and update dpdk.
> 
> At some point we will need to revise dpdk to use iommufd, maybe that
> would be a good time to fix this too.

This noiommu improvement would allow the user to attach an ioas to
noiommu devices, is that right? It may be done by calling
iommufd_access_attach(). So a quick question: in the cdev series, shall
we allow the attachment for noiommu? I think the noiommu improvement
requires extra effort, so it is not ready yet. If so, it seems I just
need to fail the attachment for noiommu devices. But when it is ready in
the future, how can userspace know that attach is allowed for noiommu
devices? Will that be easy? Or should we just make attach a no-op that
always succeeds for noiommu devices? Any suggestions?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-21 22:35                                                                       ` Jason Gunthorpe
@ 2023-04-23 14:46                                                                         ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-23 14:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Saturday, April 22, 2023 6:36 AM
> 
> On Thu, Apr 20, 2023 at 08:08:39AM -0600, Alex Williamson wrote:
> 
> > > Hide this device in the list looks fine to me. But the calling user should
> > > not do any new device open before finishing hot-reset. Otherwise, user may
> > > miss a device that needs to do pre/post reset. I think this requirement is
> > > acceptable. Is it?
> >
> > I think Kevin and Jason are leaning towards reporting the entire
> > dev-set.  The INFO ioctl has always been a point-in-time reading, no
> > guarantees are made if the host or user configuration is changed.
> > Nothing changes in that respect.
> 
> Yeah, I think your point about qemu community formus suggest we should
> err toward having qemu provide some fully detailed debug report.
> 
> > > > Whereas dev-id < 0
> > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > device, device configured within a different iommufd_ctx, or device
> > > > opened outside of the vfio cdev API."  Is that about right?  Thanks,
> > >
> > > Do you mean to have separate err-code for the three possibilities? As
> > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > have such err-code definition without reserving some ids in iommufd.
> >
> > Yes, if we're going to report the full dev-set, I think we need at
> > least two unique error codes or else the user has no way to determine
> > the subset of invalid dev-ids which block the reset.
> 
> If you think this is important to report we should report 0 and -1,
> and adjust the iommufd xarray allocator to reserve -1

Then the alloc range should be from 1 to 0xfffffffe, so that both 0 and
-1 (0xffffffff) stay reserved.
 
> 
> It depends what you want to show for the debugging.
> 
> eg if we have debugging where qemu dumps this table:
> 
>    BDF   In VM   iommu_group   Has VFIO driver   Has Kernel Driver
> 
> By also doing various sysfs probes based on the BDF, then the admin
> action to remedy the situation is:
> 
> Make "Has VFIO driver = y" or "Has Kernel Driver = n" for every row in
> the table to make the reset work.
> 
> And we don't need the distinction. Adding the 0/-1 lets you make a
> useful table without doing any sysfs work.
>
> > I think Jason is proposing the set of valid dev-ids are >0, a dev-id
> > of zero indicates some form of non-blocking, while <0 (or maybe
> > specifically -1) indicates a blocking device.
> 
> Yes, 0 and -1 would be fine with those definitions. The only use of
> the data is to add a 'blocking use of reset' colum to the table
> above..

Should -1 and 0 be defined in the uapi as well? If yes, it does not seem
easy to find proper names for them. Or we could just document in the
vfio uapi header that -1 means blocking and 0 means no devid but not
blocking.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-23 10:28                                                         ` Liu, Yi L
@ 2023-04-24 17:38                                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 145+ messages in thread
From: Jason Gunthorpe @ 2023-04-24 17:38 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Sun, Apr 23, 2023 at 10:28:58AM +0000, Liu, Yi L wrote:

> This noiommu improvement shall allow user to attach ioas to noiommu devices.
> is it? This may be done by calling iommufd_access_attach(). So there is a
> quick question. In the cdev series, shall we allow the attachment
> for noiommu?

Yes, I think we need to undo the decision we talked about earlier
where no-iommu would be asked for with a -1 iommufd.

All vfio_devices should have an iommufd_ctx when container is compiled
out.

You don't need to do anything with the ctx for no-iommu beyond demand
that userspace provide it.

Jason

^ permalink raw reply	[flat|nested] 145+ messages in thread

* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-20 14:08                                                                     ` Alex Williamson
  2023-04-21 22:35                                                                       ` Jason Gunthorpe
@ 2023-04-26  7:22                                                                       ` Liu, Yi L
  2023-04-26 13:20                                                                         ` Alex Williamson
  1 sibling, 1 reply; 145+ messages in thread
From: Liu, Yi L @ 2023-04-26  7:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan,  Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, April 20, 2023 10:09 PM
[...]
> > > Whereas dev-id < 0
> > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > device, device configured within a different iommufd_ctx, or device
> > > opened outside of the vfio cdev API."  Is that about right?  Thanks,
> >
> > Do you mean to have separate err-code for the three possibilities? As
> > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > have such err-code definition without reserving some ids in iommufd.
> 
> Yes, if we're going to report the full dev-set, I think we need at
> least two unique error codes or else the user has no way to determine
> the subset of invalid dev-ids which block the reset.  I think Jason is
> proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> some form of non-blocking, while <0 (or maybe specifically -1)
> indicates a blocking device.  I was trying to get consensus on a formal
> definition of each of those error codes in my previous reply.  Thanks,

It seems the RESETTABLE flag is not needed if we report -1 for the
devices that block hot-reset. Userspace can deduce whether the calling
device is resettable by checking if there is any -1 in the affected
device list.

Regards,
Yi Liu
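
A minimal sketch of the userspace check described above, assuming the
encoding proposed in this thread (dev-id > 0 is a valid owned device, 0 is
an affected but non-blocking device, and a negative value such as -1 is a
blocking device). The function names and the int32_t array are hypothetical
and only illustrate the deduction; they are not part of any merged uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical encoding, taken from the proposal under discussion:
 *   > 0  valid dev-id within the caller's iommufd context
 *   == 0 affected device that does not block the reset
 *   < 0  affected device that blocks the reset (e.g. -1)
 */
static size_t hot_reset_blocker_count(const int32_t *dev_ids, size_t count)
{
	size_t blockers = 0;

	assert(dev_ids != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (dev_ids[i] < 0)
			blockers++;
	}
	return blockers;
}

/* The calling device is resettable only if no affected device blocks it. */
static bool hot_reset_possible(const int32_t *dev_ids, size_t count)
{
	return hot_reset_blocker_count(dev_ids, count) == 0;
}
```

With this encoding the per-device list carries the same information as a
single RESETTABLE bit, at the cost of requiring userspace to fetch and walk
the whole array.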


* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-26  7:22                                                                       ` Liu, Yi L
@ 2023-04-26 13:20                                                                         ` Alex Williamson
  2023-04-26 15:08                                                                           ` Liu, Yi L
  0 siblings, 1 reply; 145+ messages in thread
From: Alex Williamson @ 2023-04-26 13:20 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, 26 Apr 2023 07:22:17 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, April 20, 2023 10:09 PM  
> [...]
> > > > Whereas dev-id < 0
> > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > device, device configured within a different iommufd_ctx, or device
> > > > opened outside of the vfio cdev API."  Is that about right?  Thanks,  
> > >
> > > Do you mean to have separate err-code for the three possibilities? As
> > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > have such err-code definition without reserving some ids in iommufd.  
> > 
> > Yes, if we're going to report the full dev-set, I think we need at
> > least two unique error codes or else the user has no way to determine
> > the subset of invalid dev-ids which block the reset.  I think Jason is
> > proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> > some form of non-blocking, while <0 (or maybe specifically -1)
> > indicates a blocking device.  I was trying to get consensus on a formal
> > definition of each of those error codes in my previous reply.  Thanks,  
> 
> Seems like RESETTABLE flag is not needed if we report -1 for the devices
> that block hotreset. Userspace can deduce if the calling device is resettable
> or not by checking if there is any -1 in the affected device list.

There is some redundancy there, yes.  Given the desire for a null array
on the actual reset ioctl I assumed there would also be a desire to
streamline the info ioctl such that userspace isn't required to parse
the return array, for example maybe userspace isn't required to pass a
full buffer and can get the reset availability status from only the
header.  Of course it's still the responsibility of userspace to know
the extent of the reset.  Thanks,

Alex
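
A minimal sketch of the header-only query suggested above, assuming the
info ioctl fills in a flags word even when userspace passes no device
array. The struct is declared locally and only follows the common
argsz/flags/count pattern of vfio info ioctls; the flag bit value and both
helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header, declared here for illustration only. */
struct reset_info_hdr {
	uint32_t argsz;  /* size of the buffer userspace passed in */
	uint32_t flags;  /* status bits reported by the kernel */
	uint32_t count;  /* number of affected devices */
};

#define RESET_INFO_FLAG_RESETTABLE (1u << 0) /* hypothetical bit position */

/* Reset availability can be read from the header alone. */
static bool reset_available(const struct reset_info_hdr *hdr)
{
	assert(hdr != NULL);
	return (hdr->flags & RESET_INFO_FLAG_RESETTABLE) != 0;
}

/* Buffer size needed if userspace does want the full device list. */
static size_t full_query_size(const struct reset_info_hdr *hdr,
			      size_t entry_size)
{
	assert(hdr != NULL);
	return sizeof(*hdr) + (size_t)hdr->count * entry_size;
}
```

Userspace that only needs a yes/no answer stops after reset_available();
userspace that needs to know the extent of the reset still allocates
full_query_size() bytes and repeats the query.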



* Re: [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO
  2023-04-26 13:20                                                                         ` Alex Williamson
@ 2023-04-26 15:08                                                                           ` Liu, Yi L
  0 siblings, 0 replies; 145+ messages in thread
From: Liu, Yi L @ 2023-04-26 15:08 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, Duan, Zhenzhong, peterx, Xu,
	Terrence, chao.p.peng, linux-s390, kvm, lulu, Jiang, Yanting,
	joro, nicolinc, Jason Gunthorpe, Zhao, Yan Y, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, April 26, 2023 9:20 PM
> 
> On Wed, 26 Apr 2023 07:22:17 +0000
> "Liu, Yi L" <yi.l.liu@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, April 20, 2023 10:09 PM
> > [...]
> > > > > Whereas dev-id < 0
> > > > > (== -1) is an affected device which prevents hot-reset, ex. an un-owned
> > > > > device, device configured within a different iommufd_ctx, or device
> > > > > opened outside of the vfio cdev API."  Is that about right?  Thanks,
> > > >
> > > > Do you mean to have separate err-code for the three possibilities? As
> > > > the devid is generated by iommufd and it is u32. I'm not sure if we can
> > > > have such err-code definition without reserving some ids in iommufd.
> > >
> > > Yes, if we're going to report the full dev-set, I think we need at
> > > least two unique error codes or else the user has no way to determine
> > > the subset of invalid dev-ids which block the reset.  I think Jason is
> > > proposing the set of valid dev-ids are >0, a dev-id of zero indicates
> > > some form of non-blocking, while <0 (or maybe specifically -1)
> > > indicates a blocking device.  I was trying to get consensus on a formal
> > > definition of each of those error codes in my previous reply.  Thanks,
> >
> > Seems like RESETTABLE flag is not needed if we report -1 for the devices
> > that block hotreset. Userspace can deduce if the calling device is resettable
> > or not by checking if there is any -1 in the affected device list.
> 
> There is some redundancy there, yes.  Given the desire for a null array
> on the actual reset ioctl I assumed there would also be a desire to
> streamline the info ioctl such that userspace isn't required to parse
> the return array, for example maybe userspace isn't required to pass a
> full buffer and can get the reset availability status from only the
> header.  Of course it's still the responsibility of userspace to know
> the extent of the reset.  Thanks,

I have kept it and sent a refreshed version for hot-reset. 😊

https://lore.kernel.org/kvm/20230426145419.450922-9-yi.l.liu@intel.com/

Regards,
Yi Liu


end of thread, other threads:[~2023-04-26 15:08 UTC | newest]

Thread overview: 145+ messages
2023-04-01 14:44 [Intel-gfx] [PATCH v3 00/12] Introduce new methods for verifying ownership in vfio PCI hot reset Yi Liu
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 01/12] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
2023-04-04 13:59   ` Eric Auger
2023-04-04 14:37     ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 02/12] vfio/pci: Only check ownership of opened devices in hot reset Yi Liu
2023-04-04 13:59   ` Eric Auger
2023-04-04 14:37     ` Liu, Yi L
2023-04-04 15:18       ` Eric Auger
2023-04-04 15:29         ` Liu, Yi L
2023-04-04 15:59           ` Eric Auger
2023-04-05 11:41             ` Jason Gunthorpe
2023-04-05 15:14               ` Eric Auger
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 03/12] vfio/pci: Move the existing hot reset logic to be a helper Yi Liu
2023-04-04 13:59   ` Eric Auger
2023-04-04 14:24     ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 04/12] vfio-iommufd: Add helper to retrieve iommufd_ctx and devid for vfio_device Yi Liu
2023-04-04 15:28   ` Eric Auger
2023-04-04 21:48     ` Alex Williamson
2023-04-21  7:11       ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 05/12] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
2023-04-04 16:54   ` Eric Auger
2023-04-04 20:18   ` Alex Williamson
2023-04-05  7:55     ` Liu, Yi L
2023-04-05  8:01       ` Liu, Yi L
2023-04-05 15:36         ` Alex Williamson
2023-04-05 16:46           ` Jason Gunthorpe
2023-04-05  8:02     ` Eric Auger
2023-04-05  8:09       ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 06/12] vfio: Refine vfio file kAPIs for vfio PCI hot reset Yi Liu
2023-04-05  8:27   ` Eric Auger
2023-04-05  9:23     ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 07/12] vfio: Accpet device file from vfio PCI hot reset path Yi Liu
2023-04-04 20:31   ` Alex Williamson
2023-04-05  8:07   ` Eric Auger
2023-04-05  8:10     ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 08/12] vfio/pci: Renaming for accepting device fd in " Yi Liu
2023-04-04 21:23   ` Alex Williamson
2023-04-05  9:32   ` Eric Auger
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 09/12] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
2023-04-05  9:36   ` Eric Auger
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 10/12] vfio: Mark cdev usage in vfio_device Yi Liu
2023-04-05 11:48   ` Eric Auger
2023-04-21  7:06     ` Liu, Yi L
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 11/12] iommufd: Define IOMMUFD_INVALID_ID in uapi Yi Liu
2023-04-04 21:00   ` Alex Williamson
2023-04-05  9:31     ` Liu, Yi L
2023-04-05 15:13       ` Alex Williamson
2023-04-05 15:17         ` Liu, Yi L
2023-04-05 11:46   ` Eric Auger
2023-04-01 14:44 ` [Intel-gfx] [PATCH v3 12/12] vfio/pci: Report dev_id in VFIO_DEVICE_GET_PCI_HOT_RESET_INFO Yi Liu
2023-04-03  9:25   ` Liu, Yi L
2023-04-03 15:01     ` Alex Williamson
2023-04-03 15:22       ` Liu, Yi L
2023-04-03 15:32         ` Alex Williamson
2023-04-03 16:12           ` Jason Gunthorpe
2023-04-07 10:09       ` Liu, Yi L
2023-04-07 12:03         ` Alex Williamson
2023-04-07 13:24           ` Liu, Yi L
2023-04-07 13:51             ` Alex Williamson
2023-04-07 14:04               ` Liu, Yi L
2023-04-07 15:14                 ` Alex Williamson
2023-04-07 15:47                   ` Liu, Yi L
2023-04-07 21:07                     ` Alex Williamson
2023-04-08  5:07                       ` Liu, Yi L
2023-04-08 14:20                         ` Alex Williamson
2023-04-09 11:58                           ` Yi Liu
2023-04-09 13:29                             ` Alex Williamson
2023-04-10  8:48                               ` Liu, Yi L
2023-04-10 14:41                                 ` Alex Williamson
2023-04-10 15:18                                   ` Liu, Yi L
2023-04-10 15:23                                     ` Alex Williamson
2023-04-11 13:34                               ` Jason Gunthorpe
2023-04-11 13:33                       ` Jason Gunthorpe
2023-04-11  6:16           ` Liu, Yi L
2023-04-04 22:20   ` Alex Williamson
2023-04-05 12:19   ` Eric Auger
2023-04-05 14:04     ` Liu, Yi L
2023-04-05 16:25       ` Alex Williamson
2023-04-05 16:37         ` Jason Gunthorpe
2023-04-05 16:52           ` Alex Williamson
2023-04-05 17:23             ` Jason Gunthorpe
2023-04-05 18:56               ` Alex Williamson
2023-04-05 19:18                 ` Alex Williamson
2023-04-05 19:21                 ` Jason Gunthorpe
2023-04-05 19:49                   ` Alex Williamson
2023-04-05 23:22                     ` Jason Gunthorpe
2023-04-06 10:02                       ` Liu, Yi L
2023-04-06 17:53                         ` Alex Williamson
2023-04-07 10:09                           ` Liu, Yi L
2023-04-11 13:24                           ` Jason Gunthorpe
2023-04-11 15:54                             ` Alex Williamson
2023-04-11 17:11                               ` Alex Williamson
2023-04-11 18:40                                 ` Jason Gunthorpe
2023-04-11 21:58                                   ` Alex Williamson
2023-04-12  0:01                                     ` Jason Gunthorpe
2023-04-12  7:27                                       ` Tian, Kevin
2023-04-12 15:05                                         ` Jason Gunthorpe
2023-04-12 17:01                                           ` Alex Williamson
2023-04-13  2:57                                           ` Tian, Kevin
2023-04-12 10:09                                       ` Liu, Yi L
2023-04-12 16:54                                         ` Alex Williamson
2023-04-12 16:50                                       ` Alex Williamson
2023-04-12 20:06                                         ` Jason Gunthorpe
2023-04-13  8:25                                           ` Tian, Kevin
2023-04-13 11:50                                             ` Jason Gunthorpe
2023-04-13 14:35                                               ` Liu, Yi L
2023-04-13 14:41                                                 ` Jason Gunthorpe
2023-04-13 18:07                                               ` Alex Williamson
2023-04-14  9:11                                                 ` Tian, Kevin
2023-04-14 11:38                                                   ` Liu, Yi L
2023-04-14 17:10                                                     ` Alex Williamson
2023-04-17  4:20                                                       ` Liu, Yi L
2023-04-17 19:01                                                         ` Alex Williamson
2023-04-17 19:31                                                           ` Jason Gunthorpe
2023-04-17 20:06                                                             ` Alex Williamson
2023-04-18  3:24                                                               ` Tian, Kevin
2023-04-18  4:10                                                                 ` Alex Williamson
2023-04-18  5:02                                                                   ` Tian, Kevin
2023-04-18 12:59                                                                     ` Jason Gunthorpe
2023-04-18 16:44                                                                     ` Alex Williamson
2023-04-18 10:34                                                                   ` Liu, Yi L
2023-04-18 16:49                                                                     ` Alex Williamson
2023-04-18 12:57                                                               ` Jason Gunthorpe
2023-04-18 18:39                                                                 ` Alex Williamson
2023-04-20 12:10                                                                   ` Liu, Yi L
2023-04-20 14:08                                                                     ` Alex Williamson
2023-04-21 22:35                                                                       ` Jason Gunthorpe
2023-04-23 14:46                                                                         ` Liu, Yi L
2023-04-26  7:22                                                                       ` Liu, Yi L
2023-04-26 13:20                                                                         ` Alex Williamson
2023-04-26 15:08                                                                           ` Liu, Yi L
2023-04-14 16:34                                                   ` Alex Williamson
2023-04-17 13:39                                                   ` Jason Gunthorpe
2023-04-18  1:28                                                     ` Tian, Kevin
2023-04-18 10:23                                                     ` Liu, Yi L
2023-04-18 13:02                                                       ` Jason Gunthorpe
2023-04-23 10:28                                                         ` Liu, Yi L
2023-04-24 17:38                                                           ` Jason Gunthorpe
2023-04-17 14:05                                                 ` Jason Gunthorpe
2023-04-12  7:14                                     ` Tian, Kevin
2023-04-06  6:34                     ` Liu, Yi L
2023-04-06 17:07                       ` Alex Williamson
2023-04-05 17:58         ` Eric Auger
2023-04-06  5:31           ` Liu, Yi L
2023-04-01 14:47 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for Introduce new methods for verifying ownership in vfio PCI hot reset (rev4) Patchwork
