intel-gfx.lists.freedesktop.org archive mirror
* [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support
@ 2023-03-08 13:28 Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 01/24] vfio: Allocate per device file structure Yi Liu
                   ` (25 more replies)
  0 siblings, 26 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

VFIO currently provides group-centric user APIs. Userspace must open
/dev/vfio/$group_id before it can get a device fd and hence gain access
to the device. This is not the desired model for iommufd. Per the
conclusion of the community discussion[1], iommufd provides device-centric
kAPIs and requires its consumers (like VFIO) to expose device-centric user
APIs. Such user APIs are used to associate a device with an iommufd and
with the I/O address spaces managed by that iommufd.

This series first introduces a per-device-file structure to prepare for
further enhancements, and refactors the kvm-vfio code so that it can
accept a device file from userspace. It also makes vfio-pci able to
accept a device fd or a zero-length fd array in the hot reset path.
The mechanism that blocks device access before the iommufd bind is part
of making vfio-pci accept a device fd. The series then refactors vfio to
handle the cdev path (e.g. iommufd binding, [de]attach ioas). This refactor
makes device_open exclusive between the group and cdev paths, allows only
a single device open in the cdev path, and reworks vfio-iommufd to support
cdev. Finally, it adds cdev support for vfio devices together with the
new ioctls, and makes the group infrastructure optional, as it is not
needed when vfio device cdev is compiled.

This series is based on preparatory work done for vfio emulated devices[2].
It is a prerequisite for iommu nesting for vfio devices[3].

The complete code can be found in the branch below. Simple tests have been
done on the legacy group path and the cdev path. A draft QEMU branch can be
found at [4].

https://github.com/yiliu1765/iommufd/tree/vfio_device_cdev_v6
(config CONFIG_IOMMUFD=y CONFIG_VFIO_DEVICE_CDEV=y)

base-commit: bb549e3c0c1c498b3729fcf3ee3b3dea5d19dde2

[1] https://lore.kernel.org/kvm/BN9PR11MB5433B1E4AE5B0480369F97178C189@BN9PR11MB5433.namprd11.prod.outlook.com/
[2] https://lore.kernel.org/kvm/20230308131340.459224-1-yi.l.liu@intel.com/#t
[3] https://lore.kernel.org/linux-iommu/20230209043153.14964-1-yi.l.liu@intel.com/
[4] https://github.com/yiliu1765/qemu/tree/iommufd_rfcv3 (it is based on Eric's
    QEMU iommufd rfcv3 (https://lore.kernel.org/kvm/20230131205305.2726330-1-eric.auger@redhat.com/)
    plus two commits to align with vfio_device_cdev v3/v4/v5/v6)

Change log:

v6:
 - Add r-b from Jason on patch 01 - 08 and 13 in v5
 - Based on the prerequisite mini-series which makes vfio emulated devices
   be prepared to cdev (Jason)
 - Add the approach to pass a set of device fds to do hot reset ownership
   check, while the zero-length array approach is also kept. (Jason, Kevin, Alex)
 - Drop patch 10 of v5, it is reworked by patch 13 and 17 in v6 (Jason)
 - Store vfio_group pointer in vfio_device_file to check if user is using
   legacy vfio container (Jason)
 - Drop the is_cdev_device flag (introduced in patch 14 of v5) as the group
   pointer stored in vfio_device_file can cover it.
 - Add iommu_group check in the cdev no-iommu path patch 24 (Kevin)
 - Add t-b from Terrence, Nicolin and Matthew (thanks for the help; some patches
   are new in this version, so I only added t-b to the patches that were also
   in v5 without big changes; t-b for the others can be added in this version).

v5: https://lore.kernel.org/kvm/20230227111135.61728-1-yi.l.liu@intel.com/
 - Add r-b from Kevin on patch 08, 13, 14, 15 and 17.
 - Rename patch 02 to limit the change for KVM facing kAPIs. The vfio pci
   hot reset path only accepts group file until patch 09. (Kevin)
 - Update comment around smp_load_acquire(&df->access_granted) (Yan)
 - Adopt Jason's suggestion on the vfio pci hot reset path, passing zero-length
   fd array to indicate using bound iommufd_ctx as ownership check. (Jason, Kevin)
 - Directly read the df->access_granted value in vfio_device_cdev_close() (Kevin, Yan, Jason)
 - Wrap the iommufd get/put into a helper to refine the error path of
   vfio_device_ioctl_bind_iommufd(). (Yan)

v4: https://lore.kernel.org/kvm/20230221034812.138051-1-yi.l.liu@intel.com/
 - Add r-b from Kevin on patch 09/10
 - Add a line in devices/vfio.rst to emphasize that the user should add the
   group/device to KVM before invoking the open_device op, which may be called
   in the VFIO_GROUP_GET_DEVICE_FD or VFIO_DEVICE_BIND_IOMMUFD ioctl.
 - Modify VFIO_GROUP/VFIO_DEVICE_CDEV Kconfig dependency (Alex)
 - Select VFIO_GROUP for SPAPR (Jason)
 - Check device fully-opened in PCI hotreset path for device fd (Jason)
 - Set df->access_granted in the caller of vfio_device_open() since
   the caller may fail in other operations, but df->access_granted
   does not allow a true to false change. So it should be set only when
   the open path is really done successfully. (Yan, Kevin)
 - Fix missing iommufd_ctx_put() in the cdev path (Yan)
 - Fix an issue found in testing exclusion between group and cdev path.
   vfio_device_cdev_close() should check df->access_granted before heading
   to other operations.
 - Update vfio.rst for iommufd/cdev

v3: https://lore.kernel.org/kvm/20230213151348.56451-1-yi.l.liu@intel.com/
 - Add r-b from Kevin on patch 03, 06, 07, 08.
 - Refine the group and cdev path exclusion. Remove vfio_device:single_open;
   add vfio_group::cdev_device_open_cnt to achieve exclusion between group
   path and cdev path (Kevin, Jason)
 - Fix a bug in the error handling path (Yan Zhao)
 - Address misc remarks from Kevin

v2: https://lore.kernel.org/kvm/20230206090532.95598-1-yi.l.liu@intel.com/
 - Add r-b from Kevin and Eric on patch 01 02 04.
 - Split "kvm/vfio: Provide struct kvm_device_ops::release() instead of ::destroy()"
   from this series; it got applied. (Alex, Kevin, Jason, Matthew)
 - Add kvm_ref_lock to protect vfio_device_file->kvm instead of reusing
   dev_set->lock, as a deadlock is observed with vfio-ap, which would try to
   acquire kvm_lock. This is the opposite lock order to kvm_device_release(),
   which takes kvm_lock first and then dev_set->lock. (Kevin)
 - Use a separate ioctl for detaching IOAS. (Alex)
 - Rename vfio_device_file::single_open to be is_cdev_device (Kevin, Alex)
 - Move the vfio device cdev code into device_cdev.c and add a VFIO_DEVICE_CDEV
   kconfig for it. (Kevin, Jason)

v1: https://lore.kernel.org/kvm/20230117134942.101112-1-yi.l.liu@intel.com/
 - Fix the circular refcount between kvm struct and device file reference. (JasonG)
 - Address comments from KevinT
 - Kept the ioctl for detach as is; pending Alex's preference
   (https://lore.kernel.org/kvm/BN9PR11MB5276BE9F4B0613EE859317028CFF9@BN9PR11MB5276.namprd11.prod.outlook.com/)

rfc: https://lore.kernel.org/kvm/20221219084718.9342-1-yi.l.liu@intel.com/

Thanks,
	Yi Liu

Yi Liu (24):
  vfio: Allocate per device file structure
  vfio: Refine vfio file kAPIs for KVM
  vfio: Accept vfio device file in the KVM facing kAPI
  kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device
    fd
  kvm/vfio: Accept vfio device file from userspace
  vfio: Pass struct vfio_device_file * to vfio_device_open/close()
  vfio: Block device access via device fd until device is opened
  vfio/pci: Update comment around group_fd get in
    vfio_pci_ioctl_pci_hot_reset()
  vfio/pci: Only need to check opened devices in the dev_set for hot
    reset
  vfio/pci: Rename the helpers and data in hot reset path to accept
    device fd
  vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  vfio/pci: Allow passing zero-length fd array in
    VFIO_DEVICE_PCI_HOT_RESET
  vfio/iommufd: Split the compat_ioas attach out from
    vfio_iommufd_bind()
  vfio: Add cdev_device_open_cnt to vfio_group
  vfio: Make vfio_device_open() single open for device cdev path
  vfio: Make vfio_device_first_open() to cover the noiommu mode in cdev
    path
  vfio-iommufd: Make vfio_iommufd_bind() selectively return devid
  vfio-iommufd: Add detach_ioas support for physical VFIO devices
  vfio-iommufd: Add detach_ioas support for emulated VFIO devices
  vfio: Add cdev for vfio_device
  vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  vfio: Add VFIO_DEVICE_AT[DE]TACH_IOMMUFD_PT
  vfio: Compile group optionally
  docs: vfio: Add vfio device cdev description

 Documentation/driver-api/vfio.rst             | 133 +++++++-
 Documentation/virt/kvm/devices/vfio.rst       |  52 ++-
 drivers/gpu/drm/i915/gvt/kvmgt.c              |   1 +
 drivers/iommu/iommufd/device.c                |   6 +
 drivers/s390/cio/vfio_ccw_ops.c               |   1 +
 drivers/s390/crypto/vfio_ap_ops.c             |   1 +
 drivers/vfio/Kconfig                          |  27 +-
 drivers/vfio/Makefile                         |   3 +-
 drivers/vfio/device_cdev.c                    | 313 ++++++++++++++++++
 drivers/vfio/fsl-mc/vfio_fsl_mc.c             |   1 +
 drivers/vfio/group.c                          | 192 +++++++----
 drivers/vfio/iommufd.c                        | 119 +++++--
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    |   2 +
 drivers/vfio/pci/mlx5/main.c                  |   1 +
 drivers/vfio/pci/vfio_pci.c                   |   1 +
 drivers/vfio/pci/vfio_pci_core.c              | 152 ++++++---
 drivers/vfio/platform/vfio_amba.c             |   1 +
 drivers/vfio/platform/vfio_platform.c         |   1 +
 drivers/vfio/vfio.h                           | 212 +++++++++++-
 drivers/vfio/vfio_main.c                      | 293 ++++++++++++++--
 include/linux/iommufd.h                       |   3 +
 include/linux/vfio.h                          |  38 ++-
 include/uapi/linux/kvm.h                      |  16 +-
 include/uapi/linux/vfio.h                     | 106 +++++-
 virt/kvm/vfio.c                               | 141 ++++----
 25 files changed, 1548 insertions(+), 268 deletions(-)
 create mode 100644 drivers/vfio/device_cdev.c

-- 
2.34.1



* [Intel-gfx] [PATCH v6 01/24] vfio: Allocate per device file structure
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 02/24] vfio: Refine vfio file kAPIs for KVM Yi Liu
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This is preparation for adding vfio device cdev support. vfio device
cdev requires:
1) Per-device-file memory to store the kvm pointer set by KVM. It will
   be propagated to vfio_device:kvm after the device cdev file is bound
   to an iommufd.
2) A mechanism to block device access through the device cdev fd before
   it is bound to an iommufd.

To address the above requirements, this adds a per-device-file structure
named vfio_device_file. For now, it is only a wrapper around a struct
vfio_device pointer. Other fields will be added to this per-file
structure in future commits.
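
For readers unfamiliar with the pattern, here is a rough userspace sketch
of what "per file wrapper" means: each open gets its own small structure
that points at the shared device, and the file's private data points at
that wrapper rather than at the device. All toy_* names are made up for
illustration and calloc()/free() stand in for kzalloc()/kfree(); this is
not the kernel code from the diff below.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; only the needed fields. */
struct toy_device { int open_count; };
struct toy_device_file { struct toy_device *device; }; /* one per open */
struct toy_file { void *private_data; };

/* Allocate the per-open wrapper; returns NULL if allocation fails. */
struct toy_device_file *toy_allocate_device_file(struct toy_device *device)
{
	struct toy_device_file *df = calloc(1, sizeof(*df));

	if (!df)
		return NULL;
	df->device = device;
	return df;
}

/* Open: the file carries the wrapper, not the device itself. */
int toy_device_open_file(struct toy_file *filep, struct toy_device *device)
{
	struct toy_device_file *df = toy_allocate_device_file(device);

	if (!df)
		return -1;
	device->open_count++;
	filep->private_data = df;
	return 0;
}

/* Release: reach the device through the wrapper, then free the wrapper. */
void toy_device_release_file(struct toy_file *filep)
{
	struct toy_device_file *df = filep->private_data;

	assert(df != NULL);
	df->device->open_count--;
	free(df);
	filep->private_data = NULL;
}
```

Two opens of the same toy_device get two distinct wrappers, which is what
later lets per-open state (such as a kvm pointer) live next to the device
pointer without touching the shared device.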

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/group.c     | 13 +++++++++++--
 drivers/vfio/vfio.h      |  6 ++++++
 drivers/vfio/vfio_main.c | 31 ++++++++++++++++++++++++++-----
 3 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 27d5ba7cf9dc..af2ef8006e1d 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -218,19 +218,26 @@ void vfio_device_group_close(struct vfio_device *device)
 
 static struct file *vfio_device_open_file(struct vfio_device *device)
 {
+	struct vfio_device_file *df;
 	struct file *filep;
 	int ret;
 
+	df = vfio_allocate_device_file(device);
+	if (IS_ERR(df)) {
+		ret = PTR_ERR(df);
+		goto err_out;
+	}
+
 	ret = vfio_device_group_open(device);
 	if (ret)
-		goto err_out;
+		goto err_free;
 
 	/*
 	 * We can't use anon_inode_getfd() because we need to modify
 	 * the f_mode flags directly to allow more than just ioctls
 	 */
 	filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
-				   device, O_RDWR);
+				   df, O_RDWR);
 	if (IS_ERR(filep)) {
 		ret = PTR_ERR(filep);
 		goto err_close_device;
@@ -254,6 +261,8 @@ static struct file *vfio_device_open_file(struct vfio_device *device)
 
 err_close_device:
 	vfio_device_group_close(device);
+err_free:
+	kfree(df);
 err_out:
 	return ERR_PTR(ret);
 }
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 7b19c621e0e6..87d3dd6b9ef9 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -16,11 +16,17 @@ struct iommufd_ctx;
 struct iommu_group;
 struct vfio_container;
 
+struct vfio_device_file {
+	struct vfio_device *device;
+};
+
 void vfio_device_put_registration(struct vfio_device *device);
 bool vfio_device_try_get_registration(struct vfio_device *device);
 int vfio_device_open(struct vfio_device *device, struct iommufd_ctx *iommufd);
 void vfio_device_close(struct vfio_device *device,
 		       struct iommufd_ctx *iommufd);
+struct vfio_device_file *
+vfio_allocate_device_file(struct vfio_device *device);
 
 extern const struct file_operations vfio_device_fops;
 
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 89497c933490..f0b9151e3ba7 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -404,6 +404,20 @@ static bool vfio_assert_device_open(struct vfio_device *device)
 	return !WARN_ON_ONCE(!READ_ONCE(device->open_count));
 }
 
+struct vfio_device_file *
+vfio_allocate_device_file(struct vfio_device *device)
+{
+	struct vfio_device_file *df;
+
+	df = kzalloc(sizeof(*df), GFP_KERNEL_ACCOUNT);
+	if (!df)
+		return ERR_PTR(-ENOMEM);
+
+	df->device = device;
+
+	return df;
+}
+
 static int vfio_device_first_open(struct vfio_device *device,
 				  struct iommufd_ctx *iommufd)
 {
@@ -517,12 +531,15 @@ static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
  */
 static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 {
-	struct vfio_device *device = filep->private_data;
+	struct vfio_device_file *df = filep->private_data;
+	struct vfio_device *device = df->device;
 
 	vfio_device_group_close(device);
 
 	vfio_device_put_registration(device);
 
+	kfree(df);
+
 	return 0;
 }
 
@@ -1087,7 +1104,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
-	struct vfio_device *device = filep->private_data;
+	struct vfio_device_file *df = filep->private_data;
+	struct vfio_device *device = df->device;
 	int ret;
 
 	ret = vfio_device_pm_runtime_get(device);
@@ -1114,7 +1132,8 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
 				     size_t count, loff_t *ppos)
 {
-	struct vfio_device *device = filep->private_data;
+	struct vfio_device_file *df = filep->private_data;
+	struct vfio_device *device = df->device;
 
 	if (unlikely(!device->ops->read))
 		return -EINVAL;
@@ -1126,7 +1145,8 @@ static ssize_t vfio_device_fops_write(struct file *filep,
 				      const char __user *buf,
 				      size_t count, loff_t *ppos)
 {
-	struct vfio_device *device = filep->private_data;
+	struct vfio_device_file *df = filep->private_data;
+	struct vfio_device *device = df->device;
 
 	if (unlikely(!device->ops->write))
 		return -EINVAL;
@@ -1136,7 +1156,8 @@ static ssize_t vfio_device_fops_write(struct file *filep,
 
 static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 {
-	struct vfio_device *device = filep->private_data;
+	struct vfio_device_file *df = filep->private_data;
+	struct vfio_device *device = df->device;
 
 	if (unlikely(!device->ops->mmap))
 		return -EINVAL;
-- 
2.34.1



* [Intel-gfx] [PATCH v6 02/24] vfio: Refine vfio file kAPIs for KVM
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 01/24] vfio: Allocate per device file structure Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 03/24] vfio: Accept vfio device file in the KVM facing kAPI Yi Liu
                   ` (23 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This prepares the kAPIs below to accept both group files and device
files, instead of only vfio group files.

  bool vfio_file_enforced_coherent(struct file *file);
  void vfio_file_set_kvm(struct file *file, struct kvm *kvm);

Besides the above change, vfio_file_is_valid() is added to check whether
a given file is a valid vfio file. It will be extended later to check
both vfio group files and vfio device files. vfio_file_is_group() is
kept for the VFIO PCI hot reset path.
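
As a rough illustration of the helper shape this patch introduces, the
sketch below models "check the file's ops table before trusting its
private data" in plain userspace C. The toy_* types and names are
assumptions made for the example; only the control flow (NULL for a
foreign file, default of "coherent" for a file that is not a group) is
taken from the diff below.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file_operations and struct file. */
struct toy_file_ops { const char *name; };
struct toy_file {
	const struct toy_file_ops *f_op;
	void *private_data;
};
struct toy_group { bool coherent; };

/* The one ops table that identifies a group file in this model. */
const struct toy_file_ops toy_group_fops = { "toy-group" };

/* Return the group only if the file really is a group file. */
struct toy_group *toy_group_from_file(struct toy_file *file)
{
	if (file->f_op != &toy_group_fops)
		return NULL;
	return file->private_data;
}

/* For now a "valid" file is just a group file, as in this patch. */
bool toy_file_is_valid(struct toy_file *file)
{
	return toy_group_from_file(file) != NULL;
}

/* Files that are not groups fall through to the default of true. */
bool toy_file_enforced_coherent(struct toy_file *file)
{
	struct toy_group *group = toy_group_from_file(file);

	if (group)
		return group->coherent;
	return true;
}
```

Routing every caller through one "from file" helper means a later patch
only has to add a second lookup for device files instead of touching
each kAPI's type check.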

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/group.c     | 57 +++++++++++++++-------------------------
 drivers/vfio/vfio.h      |  3 +++
 drivers/vfio/vfio_main.c | 45 +++++++++++++++++++++++++++++++
 include/linux/vfio.h     |  1 +
 virt/kvm/vfio.c          | 10 +++----
 5 files changed, 75 insertions(+), 41 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index af2ef8006e1d..a293e1f92e89 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -754,6 +754,15 @@ bool vfio_device_has_container(struct vfio_device *device)
 	return device->group->container;
 }
 
+struct vfio_group *vfio_group_from_file(struct file *file)
+{
+	struct vfio_group *group = file->private_data;
+
+	if (file->f_op != &vfio_group_fops)
+		return NULL;
+	return group;
+}
+
 /**
  * vfio_file_iommu_group - Return the struct iommu_group for the vfio group file
  * @file: VFIO group file
@@ -764,13 +773,13 @@ bool vfio_device_has_container(struct vfio_device *device)
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file)
 {
-	struct vfio_group *group = file->private_data;
+	struct vfio_group *group = vfio_group_from_file(file);
 	struct iommu_group *iommu_group = NULL;
 
 	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
 		return NULL;
 
-	if (!vfio_file_is_group(file))
+	if (!group)
 		return NULL;
 
 	mutex_lock(&group->group_lock);
@@ -784,33 +793,20 @@ struct iommu_group *vfio_file_iommu_group(struct file *file)
 EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
 
 /**
- * vfio_file_is_group - True if the file is usable with VFIO aPIS
+ * vfio_file_is_group - True if the file is a vfio group file
  * @file: VFIO group file
  */
 bool vfio_file_is_group(struct file *file)
 {
-	return file->f_op == &vfio_group_fops;
+	return vfio_group_from_file(file);
 }
 EXPORT_SYMBOL_GPL(vfio_file_is_group);
 
-/**
- * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
- *        is always CPU cache coherent
- * @file: VFIO group file
- *
- * Enforced coherency means that the IOMMU ignores things like the PCIe no-snoop
- * bit in DMA transactions. A return of false indicates that the user has
- * rights to access additional instructions such as wbinvd on x86.
- */
-bool vfio_file_enforced_coherent(struct file *file)
+bool vfio_group_enforced_coherent(struct vfio_group *group)
 {
-	struct vfio_group *group = file->private_data;
 	struct vfio_device *device;
 	bool ret = true;
 
-	if (!vfio_file_is_group(file))
-		return true;
-
 	/*
 	 * If the device does not have IOMMU_CAP_ENFORCE_CACHE_COHERENCY then
 	 * any domain later attached to it will also not support it. If the cap
@@ -828,28 +824,17 @@ bool vfio_file_enforced_coherent(struct file *file)
 	mutex_unlock(&group->device_lock);
 	return ret;
 }
-EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
 
-/**
- * vfio_file_set_kvm - Link a kvm with VFIO drivers
- * @file: VFIO group file
- * @kvm: KVM to link
- *
- * When a VFIO device is first opened the KVM will be available in
- * device->kvm if one was associated with the group.
- */
-void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
+void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
 {
-	struct vfio_group *group = file->private_data;
-
-	if (!vfio_file_is_group(file))
-		return;
-
+	/*
+	 * When a VFIO device is first opened the KVM will be available in
+	 * device->kvm if one was associated with the group.
+	 */
 	spin_lock(&group->kvm_ref_lock);
 	group->kvm = kvm;
 	spin_unlock(&group->kvm_ref_lock);
 }
-EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
 
 /**
  * vfio_file_has_dev - True if the VFIO file is a handle for device
@@ -860,9 +845,9 @@ EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
  */
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
 {
-	struct vfio_group *group = file->private_data;
+	struct vfio_group *group = vfio_group_from_file(file);
 
-	if (!vfio_file_is_group(file))
+	if (!group)
 		return false;
 
 	return group == device->group;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 87d3dd6b9ef9..b1e327a85a32 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -90,6 +90,9 @@ void vfio_device_group_unregister(struct vfio_device *device);
 int vfio_device_group_use_iommu(struct vfio_device *device);
 void vfio_device_group_unuse_iommu(struct vfio_device *device);
 void vfio_device_group_close(struct vfio_device *device);
+struct vfio_group *vfio_group_from_file(struct file *file);
+bool vfio_group_enforced_coherent(struct vfio_group *group);
+void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
 bool vfio_device_has_container(struct vfio_device *device);
 int __init vfio_group_init(void);
 void vfio_group_cleanup(void);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index f0b9151e3ba7..edadfac9be49 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1175,6 +1175,51 @@ const struct file_operations vfio_device_fops = {
 	.mmap		= vfio_device_fops_mmap,
 };
 
+/**
+ * vfio_file_is_valid - True if the file is valid vfio file
+ * @file: VFIO group file or VFIO device file
+ */
+bool vfio_file_is_valid(struct file *file)
+{
+	return vfio_group_from_file(file);
+}
+EXPORT_SYMBOL_GPL(vfio_file_is_valid);
+
+/**
+ * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
+ *        is always CPU cache coherent
+ * @file: VFIO group file or VFIO device file
+ *
+ * Enforced coherency means that the IOMMU ignores things like the PCIe no-snoop
+ * bit in DMA transactions. A return of false indicates that the user has
+ * rights to access additional instructions such as wbinvd on x86.
+ */
+bool vfio_file_enforced_coherent(struct file *file)
+{
+	struct vfio_group *group = vfio_group_from_file(file);
+
+	if (group)
+		return vfio_group_enforced_coherent(group);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
+
+/**
+ * vfio_file_set_kvm - Link a kvm with VFIO drivers
+ * @file: VFIO group file or VFIO device file
+ * @kvm: KVM to link
+ *
+ */
+void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
+{
+	struct vfio_group *group = vfio_group_from_file(file);
+
+	if (group)
+		vfio_group_set_kvm(group, kvm);
+}
+EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
+
 /*
  * Sub-module support
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 3188d8a374bd..b14dcdd0b71f 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -245,6 +245,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file);
 bool vfio_file_is_group(struct file *file);
+bool vfio_file_is_valid(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 9584eb57e0ed..8bac308ba630 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -64,18 +64,18 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
 	return ret;
 }
 
-static bool kvm_vfio_file_is_group(struct file *file)
+static bool kvm_vfio_file_is_valid(struct file *file)
 {
 	bool (*fn)(struct file *file);
 	bool ret;
 
-	fn = symbol_get(vfio_file_is_group);
+	fn = symbol_get(vfio_file_is_valid);
 	if (!fn)
 		return false;
 
 	ret = fn(file);
 
-	symbol_put(vfio_file_is_group);
+	symbol_put(vfio_file_is_valid);
 
 	return ret;
 }
@@ -154,8 +154,8 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 	if (!filp)
 		return -EBADF;
 
-	/* Ensure the FD is a vfio group FD.*/
-	if (!kvm_vfio_file_is_group(filp)) {
+	/* Ensure the FD is a vfio FD.*/
+	if (!kvm_vfio_file_is_valid(filp)) {
 		ret = -EINVAL;
 		goto err_fput;
 	}
-- 
2.34.1



* [Intel-gfx] [PATCH v6 03/24] vfio: Accept vfio device file in the KVM facing kAPI
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 01/24] vfio: Allocate per device file structure Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 02/24] vfio: Refine vfio file kAPIs for KVM Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 04/24] kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device fd Yi Liu
                   ` (22 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This makes the vfio file kAPIs accept vfio device files, which is also
preparation for vfio device cdev support.

When KVM is set through a vfio device file, the kvm pointer is stored in
struct vfio_device_file, and kvm_ref_lock protects both setting the kvm
pointer and its usage within VFIO. This kvm pointer will be propagated to
vfio_device after the device file is bound to an iommufd in the cdev path.
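
The two-step handling of the kvm pointer can be pictured with the toy
model below. It is only a sketch under stated assumptions: a pthread
mutex stands in for the kernel spinlock, every toy_* name is
hypothetical, and toy_device_file_bind() merely mimics the propagation
step that the real bind path is expected to perform in a later patch of
this series.

```c
#include <pthread.h>
#include <stddef.h>

struct toy_kvm { int id; };
struct toy_device { struct toy_kvm *kvm; };

struct toy_device_file {
	struct toy_device *device;
	pthread_mutex_t kvm_ref_lock;	/* stand-in for spinlock_t */
	struct toy_kvm *kvm;		/* recorded per open file */
};

void toy_device_file_init(struct toy_device_file *df,
			  struct toy_device *device)
{
	df->device = device;
	df->kvm = NULL;
	pthread_mutex_init(&df->kvm_ref_lock, NULL);
}

/* Step 1: KVM hands over its pointer; only the file records it. */
void toy_device_file_set_kvm(struct toy_device_file *df,
			     struct toy_kvm *kvm)
{
	pthread_mutex_lock(&df->kvm_ref_lock);
	df->kvm = kvm;
	pthread_mutex_unlock(&df->kvm_ref_lock);
}

/* Step 2 (hypothetical bind): copy the recorded pointer to the device. */
void toy_device_file_bind(struct toy_device_file *df)
{
	pthread_mutex_lock(&df->kvm_ref_lock);
	df->device->kvm = df->kvm;
	pthread_mutex_unlock(&df->kvm_ref_lock);
}
```

Until the bind step runs, the device's own kvm field stays unset, so code
that only looks at the device cannot observe a kvm pointer for a file
that has not been bound yet.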

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/vfio.h      |  2 ++
 drivers/vfio/vfio_main.c | 42 +++++++++++++++++++++++++++++++++++++---
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index b1e327a85a32..69e1a0692b06 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -18,6 +18,8 @@ struct vfio_container;
 
 struct vfio_device_file {
 	struct vfio_device *device;
+	spinlock_t kvm_ref_lock; /* protect kvm field */
+	struct kvm *kvm;
 };
 
 void vfio_device_put_registration(struct vfio_device *device);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index edadfac9be49..03d5b2979f79 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -414,6 +414,7 @@ vfio_allocate_device_file(struct vfio_device *device)
 		return ERR_PTR(-ENOMEM);
 
 	df->device = device;
+	spin_lock_init(&df->kvm_ref_lock);
 
 	return df;
 }
@@ -1175,13 +1176,23 @@ const struct file_operations vfio_device_fops = {
 	.mmap		= vfio_device_fops_mmap,
 };
 
+static struct vfio_device *vfio_device_from_file(struct file *file)
+{
+	struct vfio_device_file *df = file->private_data;
+
+	if (file->f_op != &vfio_device_fops)
+		return NULL;
+	return df->device;
+}
+
 /**
  * vfio_file_is_valid - True if the file is valid vfio file
  * @file: VFIO group file or VFIO device file
  */
 bool vfio_file_is_valid(struct file *file)
 {
-	return vfio_group_from_file(file);
+	return vfio_group_from_file(file) ||
+	       vfio_device_from_file(file);
 }
 EXPORT_SYMBOL_GPL(vfio_file_is_valid);
 
@@ -1196,15 +1207,36 @@ EXPORT_SYMBOL_GPL(vfio_file_is_valid);
  */
 bool vfio_file_enforced_coherent(struct file *file)
 {
-	struct vfio_group *group = vfio_group_from_file(file);
+	struct vfio_group *group;
+	struct vfio_device *device;
 
+	group = vfio_group_from_file(file);
 	if (group)
 		return vfio_group_enforced_coherent(group);
 
+	device = vfio_device_from_file(file);
+	if (device)
+		return device_iommu_capable(device->dev,
+					    IOMMU_CAP_ENFORCE_CACHE_COHERENCY);
+
 	return true;
 }
 EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
 
+static void vfio_device_file_set_kvm(struct file *file, struct kvm *kvm)
+{
+	struct vfio_device_file *df = file->private_data;
+
+	/*
+	 * The kvm is first recorded in the vfio_device_file, and will
+	 * be propagated to vfio_device::kvm when the file is bound to
+	 * iommufd successfully in the vfio device cdev path.
+	 */
+	spin_lock(&df->kvm_ref_lock);
+	df->kvm = kvm;
+	spin_unlock(&df->kvm_ref_lock);
+}
+
 /**
  * vfio_file_set_kvm - Link a kvm with VFIO drivers
  * @file: VFIO group file or VFIO device file
@@ -1213,10 +1245,14 @@ EXPORT_SYMBOL_GPL(vfio_file_enforced_coherent);
  */
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 {
-	struct vfio_group *group = vfio_group_from_file(file);
+	struct vfio_group *group;
 
+	group = vfio_group_from_file(file);
 	if (group)
 		vfio_group_set_kvm(group, kvm);
+
+	if (vfio_device_from_file(file))
+		vfio_device_file_set_kvm(file, kvm);
 }
 EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
 
-- 
2.34.1



* [Intel-gfx] [PATCH v6 04/24] kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device fd
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (2 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 03/24] vfio: Accept vfio device file in the KVM facing kAPI Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace Yi Liu
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

Meanwhile, rename related helpers. No functional change is intended.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 virt/kvm/vfio.c | 115 ++++++++++++++++++++++++------------------------
 1 file changed, 58 insertions(+), 57 deletions(-)

diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 8bac308ba630..857d6ba349e1 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -21,7 +21,7 @@
 #include <asm/kvm_ppc.h>
 #endif
 
-struct kvm_vfio_group {
+struct kvm_vfio_file {
 	struct list_head node;
 	struct file *file;
 #ifdef CONFIG_SPAPR_TCE_IOMMU
@@ -30,7 +30,7 @@ struct kvm_vfio_group {
 };
 
 struct kvm_vfio {
-	struct list_head group_list;
+	struct list_head file_list;
 	struct mutex lock;
 	bool noncoherent;
 };
@@ -98,34 +98,35 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 }
 
 static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
-					     struct kvm_vfio_group *kvg)
+					     struct kvm_vfio_file *kvf)
 {
-	if (WARN_ON_ONCE(!kvg->iommu_group))
+	if (WARN_ON_ONCE(!kvf->iommu_group))
 		return;
 
-	kvm_spapr_tce_release_iommu_group(kvm, kvg->iommu_group);
-	iommu_group_put(kvg->iommu_group);
-	kvg->iommu_group = NULL;
+	kvm_spapr_tce_release_iommu_group(kvm, kvf->iommu_group);
+	iommu_group_put(kvf->iommu_group);
+	kvf->iommu_group = NULL;
 }
 #endif
 
 /*
- * Groups can use the same or different IOMMU domains.  If the same then
- * adding a new group may change the coherency of groups we've previously
- * been told about.  We don't want to care about any of that so we retest
- * each group and bail as soon as we find one that's noncoherent.  This
- * means we only ever [un]register_noncoherent_dma once for the whole device.
+ * Groups/devices can use the same or different IOMMU domains. If the same
+ * then adding a new group/device may change the coherency of groups/devices
+ * we've previously been told about. We don't want to care about any of
+ * that so we retest each group/device and bail as soon as we find one that's
+ * noncoherent.  This means we only ever [un]register_noncoherent_dma once
+ * for the whole device.
  */
 static void kvm_vfio_update_coherency(struct kvm_device *dev)
 {
 	struct kvm_vfio *kv = dev->private;
 	bool noncoherent = false;
-	struct kvm_vfio_group *kvg;
+	struct kvm_vfio_file *kvf;
 
 	mutex_lock(&kv->lock);
 
-	list_for_each_entry(kvg, &kv->group_list, node) {
-		if (!kvm_vfio_file_enforced_coherent(kvg->file)) {
+	list_for_each_entry(kvf, &kv->file_list, node) {
+		if (!kvm_vfio_file_enforced_coherent(kvf->file)) {
 			noncoherent = true;
 			break;
 		}
@@ -143,10 +144,10 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
 	mutex_unlock(&kv->lock);
 }
 
-static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
+static int kvm_vfio_file_add(struct kvm_device *dev, unsigned int fd)
 {
 	struct kvm_vfio *kv = dev->private;
-	struct kvm_vfio_group *kvg;
+	struct kvm_vfio_file *kvf;
 	struct file *filp;
 	int ret;
 
@@ -162,27 +163,27 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 
 	mutex_lock(&kv->lock);
 
-	list_for_each_entry(kvg, &kv->group_list, node) {
-		if (kvg->file == filp) {
+	list_for_each_entry(kvf, &kv->file_list, node) {
+		if (kvf->file == filp) {
 			ret = -EEXIST;
 			goto err_unlock;
 		}
 	}
 
-	kvg = kzalloc(sizeof(*kvg), GFP_KERNEL_ACCOUNT);
-	if (!kvg) {
+	kvf = kzalloc(sizeof(*kvf), GFP_KERNEL_ACCOUNT);
+	if (!kvf) {
 		ret = -ENOMEM;
 		goto err_unlock;
 	}
 
-	kvg->file = filp;
-	list_add_tail(&kvg->node, &kv->group_list);
+	kvf->file = filp;
+	list_add_tail(&kvf->node, &kv->file_list);
 
 	kvm_arch_start_assignment(dev->kvm);
 
 	mutex_unlock(&kv->lock);
 
-	kvm_vfio_file_set_kvm(kvg->file, dev->kvm);
+	kvm_vfio_file_set_kvm(kvf->file, dev->kvm);
 	kvm_vfio_update_coherency(dev);
 
 	return 0;
@@ -193,10 +194,10 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 	return ret;
 }
 
-static int kvm_vfio_group_del(struct kvm_device *dev, unsigned int fd)
+static int kvm_vfio_file_del(struct kvm_device *dev, unsigned int fd)
 {
 	struct kvm_vfio *kv = dev->private;
-	struct kvm_vfio_group *kvg;
+	struct kvm_vfio_file *kvf;
 	struct fd f;
 	int ret;
 
@@ -208,18 +209,18 @@ static int kvm_vfio_group_del(struct kvm_device *dev, unsigned int fd)
 
 	mutex_lock(&kv->lock);
 
-	list_for_each_entry(kvg, &kv->group_list, node) {
-		if (kvg->file != f.file)
+	list_for_each_entry(kvf, &kv->file_list, node) {
+		if (kvf->file != f.file)
 			continue;
 
-		list_del(&kvg->node);
+		list_del(&kvf->node);
 		kvm_arch_end_assignment(dev->kvm);
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-		kvm_spapr_tce_release_vfio_group(dev->kvm, kvg);
+		kvm_spapr_tce_release_vfio_group(dev->kvm, kvf);
 #endif
-		kvm_vfio_file_set_kvm(kvg->file, NULL);
-		fput(kvg->file);
-		kfree(kvg);
+		kvm_vfio_file_set_kvm(kvf->file, NULL);
+		fput(kvf->file);
+		kfree(kvf);
 		ret = 0;
 		break;
 	}
@@ -234,12 +235,12 @@ static int kvm_vfio_group_del(struct kvm_device *dev, unsigned int fd)
 }
 
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-static int kvm_vfio_group_set_spapr_tce(struct kvm_device *dev,
-					void __user *arg)
+static int kvm_vfio_file_set_spapr_tce(struct kvm_device *dev,
+				       void __user *arg)
 {
 	struct kvm_vfio_spapr_tce param;
 	struct kvm_vfio *kv = dev->private;
-	struct kvm_vfio_group *kvg;
+	struct kvm_vfio_file *kvf;
 	struct fd f;
 	int ret;
 
@@ -254,20 +255,20 @@ static int kvm_vfio_group_set_spapr_tce(struct kvm_device *dev,
 
 	mutex_lock(&kv->lock);
 
-	list_for_each_entry(kvg, &kv->group_list, node) {
-		if (kvg->file != f.file)
+	list_for_each_entry(kvf, &kv->file_list, node) {
+		if (kvf->file != f.file)
 			continue;
 
-		if (!kvg->iommu_group) {
-			kvg->iommu_group = kvm_vfio_file_iommu_group(kvg->file);
-			if (WARN_ON_ONCE(!kvg->iommu_group)) {
+		if (!kvf->iommu_group) {
+			kvf->iommu_group = kvm_vfio_file_iommu_group(kvf->file);
+			if (WARN_ON_ONCE(!kvf->iommu_group)) {
 				ret = -EIO;
 				goto err_fdput;
 			}
 		}
 
 		ret = kvm_spapr_tce_attach_iommu_group(dev->kvm, param.tablefd,
-						       kvg->iommu_group);
+						       kvf->iommu_group);
 		break;
 	}
 
@@ -278,8 +279,8 @@ static int kvm_vfio_group_set_spapr_tce(struct kvm_device *dev,
 }
 #endif
 
-static int kvm_vfio_set_group(struct kvm_device *dev, long attr,
-			      void __user *arg)
+static int kvm_vfio_set_file(struct kvm_device *dev, long attr,
+			     void __user *arg)
 {
 	int32_t __user *argp = arg;
 	int32_t fd;
@@ -288,16 +289,16 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr,
 	case KVM_DEV_VFIO_GROUP_ADD:
 		if (get_user(fd, argp))
 			return -EFAULT;
-		return kvm_vfio_group_add(dev, fd);
+		return kvm_vfio_file_add(dev, fd);
 
 	case KVM_DEV_VFIO_GROUP_DEL:
 		if (get_user(fd, argp))
 			return -EFAULT;
-		return kvm_vfio_group_del(dev, fd);
+		return kvm_vfio_file_del(dev, fd);
 
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
-		return kvm_vfio_group_set_spapr_tce(dev, arg);
+		return kvm_vfio_file_set_spapr_tce(dev, arg);
 #endif
 	}
 
@@ -309,8 +310,8 @@ static int kvm_vfio_set_attr(struct kvm_device *dev,
 {
 	switch (attr->group) {
 	case KVM_DEV_VFIO_GROUP:
-		return kvm_vfio_set_group(dev, attr->attr,
-					  u64_to_user_ptr(attr->addr));
+		return kvm_vfio_set_file(dev, attr->attr,
+					 u64_to_user_ptr(attr->addr));
 	}
 
 	return -ENXIO;
@@ -339,16 +340,16 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 static void kvm_vfio_release(struct kvm_device *dev)
 {
 	struct kvm_vfio *kv = dev->private;
-	struct kvm_vfio_group *kvg, *tmp;
+	struct kvm_vfio_file *kvf, *tmp;
 
-	list_for_each_entry_safe(kvg, tmp, &kv->group_list, node) {
+	list_for_each_entry_safe(kvf, tmp, &kv->file_list, node) {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-		kvm_spapr_tce_release_vfio_group(dev->kvm, kvg);
+		kvm_spapr_tce_release_vfio_group(dev->kvm, kvf);
 #endif
-		kvm_vfio_file_set_kvm(kvg->file, NULL);
-		fput(kvg->file);
-		list_del(&kvg->node);
-		kfree(kvg);
+		kvm_vfio_file_set_kvm(kvf->file, NULL);
+		fput(kvf->file);
+		list_del(&kvf->node);
+		kfree(kvf);
 		kvm_arch_end_assignment(dev->kvm);
 	}
 
@@ -382,7 +383,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type)
 	if (!kv)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&kv->group_list);
+	INIT_LIST_HEAD(&kv->file_list);
 	mutex_init(&kv->lock);
 
 	dev->private = kv;
-- 
2.34.1


* [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (3 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 04/24] kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device fd Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-22 14:10   ` Xu Yilun
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 06/24] vfio: Pass struct vfio_device_file * to vfio_device_open/close() Yi Liu
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This defines the KVM_DEV_VFIO_FILE* attributes and makes the existing
KVM_DEV_VFIO_GROUP* names aliases of them, so old userspace that uses
KVM_DEV_VFIO_GROUP* keeps working.
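
As an illustration only (not part of the patch), the aliasing can be
sketched in plain userspace C. The macro values are copied from the uapi
hunk of this patch; attr_name() is a hypothetical helper that stands in
for the switch in kvm_vfio_set_file() and dispatches on the new names only.

```c
#include <assert.h>
#include <string.h>

/* Values copied from the include/uapi/linux/kvm.h hunk of this patch. */
#define KVM_DEV_VFIO_FILE			1
#define KVM_DEV_VFIO_FILE_ADD			1
#define KVM_DEV_VFIO_FILE_DEL			2
#define KVM_DEV_VFIO_FILE_SET_SPAPR_TCE		3

/* Old names kept as compile-time aliases of the new ones. */
#define KVM_DEV_VFIO_GROUP			KVM_DEV_VFIO_FILE
#define KVM_DEV_VFIO_GROUP_ADD			KVM_DEV_VFIO_FILE_ADD
#define KVM_DEV_VFIO_GROUP_DEL			KVM_DEV_VFIO_FILE_DEL
#define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE	KVM_DEV_VFIO_FILE_SET_SPAPR_TCE

/* Old and new names must be interchangeable at compile time. */
static_assert(KVM_DEV_VFIO_GROUP == KVM_DEV_VFIO_FILE, "group alias");
static_assert(KVM_DEV_VFIO_GROUP_ADD == KVM_DEV_VFIO_FILE_ADD, "add alias");
static_assert(KVM_DEV_VFIO_GROUP_DEL == KVM_DEV_VFIO_FILE_DEL, "del alias");

/* Hypothetical helper: a dispatcher written against the new names. */
static const char *attr_name(long attr)
{
	switch (attr) {
	case KVM_DEV_VFIO_FILE_ADD:
		return "add";
	case KVM_DEV_VFIO_FILE_DEL:
		return "del";
	case KVM_DEV_VFIO_FILE_SET_SPAPR_TCE:
		return "set_spapr_tce";
	}
	return "unknown";
}
```

A caller built against the old names reaches the same cases, since the
preprocessor resolves both spellings to the same integer.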

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 Documentation/virt/kvm/devices/vfio.rst | 52 +++++++++++++++++--------
 include/uapi/linux/kvm.h                | 16 ++++++--
 virt/kvm/vfio.c                         | 16 ++++----
 3 files changed, 55 insertions(+), 29 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vfio.rst b/Documentation/virt/kvm/devices/vfio.rst
index 79b6811bb4f3..5b05b48abaab 100644
--- a/Documentation/virt/kvm/devices/vfio.rst
+++ b/Documentation/virt/kvm/devices/vfio.rst
@@ -9,24 +9,37 @@ Device types supported:
   - KVM_DEV_TYPE_VFIO
 
 Only one VFIO instance may be created per VM.  The created device
-tracks VFIO groups in use by the VM and features of those groups
-important to the correctness and acceleration of the VM.  As groups
-are enabled and disabled for use by the VM, KVM should be updated
-about their presence.  When registered with KVM, a reference to the
-VFIO-group is held by KVM.
+tracks VFIO files (group or device) in use by the VM and features
+of those groups/devices important to the correctness and acceleration
+of the VM.  As groups/devices are enabled and disabled for use by the
+VM, KVM should be updated about their presence.  When registered with
+KVM, a reference to the VFIO file is held by KVM.
 
 Groups:
-  KVM_DEV_VFIO_GROUP
-
-KVM_DEV_VFIO_GROUP attributes:
-  KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
-	kvm_device_attr.addr points to an int32_t file descriptor
-	for the VFIO group.
-  KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
-	kvm_device_attr.addr points to an int32_t file descriptor
-	for the VFIO group.
-  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+  KVM_DEV_VFIO_FILE
+	alias: KVM_DEV_VFIO_GROUP
+
+KVM_DEV_VFIO_FILE attributes:
+  KVM_DEV_VFIO_FILE_ADD: Add a VFIO file (group/device) to VFIO-KVM device
+	tracking
+
+	alias: KVM_DEV_VFIO_GROUP_ADD
+
+	kvm_device_attr.addr points to an int32_t file descriptor for the
+	VFIO file.
+  KVM_DEV_VFIO_FILE_DEL: Remove a VFIO file (group/device) from VFIO-KVM
+	device tracking
+
+	alias: KVM_DEV_VFIO_GROUP_DEL
+
+	kvm_device_attr.addr points to an int32_t file descriptor for the
+	VFIO file.
+
+  KVM_DEV_VFIO_FILE_SET_SPAPR_TCE: attaches a guest visible TCE table
 	allocated by sPAPR KVM.
+
+	alias: KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
+
 	kvm_device_attr.addr points to a struct::
 
 		struct kvm_vfio_spapr_tce {
@@ -40,9 +53,14 @@ KVM_DEV_VFIO_GROUP attributes:
 	- @tablefd is a file descriptor for a TCE table allocated via
 	  KVM_CREATE_SPAPR_TCE.
 
+	only accepts vfio group file as SPAPR has no iommufd support
+
 ::
 
-The GROUP_ADD operation above should be invoked prior to accessing the
+The FILE/GROUP_ADD operation above should be invoked prior to accessing the
 device file descriptor via VFIO_GROUP_GET_DEVICE_FD in order to support
 drivers which require a kvm pointer to be set in their .open_device()
-callback.
+callback.  It is the same for device file descriptor via character device
+open which gets device access via VFIO_DEVICE_BIND_IOMMUFD.  For such file
+descriptors, FILE_ADD should be invoked before VFIO_DEVICE_BIND_IOMMUFD
+to support the drivers mentioned in the prior sentence as well.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d77aef872a0a..a8eeca70a498 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1410,10 +1410,18 @@ struct kvm_device_attr {
 	__u64	addr;		/* userspace address of attr data */
 };
 
-#define  KVM_DEV_VFIO_GROUP			1
-#define   KVM_DEV_VFIO_GROUP_ADD			1
-#define   KVM_DEV_VFIO_GROUP_DEL			2
-#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
+#define  KVM_DEV_VFIO_FILE	1
+
+#define   KVM_DEV_VFIO_FILE_ADD			1
+#define   KVM_DEV_VFIO_FILE_DEL			2
+#define   KVM_DEV_VFIO_FILE_SET_SPAPR_TCE	3
+
+/* KVM_DEV_VFIO_GROUP aliases are for compile time uapi compatibility */
+#define  KVM_DEV_VFIO_GROUP	KVM_DEV_VFIO_FILE
+
+#define   KVM_DEV_VFIO_GROUP_ADD	KVM_DEV_VFIO_FILE_ADD
+#define   KVM_DEV_VFIO_GROUP_DEL	KVM_DEV_VFIO_FILE_DEL
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE	KVM_DEV_VFIO_FILE_SET_SPAPR_TCE
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 857d6ba349e1..d869913baafd 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -286,18 +286,18 @@ static int kvm_vfio_set_file(struct kvm_device *dev, long attr,
 	int32_t fd;
 
 	switch (attr) {
-	case KVM_DEV_VFIO_GROUP_ADD:
+	case KVM_DEV_VFIO_FILE_ADD:
 		if (get_user(fd, argp))
 			return -EFAULT;
 		return kvm_vfio_file_add(dev, fd);
 
-	case KVM_DEV_VFIO_GROUP_DEL:
+	case KVM_DEV_VFIO_FILE_DEL:
 		if (get_user(fd, argp))
 			return -EFAULT;
 		return kvm_vfio_file_del(dev, fd);
 
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-	case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+	case KVM_DEV_VFIO_FILE_SET_SPAPR_TCE:
 		return kvm_vfio_file_set_spapr_tce(dev, arg);
 #endif
 	}
@@ -309,7 +309,7 @@ static int kvm_vfio_set_attr(struct kvm_device *dev,
 			     struct kvm_device_attr *attr)
 {
 	switch (attr->group) {
-	case KVM_DEV_VFIO_GROUP:
+	case KVM_DEV_VFIO_FILE:
 		return kvm_vfio_set_file(dev, attr->attr,
 					 u64_to_user_ptr(attr->addr));
 	}
@@ -321,12 +321,12 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 			     struct kvm_device_attr *attr)
 {
 	switch (attr->group) {
-	case KVM_DEV_VFIO_GROUP:
+	case KVM_DEV_VFIO_FILE:
 		switch (attr->attr) {
-		case KVM_DEV_VFIO_GROUP_ADD:
-		case KVM_DEV_VFIO_GROUP_DEL:
+		case KVM_DEV_VFIO_FILE_ADD:
+		case KVM_DEV_VFIO_FILE_DEL:
 #ifdef CONFIG_SPAPR_TCE_IOMMU
-		case KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE:
+		case KVM_DEV_VFIO_FILE_SET_SPAPR_TCE:
 #endif
 			return 0;
 		}
-- 
2.34.1


* [Intel-gfx] [PATCH v6 06/24] vfio: Pass struct vfio_device_file * to vfio_device_open/close()
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (4 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 07/24] vfio: Block device access via device fd until device is opened Yi Liu
                   ` (19 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This avoids passing too many parameters through multiple functions.
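
A minimal userspace sketch of the idea, for illustration only: the
per-file structure carries both the device and the iommufd context, so
the open/close helpers take a single argument. All *_model names below
are hypothetical stand-ins, not the kernel structures, and the locking
done by the real code is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct iommufd_ctx. */
struct iommufd_ctx_model {
	int id;
};

/* Hypothetical stand-in for struct vfio_device. */
struct device_model {
	unsigned int open_count;
	int first_open_calls;	/* counts "first open" work */
	int last_close_calls;	/* counts "last close" work */
};

/* Hypothetical stand-in for struct vfio_device_file. */
struct device_file_model {
	struct device_model *device;
	struct iommufd_ctx_model *iommufd;
};

/* Single-argument open: everything needed is reachable from df. */
static int device_open(struct device_file_model *df)
{
	struct device_model *device = df->device;

	device->open_count++;
	if (device->open_count == 1)
		device->first_open_calls++;	/* would consume df->iommufd */
	return 0;
}

/* Single-argument close: mirrors device_open(). */
static void device_close(struct device_file_model *df)
{
	struct device_model *device = df->device;

	assert(device->open_count > 0);
	if (device->open_count == 1)
		device->last_close_calls++;
	device->open_count--;
}
```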

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/group.c     | 19 +++++++++++++------
 drivers/vfio/vfio.h      |  8 ++++----
 drivers/vfio/vfio_main.c | 25 +++++++++++++++----------
 3 files changed, 32 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index a293e1f92e89..160a4c891dda 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -169,8 +169,9 @@ static void vfio_device_group_get_kvm_safe(struct vfio_device *device)
 	spin_unlock(&device->group->kvm_ref_lock);
 }
 
-static int vfio_device_group_open(struct vfio_device *device)
+static int vfio_device_group_open(struct vfio_device_file *df)
 {
+	struct vfio_device *device = df->device;
 	int ret;
 
 	mutex_lock(&device->group->group_lock);
@@ -190,7 +191,11 @@ static int vfio_device_group_open(struct vfio_device *device)
 	if (device->open_count == 0)
 		vfio_device_group_get_kvm_safe(device);
 
-	ret = vfio_device_open(device, device->group->iommufd);
+	df->iommufd = device->group->iommufd;
+
+	ret = vfio_device_open(df);
+	if (ret)
+		df->iommufd = NULL;
 
 	if (device->open_count == 0)
 		vfio_device_put_kvm(device);
@@ -202,12 +207,14 @@ static int vfio_device_group_open(struct vfio_device *device)
 	return ret;
 }
 
-void vfio_device_group_close(struct vfio_device *device)
+void vfio_device_group_close(struct vfio_device_file *df)
 {
+	struct vfio_device *device = df->device;
+
 	mutex_lock(&device->group->group_lock);
 	mutex_lock(&device->dev_set->lock);
 
-	vfio_device_close(device, device->group->iommufd);
+	vfio_device_close(df);
 
 	if (device->open_count == 0)
 		vfio_device_put_kvm(device);
@@ -228,7 +235,7 @@ static struct file *vfio_device_open_file(struct vfio_device *device)
 		goto err_out;
 	}
 
-	ret = vfio_device_group_open(device);
+	ret = vfio_device_group_open(df);
 	if (ret)
 		goto err_free;
 
@@ -260,7 +267,7 @@ static struct file *vfio_device_open_file(struct vfio_device *device)
 	return filep;
 
 err_close_device:
-	vfio_device_group_close(device);
+	vfio_device_group_close(df);
 err_free:
 	kfree(df);
 err_out:
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 69e1a0692b06..7ced404526d9 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -20,13 +20,13 @@ struct vfio_device_file {
 	struct vfio_device *device;
 	spinlock_t kvm_ref_lock; /* protect kvm field */
 	struct kvm *kvm;
+	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
 };
 
 void vfio_device_put_registration(struct vfio_device *device);
 bool vfio_device_try_get_registration(struct vfio_device *device);
-int vfio_device_open(struct vfio_device *device, struct iommufd_ctx *iommufd);
-void vfio_device_close(struct vfio_device *device,
-		       struct iommufd_ctx *iommufd);
+int vfio_device_open(struct vfio_device_file *df);
+void vfio_device_close(struct vfio_device_file *df);
 struct vfio_device_file *
 vfio_allocate_device_file(struct vfio_device *device);
 
@@ -91,7 +91,7 @@ void vfio_device_group_register(struct vfio_device *device);
 void vfio_device_group_unregister(struct vfio_device *device);
 int vfio_device_group_use_iommu(struct vfio_device *device);
 void vfio_device_group_unuse_iommu(struct vfio_device *device);
-void vfio_device_group_close(struct vfio_device *device);
+void vfio_device_group_close(struct vfio_device_file *df);
 struct vfio_group *vfio_group_from_file(struct file *file);
 bool vfio_group_enforced_coherent(struct vfio_group *group);
 void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 03d5b2979f79..8c9b05f540fd 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -419,9 +419,10 @@ vfio_allocate_device_file(struct vfio_device *device)
 	return df;
 }
 
-static int vfio_device_first_open(struct vfio_device *device,
-				  struct iommufd_ctx *iommufd)
+static int vfio_device_first_open(struct vfio_device_file *df)
 {
+	struct vfio_device *device = df->device;
+	struct iommufd_ctx *iommufd = df->iommufd;
 	int ret;
 
 	lockdep_assert_held(&device->dev_set->lock);
@@ -453,9 +454,11 @@ static int vfio_device_first_open(struct vfio_device *device,
 	return ret;
 }
 
-static void vfio_device_last_close(struct vfio_device *device,
-				   struct iommufd_ctx *iommufd)
+static void vfio_device_last_close(struct vfio_device_file *df)
 {
+	struct vfio_device *device = df->device;
+	struct iommufd_ctx *iommufd = df->iommufd;
+
 	lockdep_assert_held(&device->dev_set->lock);
 
 	if (device->ops->close_device)
@@ -467,15 +470,16 @@ static void vfio_device_last_close(struct vfio_device *device,
 	module_put(device->dev->driver->owner);
 }
 
-int vfio_device_open(struct vfio_device *device, struct iommufd_ctx *iommufd)
+int vfio_device_open(struct vfio_device_file *df)
 {
+	struct vfio_device *device = df->device;
 	int ret = 0;
 
 	lockdep_assert_held(&device->dev_set->lock);
 
 	device->open_count++;
 	if (device->open_count == 1) {
-		ret = vfio_device_first_open(device, iommufd);
+		ret = vfio_device_first_open(df);
 		if (ret)
 			device->open_count--;
 	}
@@ -483,14 +487,15 @@ int vfio_device_open(struct vfio_device *device, struct iommufd_ctx *iommufd)
 	return ret;
 }
 
-void vfio_device_close(struct vfio_device *device,
-		       struct iommufd_ctx *iommufd)
+void vfio_device_close(struct vfio_device_file *df)
 {
+	struct vfio_device *device = df->device;
+
 	lockdep_assert_held(&device->dev_set->lock);
 
 	vfio_assert_device_open(device);
 	if (device->open_count == 1)
-		vfio_device_last_close(device, iommufd);
+		vfio_device_last_close(df);
 	device->open_count--;
 }
 
@@ -535,7 +540,7 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
-	vfio_device_group_close(device);
+	vfio_device_group_close(df);
 
 	vfio_device_put_registration(device);
 
-- 
2.34.1


* [Intel-gfx] [PATCH v6 07/24] vfio: Block device access via device fd until device is opened
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (5 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 06/24] vfio: Pass struct vfio_device_file * to vfio_device_open/close() Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  4:50   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 08/24] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

Allow the vfio_device file to be in a state where the device FD is
opened but the device cannot be used by userspace (i.e. its .open_device()
hasn't been called). This in-between state is not used when the device
FD is spawned from the group FD; however, when we create the device FD
directly by opening a cdev, it will be opened in the blocked state.

The reason for the in-between state is that userspace only gets an FD but
doesn't gain access permission until binding the FD to an iommufd. So in
the blocked state, only the bind operation is allowed. Completing the bind
allows the user to further access the device.

This is implemented by adding a flag in struct vfio_device_file to mark
the blocked state and using a simple smp_load_acquire() to obtain the
flag value and serialize all the device setup with the thread accessing
this device.

This lockless scheme safely handles the device FD transition
unbound->bound, but it cannot handle bound->unbound. Supporting the latter
would require a lock on all the vfio ioctls, which seems costly. So once
a device FD is bound, it remains bound until the FD is closed.
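
For illustration only, the release/acquire pairing can be modelled in
userspace with C11 atomics. This is a sketch under assumptions: the kernel
uses smp_store_release()/smp_load_acquire() on a plain bool, whereas the
model below uses <stdatomic.h>; struct device_file_model, grant_access()
and device_access() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct vfio_device_file. */
struct device_file_model {
	int device_state;		/* plain data set up before the grant */
	atomic_bool access_granted;	/* false: FD is open but blocked */
};

/* Open/bind side: finish the setup, then publish it with release order. */
static void grant_access(struct device_file_model *df, int state)
{
	df->device_state = state;
	atomic_store_explicit(&df->access_granted, true,
			      memory_order_release);
}

/*
 * ioctl/read/write/mmap side: the acquire load pairs with the release
 * store above, so a reader that observes the flag also observes the setup.
 */
static int device_access(struct device_file_model *df, int *state_out)
{
	if (!atomic_load_explicit(&df->access_granted,
				  memory_order_acquire))
		return -EINVAL;

	*state_out = df->device_state;
	return 0;
}
```

The flag only ever goes from false to true in this model, which matches
the one-way unbound->bound transition described above.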

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/group.c     | 11 ++++++++++-
 drivers/vfio/vfio.h      |  1 +
 drivers/vfio/vfio_main.c | 16 ++++++++++++++++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 160a4c891dda..4a220d5bf79b 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -194,9 +194,18 @@ static int vfio_device_group_open(struct vfio_device_file *df)
 	df->iommufd = device->group->iommufd;
 
 	ret = vfio_device_open(df);
-	if (ret)
+	if (ret) {
 		df->iommufd = NULL;
+		goto out_put_kvm;
+	}
+
+	/*
+	 * Paired with smp_load_acquire() in vfio_device_fops::ioctl/
+	 * read/write/mmap
+	 */
+	smp_store_release(&df->access_granted, true);
 
+out_put_kvm:
 	if (device->open_count == 0)
 		vfio_device_put_kvm(device);
 
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 7ced404526d9..e60c409868f8 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -18,6 +18,7 @@ struct vfio_container;
 
 struct vfio_device_file {
 	struct vfio_device *device;
+	bool access_granted;
 	spinlock_t kvm_ref_lock; /* protect kvm field */
 	struct kvm *kvm;
 	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8c9b05f540fd..027410e8d4a8 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1114,6 +1114,10 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 	struct vfio_device *device = df->device;
 	int ret;
 
+	/* Paired with smp_store_release() in vfio_device_group_open() */
+	if (!smp_load_acquire(&df->access_granted))
+		return -EINVAL;
+
 	ret = vfio_device_pm_runtime_get(device);
 	if (ret)
 		return ret;
@@ -1141,6 +1145,10 @@ static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
+	/* Paired with smp_store_release() in vfio_device_group_open() */
+	if (!smp_load_acquire(&df->access_granted))
+		return -EINVAL;
+
 	if (unlikely(!device->ops->read))
 		return -EINVAL;
 
@@ -1154,6 +1162,10 @@ static ssize_t vfio_device_fops_write(struct file *filep,
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
+	/* Paired with smp_store_release() in vfio_device_group_open() */
+	if (!smp_load_acquire(&df->access_granted))
+		return -EINVAL;
+
 	if (unlikely(!device->ops->write))
 		return -EINVAL;
 
@@ -1165,6 +1177,10 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
+	/* Paired with smp_store_release() in vfio_device_group_open() */
+	if (!smp_load_acquire(&df->access_granted))
+		return -EINVAL;
+
 	if (unlikely(!device->ops->mmap))
 		return -EINVAL;
 
-- 
2.34.1


* [Intel-gfx] [PATCH v6 08/24] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset()
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (6 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 07/24] vfio: Block device access via device fd until device is opened Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 09/24] vfio/pci: Only need to check opened devices in the dev_set for hot reset Yi Liu
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This better matches what the code does.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a5ab416cf476..65bbef562268 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1308,9 +1308,8 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	/*
-	 * For each group_fd, get the group through the vfio external user
-	 * interface and store the group and iommu ID.  This ensures the group
-	 * is held across the reset.
+	 * Get the group file for each fd to ensure the group is held across
+	 * the reset
 	 */
 	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
 		struct file *file = fget(group_fds[file_idx]);
-- 
2.34.1


* [Intel-gfx] [PATCH v6 09/24] vfio/pci: Only need to check opened devices in the dev_set for hot reset
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (7 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 08/24] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  5:00   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 10/24] vfio/pci: Rename the helpers and data in hot reset path to accept device fd Yi Liu
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

If an affected device is not opened by any user, it is not necessary to
check its ownership, because it cannot be opened by any user while a hot
reset of a device within this dev_set is in progress.
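
The resulting check can be sketched in userspace C as follows. This is an
illustrative model, not the kernel code: struct dev_model, dev_in_groups()
and hot_reset_check() are hypothetical, groups are represented by integer
ids, and the dev_set lock that makes skipping unopened devices safe is
not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one device in a dev_set. */
struct dev_model {
	unsigned int open_count;	/* 0: not opened by any user */
	int group_id;			/* group the device belongs to */
};

/* True if the device's group is among the groups the user provided. */
static bool dev_in_groups(const struct dev_model *dev,
			  const int *groups, size_t ngroups)
{
	for (size_t i = 0; i < ngroups; i++)
		if (groups[i] == dev->group_id)
			return true;
	return false;
}

/* Only opened devices need an ownership check; unopened ones are skipped. */
static int hot_reset_check(const struct dev_model *devs, size_t ndevs,
			   const int *groups, size_t ngroups)
{
	for (size_t i = 0; i < ndevs; i++) {
		if (devs[i].open_count &&
		    !dev_in_groups(&devs[i], groups, ngroups))
			return -EINVAL;
	}
	return 0;
}
```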

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 17 +++++++++++++++--
 include/uapi/linux/vfio.h        |  8 ++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 65bbef562268..f13b093557a9 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2429,10 +2429,23 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
-		 * Test whether all the affected devices are contained by the
+		 * Test whether all the affected devices can be reset by the
+		 * user.  The affected devices may already have been opened or
+		 * not yet.
+		 *
+		 * For the devices not opened yet, user can reset them.  The
+		 * reason is that the hot reset is done under the protection
+		 * of the dev_set->lock, and device open is also under this
+		 * lock.  During the hot reset, such devices can not be opened
+		 * by other users.
+		 *
+		 * For the devices that have been opened, needs to check the
+		 * ownership.  If the user provides a set of group fds, test
+		 * whether all the opened affected devices are contained by the
 		 * set of groups provided by the user.
 		 */
-		if (!vfio_dev_in_groups(cur_vma, groups)) {
+		if (cur_vma->vdev.open_count &&
+		    !vfio_dev_in_groups(cur_vma, groups)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0552e8dcf0cb..f96e5689cffc 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -673,6 +673,14 @@ struct vfio_pci_hot_reset_info {
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)
  *
+ * Userspace requests hot reset for the devices it uses.  Due to the
+ * underlying topology, multiple devices can be affected in the reset
+ * while some might be opened by another user.  To avoid interference
+ * the calling user must ensure all affected devices, if opened, are
+ * owned by itself.
+ *
+ * The ownership is proved by an array of group fds.
+ *
  * Return: 0 on success, -errno on failure.
  */
 struct vfio_pci_hot_reset {
-- 
2.34.1



* [Intel-gfx] [PATCH v6 10/24] vfio/pci: Rename the helpers and data in hot reset path to accept device fd
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (8 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 09/24] vfio/pci: Only need to check opened devices in the dev_set for hot reset Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  5:01   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

No functional change is intended. This only renames the helpers and
structures in preparation for accepting device fds as proof of device
ownership.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 40 ++++++++++++++++----------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f13b093557a9..265a0058436c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -177,10 +177,10 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 	}
 }
 
-struct vfio_pci_group_info;
+struct vfio_pci_user_file_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups);
+				      struct vfio_pci_user_file_info *user_info);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -799,7 +799,7 @@ static int vfio_pci_fill_devs(struct pci_dev *pdev, void *data)
 	return 0;
 }
 
-struct vfio_pci_group_info {
+struct vfio_pci_user_file_info {
 	int count;
 	struct file **files;
 };
@@ -1260,9 +1260,9 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 {
 	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
 	struct vfio_pci_hot_reset hdr;
-	int32_t *group_fds;
+	int32_t *user_fds;
 	struct file **files;
-	struct vfio_pci_group_info info;
+	struct vfio_pci_user_file_info info;
 	bool slot = false;
 	int file_idx, count = 0, ret = 0;
 
@@ -1292,17 +1292,17 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	if (!hdr.count || hdr.count > count)
 		return -EINVAL;
 
-	group_fds = kcalloc(hdr.count, sizeof(*group_fds), GFP_KERNEL);
+	user_fds = kcalloc(hdr.count, sizeof(*user_fds), GFP_KERNEL);
 	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
-	if (!group_fds || !files) {
-		kfree(group_fds);
+	if (!user_fds || !files) {
+		kfree(user_fds);
 		kfree(files);
 		return -ENOMEM;
 	}
 
-	if (copy_from_user(group_fds, arg->group_fds,
-			   hdr.count * sizeof(*group_fds))) {
-		kfree(group_fds);
+	if (copy_from_user(user_fds, arg->group_fds,
+			   hdr.count * sizeof(*user_fds))) {
+		kfree(user_fds);
 		kfree(files);
 		return -EFAULT;
 	}
@@ -1312,7 +1312,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	 * the reset
 	 */
 	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
-		struct file *file = fget(group_fds[file_idx]);
+		struct file *file = fget(user_fds[file_idx]);
 
 		if (!file) {
 			ret = -EBADF;
@@ -1329,9 +1329,9 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		files[file_idx] = file;
 	}
 
-	kfree(group_fds);
+	kfree(user_fds);
 
-	/* release reference to groups on error */
+	/* release reference to user_fds on error */
 	if (ret)
 		goto hot_reset_release;
 
@@ -2312,13 +2312,13 @@ const struct pci_error_handlers vfio_pci_core_err_handlers = {
 };
 EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers);
 
-static bool vfio_dev_in_groups(struct vfio_pci_core_device *vdev,
-			       struct vfio_pci_group_info *groups)
+static bool vfio_dev_in_user_fds(struct vfio_pci_core_device *vdev,
+				 struct vfio_pci_user_file_info *user_info)
 {
 	unsigned int i;
 
-	for (i = 0; i < groups->count; i++)
-		if (vfio_file_has_dev(groups->files[i], &vdev->vdev))
+	for (i = 0; i < user_info->count; i++)
+		if (vfio_file_has_dev(user_info->files[i], &vdev->vdev))
 			return true;
 	return false;
 }
@@ -2398,7 +2398,7 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_group_info *groups)
+				      struct vfio_pci_user_file_info *user_info)
 {
 	struct vfio_pci_core_device *cur_mem;
 	struct vfio_pci_core_device *cur_vma;
@@ -2445,7 +2445,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * set of groups provided by the user.
 		 */
 		if (cur_vma->vdev.open_count &&
-		    !vfio_dev_in_groups(cur_vma, groups)) {
+		    !vfio_dev_in_user_fds(cur_vma, user_info)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
-- 
2.34.1



* [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (9 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 10/24] vfio/pci: Rename the helpers and data in hot reset path to accept device fd Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  5:08   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

A VFIO PCI device hot reset requires the user to provide a set of fds to
prove ownership of the devices affected by the reset. Either group fds or
device fds can be used. However, a user of the vfio device cdev only has
device fds, hence VFIO_DEVICE_PCI_HOT_RESET needs to be extended to accept
device fds.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
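Editor's sketch, not part of the patch: how userspace might size and fill
the variable-length ioctl argument. struct model_hot_reset is a local
mirror of struct vfio_pci_hot_reset as this patch lays it out (with the
array named fds); the mirror and model_build_hot_reset() are hypothetical
names used so the example builds without kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of struct vfio_pci_hot_reset as modified by this patch. */
struct model_hot_reset {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	int32_t  fds[];		/* group fds or device fds */
};

/*
 * Allocate and fill the ioctl argument.  Returns NULL on allocation
 * failure; the caller releases the result with free().
 */
static struct model_hot_reset *model_build_hot_reset(const int32_t *fds,
						      uint32_t count)
{
	size_t sz = sizeof(struct model_hot_reset) +
		    (size_t)count * sizeof(int32_t);
	struct model_hot_reset *arg;

	assert(fds != NULL || count == 0);

	arg = calloc(1, sz);
	if (!arg)
		return NULL;

	arg->argsz = (uint32_t)sz;	/* header plus the fd array */
	arg->flags = 0;			/* the handler rejects non-zero flags */
	arg->count = count;
	if (count)
		memcpy(arg->fds, fds, (size_t)count * sizeof(int32_t));
	return arg;
}
```

A real caller would pass the equivalent kernel structure to
ioctl(device_fd, VFIO_DEVICE_PCI_HOT_RESET, arg) and free it afterwards.
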
 drivers/vfio/group.c             | 15 +-----------
 drivers/vfio/pci/vfio_pci_core.c | 22 +++++++++++------
 drivers/vfio/vfio.h              |  1 +
 drivers/vfio/vfio_main.c         | 42 ++++++++++++++++++++++++++++++++
 include/linux/vfio.h             |  1 +
 include/uapi/linux/vfio.h        |  6 +++--
 6 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 4a220d5bf79b..6280368eb0bd 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -852,23 +852,10 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
 	spin_unlock(&group->kvm_ref_lock);
 }
 
-/**
- * vfio_file_has_dev - True if the VFIO file is a handle for device
- * @file: VFIO file to check
- * @device: Device that must be part of the file
- *
- * Returns true if given file has permission to manipulate the given device.
- */
-bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
+bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device)
 {
-	struct vfio_group *group = vfio_group_from_file(file);
-
-	if (!group)
-		return false;
-
 	return group == device->group;
 }
-EXPORT_SYMBOL_GPL(vfio_file_has_dev);
 
 static char *vfio_devnode(const struct device *dev, umode_t *mode)
 {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 265a0058436c..123b468ead73 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1300,7 +1300,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return -ENOMEM;
 	}
 
-	if (copy_from_user(user_fds, arg->group_fds,
+	if (copy_from_user(user_fds, arg->fds,
 			   hdr.count * sizeof(*user_fds))) {
 		kfree(user_fds);
 		kfree(files);
@@ -1308,8 +1308,8 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	/*
-	 * Get the group file for each fd to ensure the group held across
-	 * the reset
+	 * Get the file for each fd to ensure the group/device file
+	 * is held across the reset
 	 */
 	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
 		struct file *file = fget(user_fds[file_idx]);
@@ -1319,8 +1319,14 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 			break;
 		}
 
-		/* Ensure the FD is a vfio group FD.*/
-		if (!vfio_file_is_group(file)) {
+		/*
+		 * For a vfio group FD, validating the file is enough.
+		 * For a vfio device FD, also ensure that access to the
+		 * device has been granted; otherwise it cannot be used
+		 * as proof of device ownership.
+		 */
+		if (!vfio_file_is_valid(file) ||
+		    (!vfio_file_is_group(file) && !vfio_file_has_device_access(file))) {
 			fput(file);
 			ret = -EINVAL;
 			break;
@@ -2440,9 +2446,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * by other users.
 		 *
 		 * For the devices that have been opened, needs to check the
-		 * ownership.  If the user provides a set of group fds, test
-		 * whether all the opened affected devices are contained by the
-		 * set of groups provided by the user.
+		 * ownership.  If the user provides a set of group/device
+		 * fds, test whether all the opened devices are contained
+		 * by the set of groups/devices provided by the user.
 		 */
 		if (cur_vma->vdev.open_count &&
 		    !vfio_dev_in_user_fds(cur_vma, user_info)) {
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index e60c409868f8..464263288d16 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -96,6 +96,7 @@ void vfio_device_group_close(struct vfio_device_file *df);
 struct vfio_group *vfio_group_from_file(struct file *file);
 bool vfio_group_enforced_coherent(struct vfio_group *group);
 void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
+bool vfio_group_has_dev(struct vfio_group *group, struct vfio_device *device);
 bool vfio_device_has_container(struct vfio_device *device);
 int __init vfio_group_init(void);
 void vfio_group_cleanup(void);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 027410e8d4a8..cf9994a65df3 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1277,6 +1277,48 @@ void vfio_file_set_kvm(struct file *file, struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(vfio_file_set_kvm);
 
+/**
+ * vfio_file_has_device_access - True if the file has opened device
+ * @file: VFIO device file
+ */
+bool vfio_file_has_device_access(struct file *file)
+{
+	struct vfio_device_file *df;
+
+	if (vfio_group_from_file(file) ||
+	    !vfio_device_from_file(file))
+		return false;
+
+	df = file->private_data;
+
+	return READ_ONCE(df->access_granted);
+}
+EXPORT_SYMBOL_GPL(vfio_file_has_device_access);
+
+/**
+ * vfio_file_has_dev - True if the VFIO file is a handle for device
+ * @file: VFIO file to check
+ * @device: Device that must be part of the file
+ *
+ * Returns true if given file has permission to manipulate the given device.
+ */
+bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
+{
+	struct vfio_group *group;
+	struct vfio_device *vdev;
+
+	group = vfio_group_from_file(file);
+	if (group)
+		return vfio_group_has_dev(group, device);
+
+	vdev = vfio_device_from_file(file);
+	if (vdev)
+		return vdev == device;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(vfio_file_has_dev);
+
 /*
  * Sub-module support
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b14dcdd0b71f..1c69be2d687e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -248,6 +248,7 @@ bool vfio_file_is_group(struct file *file);
 bool vfio_file_is_valid(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
+bool vfio_file_has_device_access(struct file *file);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
 
 #define VFIO_PIN_PAGES_MAX_ENTRIES	(PAGE_SIZE/sizeof(unsigned long))
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index f96e5689cffc..d80141969cd1 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -679,7 +679,9 @@ struct vfio_pci_hot_reset_info {
  * the calling user must ensure all affected devices, if opened, are
  * owned by itself.
  *
- * The ownership is proved by an array of group fds.
+ * The ownership can be proved by:
+ *   - An array of group fds
+ *   - An array of device fds
  *
  * Return: 0 on success, -errno on failure.
  */
@@ -687,7 +689,7 @@ struct vfio_pci_hot_reset {
 	__u32	argsz;
 	__u32	flags;
 	__u32	count;
-	__s32	group_fds[];
+	__s32	fds[];
 };
 
 #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
-- 
2.34.1



* [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (10 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  5:31   ` Tian, Kevin
  2023-03-15 22:53   ` Alex Williamson
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind() Yi Liu
                   ` (13 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This adds another way to issue a PCI hot reset for users that bind the
device to a positive iommufd value. In that case the iommufd serves as
proof of device ownership. By passing a zero-length fd array, the user
tells the kernel to do the ownership check against the bound iommufd. All
the opened devices within the affected dev_set must have been bound to the
same iommufd. This is simpler and faster, as the user does not need to
pass a set of fds and the kernel does not need to search for the device
within the given fds.

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
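Editor's sketch, not part of the patch: a userspace model of the two
ownership proofs after this change. model_ctx, model_dev and
model_hot_reset_check() are hypothetical stand-ins for iommufd_ctx,
vfio_pci_core_device and the combined check in
vfio_pci_dev_set_hot_reset(); fds are modelled as plain integer tokens.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ctx { int unused; };	/* stand-in for struct iommufd_ctx */

struct model_dev {
	unsigned int open_count;	/* 0 means not opened by anyone */
	int fd_token;			/* identifies the fd that owns it */
	const struct model_ctx *ctx;	/* bound iommufd ctx, NULL if none */
};

static bool model_dev_in_fds(const struct model_dev *dev,
			     const int *fds, size_t nfds)
{
	for (size_t i = 0; i < nfds; i++)
		if (fds[i] == dev->fd_token)
			return true;
	return false;
}

/*
 * nfds > 0:  ownership is proved by the fd array only.
 * nfds == 0: ownership is proved by the caller's bound iommufd ctx.
 */
static int model_hot_reset_check(const struct model_dev *devs, size_t ndevs,
				 const int *fds, size_t nfds,
				 const struct model_ctx *caller_ctx)
{
	assert(devs != NULL || ndevs == 0);

	if (nfds)
		caller_ctx = NULL;	/* fd path ignores the ctx */
	else if (!caller_ctx)
		return -EINVAL;		/* zero-length array needs a bound ctx */

	for (size_t i = 0; i < ndevs; i++) {
		if (!devs[i].open_count)
			continue;	/* unopened devices need no proof */
		if (model_dev_in_fds(&devs[i], fds, nfds))
			continue;
		if (caller_ctx && devs[i].ctx == caller_ctx)
			continue;
		return -EINVAL;
	}
	return 0;
}
```

The model keeps the property that an opened device bound to a different
context, or to none, makes the zero-length path fail with -EINVAL.
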
 drivers/iommu/iommufd/device.c   |  6 +++
 drivers/vfio/iommufd.c           |  9 ++++
 drivers/vfio/pci/vfio_pci_core.c | 92 ++++++++++++++++++++++----------
 include/linux/iommufd.h          |  1 +
 include/linux/vfio.h             |  3 ++
 include/uapi/linux/vfio.h        |  5 ++
 6 files changed, 89 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 9087cd8ed3ea..dbcee0d38a48 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -131,6 +131,12 @@ void iommufd_device_unbind(struct iommufd_device *idev)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_unbind, IOMMUFD);
 
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev)
+{
+	return idev->ictx;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_device_to_ictx, IOMMUFD);
+
 static int iommufd_device_setup_msi(struct iommufd_device *idev,
 				    struct iommufd_hw_pagetable *hwpt,
 				    phys_addr_t sw_msi_start)
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 768d353cb6fa..30c0da2e11f9 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -69,6 +69,15 @@ void vfio_iommufd_unbind(struct vfio_device *vdev)
 		vdev->ops->unbind_iommufd(vdev);
 }
 
+struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev)
+{
+	/* Only serve for physical device */
+	if (!vdev->iommufd_device)
+		return NULL;
+	return iommufd_device_to_ictx(vdev->iommufd_device);
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_ctx);
+
 /*
  * The physical standard ops mean that the iommufd_device is bound to the
  * physical device vdev->dev that was provided to vfio_init_group_dev(). Drivers
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 123b468ead73..b039fbd5c656 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -180,7 +180,8 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 struct vfio_pci_user_file_info;
 static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_user_file_info *user_info);
+				      struct vfio_pci_user_file_info *user_info,
+				      struct iommufd_ctx *iommufd_ctx);
 
 /*
  * INTx masking requires the ability to disable INTx signaling via PCI_COMMAND
@@ -1255,29 +1256,17 @@ static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	return ret;
 }
 
-static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+static int
+vfio_pci_ioctl_pci_hot_reset_user_files(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset *hdr,
+					bool slot,
 					struct vfio_pci_hot_reset __user *arg)
 {
-	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
-	struct vfio_pci_hot_reset hdr;
 	int32_t *user_fds;
 	struct file **files;
 	struct vfio_pci_user_file_info info;
-	bool slot = false;
 	int file_idx, count = 0, ret = 0;
 
-	if (copy_from_user(&hdr, arg, minsz))
-		return -EFAULT;
-
-	if (hdr.argsz < minsz || hdr.flags)
-		return -EINVAL;
-
-	/* Can we do a slot or bus reset or neither? */
-	if (!pci_probe_reset_slot(vdev->pdev->slot))
-		slot = true;
-	else if (pci_probe_reset_bus(vdev->pdev->bus))
-		return -ENODEV;
-
 	/*
 	 * We can't let userspace give us an arbitrarily large buffer to copy,
 	 * so verify how many we think there could be.  Note groups can have
@@ -1289,11 +1278,11 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		return ret;
 
 	/* Somewhere between 1 and count is OK */
-	if (!hdr.count || hdr.count > count)
+	if (hdr->count > count)
 		return -EINVAL;
 
-	user_fds = kcalloc(hdr.count, sizeof(*user_fds), GFP_KERNEL);
-	files = kcalloc(hdr.count, sizeof(*files), GFP_KERNEL);
+	user_fds = kcalloc(hdr->count, sizeof(*user_fds), GFP_KERNEL);
+	files = kcalloc(hdr->count, sizeof(*files), GFP_KERNEL);
 	if (!user_fds || !files) {
 		kfree(user_fds);
 		kfree(files);
@@ -1301,7 +1290,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	}
 
 	if (copy_from_user(user_fds, arg->fds,
-			   hdr.count * sizeof(*user_fds))) {
+			   hdr->count * sizeof(*user_fds))) {
 		kfree(user_fds);
 		kfree(files);
 		return -EFAULT;
@@ -1311,7 +1300,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	 * Get the file for each fd to ensure the group/device file
 	 * is held across the reset
 	 */
-	for (file_idx = 0; file_idx < hdr.count; file_idx++) {
+	for (file_idx = 0; file_idx < hdr->count; file_idx++) {
 		struct file *file = fget(user_fds[file_idx]);
 
 		if (!file) {
@@ -1341,10 +1330,10 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	if (ret)
 		goto hot_reset_release;
 
-	info.count = hdr.count;
+	info.count = hdr->count;
 	info.files = files;
 
-	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info);
+	ret = vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, &info, NULL);
 
 hot_reset_release:
 	for (file_idx--; file_idx >= 0; file_idx--)
@@ -1354,6 +1343,36 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_hot_reset __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_hot_reset, count);
+	struct vfio_pci_hot_reset hdr;
+	struct iommufd_ctx *iommufd;
+	bool slot = false;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz || hdr.flags)
+		return -EINVAL;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	if (hdr.count)
+		return vfio_pci_ioctl_pci_hot_reset_user_files(vdev, &hdr, slot, arg);
+
+	iommufd = vfio_iommufd_physical_ctx(&vdev->vdev);
+	if (!iommufd)
+		return -EINVAL;
+
+	return vfio_pci_dev_set_hot_reset(vdev->vdev.dev_set, NULL, iommufd);
+}
+
 static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				    struct vfio_device_ioeventfd __user *arg)
 {
@@ -2323,6 +2342,9 @@ static bool vfio_dev_in_user_fds(struct vfio_pci_core_device *vdev,
 {
 	unsigned int i;
 
+	if (!user_info)
+		return false;
+
 	for (i = 0; i < user_info->count; i++)
 		if (vfio_file_has_dev(user_info->files[i], &vdev->vdev))
 			return true;
@@ -2398,13 +2420,25 @@ static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
 	return ret;
 }
 
+static bool vfio_dev_in_iommufd_ctx(struct vfio_pci_core_device *vdev,
+				    struct iommufd_ctx *iommufd_ctx)
+{
+	struct iommufd_ctx *iommufd = vfio_iommufd_physical_ctx(&vdev->vdev);
+
+	if (!iommufd)
+		return false;
+
+	return iommufd == iommufd_ctx;
+}
+
 /*
  * We need to get memory_lock for each device, but devices can share mmap_lock,
  * therefore we need to zap and hold the vma_lock for each device, and only then
  * get each memory_lock.
  */
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
-				      struct vfio_pci_user_file_info *user_info)
+				      struct vfio_pci_user_file_info *user_info,
+				      struct iommufd_ctx *iommufd_ctx)
 {
 	struct vfio_pci_core_device *cur_mem;
 	struct vfio_pci_core_device *cur_vma;
@@ -2448,10 +2482,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * For the devices that have been opened, needs to check the
 		 * ownership.  If the user provides a set of group/device
 		 * fds, test whether all the opened devices are contained
-		 * by the set of groups/devices provided by the user.
+		 * by the set of groups/devices provided by the user.  If
+		 * user provides a zero-length array, the ownership check
+		 * is done by checking if all the opened devices are bound
+		 * to the same iommufd_ctx.
 		 */
 		if (cur_vma->vdev.open_count &&
-		    !vfio_dev_in_user_fds(cur_vma, user_info)) {
+		    !vfio_dev_in_user_fds(cur_vma, user_info) &&
+		    !vfio_dev_in_iommufd_ctx(cur_vma, iommufd_ctx)) {
 			ret = -EINVAL;
 			goto err_undo;
 		}
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 365f11e8e615..7a0d7f2c4237 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -20,6 +20,7 @@ struct file;
 struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx,
 					   struct device *dev, u32 *id);
 void iommufd_device_unbind(struct iommufd_device *idev);
+struct iommufd_ctx *iommufd_device_to_ictx(struct iommufd_device *idev);
 
 int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id);
 void iommufd_device_detach(struct iommufd_device *idev);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 1c69be2d687e..fc14f8430a10 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -116,6 +116,7 @@ struct vfio_device_ops {
 int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
+struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev);
 int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
@@ -127,6 +128,8 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 		  u32 *out_device_id)) NULL)
 #define vfio_iommufd_physical_unbind \
 	((void (*)(struct vfio_device *vdev)) NULL)
+#define vfio_iommufd_physical_ctx \
+	((struct iommufd_ctx * (*)(struct vfio_device *vdev)) NULL)
 #define vfio_iommufd_physical_attach_ioas \
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
 #define vfio_iommufd_emulated_bind                                      \
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index d80141969cd1..382d95455f89 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
  * The ownership can be proved by:
  *   - An array of group fds
  *   - An array of device fds
+ *   - A zero-length array
+ *
+ * In the last case all affected devices which are opened by this user
+ * must have been bound to the same iommufd_ctx.  This approach is only
+ * available for devices bound to a positive iommufd.
  *
  * Return: 0 on success, -errno on failure.
  */
-- 
2.34.1



* [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind()
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (11 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  8:08   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 14/24] vfio: Add cdev_device_open_cnt to vfio_group Yi Liu
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This makes the group code call .bind_iommufd and .attach_ioas in two steps
instead of in a single step. This prepares for bind_iommufd and
attach_ioas support in the coming cdev path.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
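Editor's sketch, not part of the patch: a userspace model of the two-step
sequence and its unwind order. model_dev, model_bind(), model_attach()
and model_group_open() are hypothetical; the model assumes the open step
is where the bind happens, and it undoes the bind if the later attach
fails, much as the group path closes the device on attach failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state for the model. */
struct model_dev {
	bool bound;		/* step 1 done: bound to an iommufd */
	bool attached;		/* step 2 done: attached to the compat ioas */
	bool attach_fails;	/* test knob: force step 2 to fail */
};

static int model_bind(struct model_dev *dev)
{
	dev->bound = true;
	return 0;
}

static void model_unbind(struct model_dev *dev)
{
	dev->bound = false;
}

static int model_attach(struct model_dev *dev)
{
	if (dev->attach_fails)
		return -ENODEV;
	dev->attached = true;
	return 0;
}

/* Bind first, attach second; undo the bind if the attach fails. */
static int model_group_open(struct model_dev *dev)
{
	int ret;

	assert(dev != NULL);

	ret = model_bind(dev);
	if (ret)
		return ret;

	ret = model_attach(dev);
	if (ret)
		goto err_unbind;

	return 0;

err_unbind:
	model_unbind(dev);
	return ret;
}
```

Splitting the steps lets a caller that only wants the bind, such as the
planned cdev path, stop after the first step and attach later.
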
 drivers/vfio/group.c   | 26 ++++++++++-----
 drivers/vfio/iommufd.c | 75 ++++++++++++++++++++++++++----------------
 drivers/vfio/vfio.h    |  8 +++++
 3 files changed, 73 insertions(+), 36 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 6280368eb0bd..555d68aefa71 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -177,7 +177,7 @@ static int vfio_device_group_open(struct vfio_device_file *df)
 	mutex_lock(&device->group->group_lock);
 	if (!vfio_group_has_iommu(device->group)) {
 		ret = -EINVAL;
-		goto out_unlock;
+		goto err_unlock;
 	}
 
 	mutex_lock(&device->dev_set->lock);
@@ -194,9 +194,14 @@ static int vfio_device_group_open(struct vfio_device_file *df)
 	df->iommufd = device->group->iommufd;
 
 	ret = vfio_device_open(df);
-	if (ret) {
-		df->iommufd = NULL;
-		goto out_put_kvm;
+	if (ret)
+		goto err_put_kvm;
+
+	if (device->group->iommufd) {
+		ret = vfio_iommufd_attach_compat_ioas(device,
+						      device->group->iommufd);
+		if (ret)
+			goto err_close_device;
 	}
 
 	/*
@@ -205,13 +210,18 @@ static int vfio_device_group_open(struct vfio_device_file *df)
 	 */
 	smp_store_release(&df->access_granted, true);
 
-out_put_kvm:
+	mutex_unlock(&device->dev_set->lock);
+	mutex_unlock(&device->group->group_lock);
+	return 0;
+
+err_close_device:
+	vfio_device_close(df);
+err_put_kvm:
+	df->iommufd = NULL;
 	if (device->open_count == 0)
 		vfio_device_put_kvm(device);
-
 	mutex_unlock(&device->dev_set->lock);
-
-out_unlock:
+err_unlock:
 	mutex_unlock(&device->group->group_lock);
 	return ret;
 }
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 30c0da2e11f9..8c518f8bd39a 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -10,52 +10,71 @@
 MODULE_IMPORT_NS(IOMMUFD);
 MODULE_IMPORT_NS(IOMMUFD_VFIO);
 
-int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
+static int vfio_iommufd_device_probe_comapt_noiommu(struct vfio_device *vdev,
+						    struct iommufd_ctx *ictx)
 {
 	u32 ioas_id;
+
+	if (!capable(CAP_SYS_RAWIO))
+		return -EPERM;
+
+	/*
+	 * Require no compat ioas to be assigned to proceed.  The basic
+	 * statement is that the user cannot have done something that
+	 * implies they expected translation to exist
+	 */
+	if (!iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id))
+		return -EPERM;
+	return 0;
+}
+
+int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
+{
 	u32 device_id;
 	int ret;
 
 	lockdep_assert_held(&vdev->dev_set->lock);
 
 	if (vfio_device_is_noiommu(vdev)) {
-		if (!capable(CAP_SYS_RAWIO))
-			return -EPERM;
-
-		/*
-		 * Require no compat ioas to be assigned to proceed. The basic
-		 * statement is that the user cannot have done something that
-		 * implies they expected translation to exist
-		 */
-		if (!iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id))
-			return -EPERM;
-		return 0;
+		ret = vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx);
+		if (ret)
+			return ret;
 	}
 
 	if (WARN_ON(!vdev->ops->bind_iommufd))
 		return -ENODEV;
 
-	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
-	if (ret)
-		return ret;
+	/* The legacy path has no way to return the device id */
+	return vdev->ops->bind_iommufd(vdev, ictx, &device_id);
+}
 
-	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
-	if (ret)
-		goto err_unbind;
-	ret = vdev->ops->attach_ioas(vdev, &ioas_id);
-	if (ret)
-		goto err_unbind;
+int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
+				    struct iommufd_ctx *ictx)
+{
+	u32 ioas_id;
+	int ret;
+
+	lockdep_assert_held(&vdev->dev_set->lock);
 
 	/*
-	 * The legacy path has no way to return the device id or the selected
-	 * pt_id
+	 * If the driver doesn't provide this op then it means the device does
+	 * not do DMA at all. So nothing to do.
 	 */
-	return 0;
+	if (WARN_ON(!vdev->ops->bind_iommufd))
+		return -ENODEV;
 
-err_unbind:
-	if (vdev->ops->unbind_iommufd)
-		vdev->ops->unbind_iommufd(vdev);
-	return ret;
+	if (vfio_device_is_noiommu(vdev)) {
+		if (WARN_ON(vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx)))
+			return -EINVAL;
+		return 0;
+	}
+
+	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
+	if (ret)
+		return ret;
+
+	/* The legacy path has no way to return the selected pt_id */
+	return vdev->ops->attach_ioas(vdev, &ioas_id);
 }
 
 void vfio_iommufd_unbind(struct vfio_device *vdev)
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 464263288d16..3356321805e9 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -232,6 +232,8 @@ static inline void vfio_container_cleanup(void)
 #if IS_ENABLED(CONFIG_IOMMUFD)
 int vfio_iommufd_bind(struct vfio_device *device, struct iommufd_ctx *ictx);
 void vfio_iommufd_unbind(struct vfio_device *device);
+int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
+				    struct iommufd_ctx *ictx);
 #else
 static inline int vfio_iommufd_bind(struct vfio_device *device,
 				    struct iommufd_ctx *ictx)
@@ -242,6 +244,12 @@ static inline int vfio_iommufd_bind(struct vfio_device *device,
 static inline void vfio_iommufd_unbind(struct vfio_device *device)
 {
 }
+
+static inline int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
+						  struct iommufd_ctx *ictx)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 #if IS_ENABLED(CONFIG_VFIO_VIRQFD)
-- 
2.34.1



* [Intel-gfx] [PATCH v6 14/24] vfio: Add cdev_device_open_cnt to vfio_group
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (12 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind() Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 15/24] vfio: Make vfio_device_open() single open for device cdev path Yi Liu
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This adds cdev_device_open_cnt to vfio_group for counting the devices
that are opened via the cdev path. The count is increased and decreased
by the cdev path, and the group path checks it to achieve exclusion with
the cdev path. With this, only one path (group path or cdev path) will
claim DMA ownership. This avoids scenarios in which devices within the
same group are opened via different paths.
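
As a rough userspace illustration of the exclusion described above (a
simplified sketch, not the kernel code: the struct and the model_* names
are hypothetical, and the group_lock locking is left out), the two paths
can be modeled like this:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two vfio_group fields involved. */
struct model_group {
	bool opened_file;                  /* the group fd is currently open */
	unsigned int cdev_device_open_cnt; /* devices opened via the cdev path */
};

/* cdev path: refuse if the group fd is open, otherwise take a count. */
static int model_block_group(struct model_group *g)
{
	if (g->opened_file)
		return -EBUSY;
	g->cdev_device_open_cnt++;
	return 0;
}

/* cdev path: drop the count taken by model_block_group(). */
static void model_unblock_group(struct model_group *g)
{
	assert(g->cdev_device_open_cnt > 0);
	g->cdev_device_open_cnt--;
}

/* group path: refuse while any device of the group is open via cdev. */
static int model_group_open(struct model_group *g)
{
	if (g->opened_file || g->cdev_device_open_cnt)
		return -EBUSY;
	g->opened_file = true;
	return 0;
}
```

Whichever path gets there first wins, and the other path sees -EBUSY
until the first one has fully released the group.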

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/group.c | 33 +++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h  |  3 +++
 2 files changed, 36 insertions(+)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 555d68aefa71..63aafd59278d 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -392,6 +392,33 @@ static long vfio_group_fops_unl_ioctl(struct file *filep,
 	}
 }
 
+int vfio_device_block_group(struct vfio_device *device)
+{
+	struct vfio_group *group = device->group;
+	int ret = 0;
+
+	mutex_lock(&group->group_lock);
+	if (group->opened_file) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	group->cdev_device_open_cnt++;
+
+out_unlock:
+	mutex_unlock(&group->group_lock);
+	return ret;
+}
+
+void vfio_device_unblock_group(struct vfio_device *device)
+{
+	struct vfio_group *group = device->group;
+
+	mutex_lock(&group->group_lock);
+	group->cdev_device_open_cnt--;
+	mutex_unlock(&group->group_lock);
+}
+
 static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 {
 	struct vfio_group *group =
@@ -414,6 +441,11 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 		goto out_unlock;
 	}
 
+	if (group->cdev_device_open_cnt) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	/*
 	 * Do we need multiple instances of the group open?  Seems not.
 	 */
@@ -488,6 +520,7 @@ static void vfio_group_release(struct device *dev)
 	mutex_destroy(&group->device_lock);
 	mutex_destroy(&group->group_lock);
 	WARN_ON(group->iommu_group);
+	WARN_ON(group->cdev_device_open_cnt);
 	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
 	kfree(group);
 }
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 3356321805e9..e1b8763e443e 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -83,8 +83,11 @@ struct vfio_group {
 	struct blocking_notifier_head	notifier;
 	struct iommufd_ctx		*iommufd;
 	spinlock_t			kvm_ref_lock;
+	unsigned int			cdev_device_open_cnt;
 };
 
+int vfio_device_block_group(struct vfio_device *device);
+void vfio_device_unblock_group(struct vfio_device *device);
 int vfio_device_set_group(struct vfio_device *device,
 			  enum vfio_group_type type);
 void vfio_device_remove_group(struct vfio_device *device);
-- 
2.34.1



* [Intel-gfx] [PATCH v6 15/24] vfio: Make vfio_device_open() single open for device cdev path
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (13 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 14/24] vfio: Add cdev_device_open_cnt to vfio_group Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 16/24] vfio: Make vfio_device_first_open() to cover the noiommu mode in " Yi Liu
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

VFIO group has historically allowed multi-open of the device FD. This
was made secure because the "open" was executed via an ioctl to the
group FD which is itself only single open.

However, there is no known use of multiple device FDs today. It is kind of a
strange thing to do because new device FDs can naturally be created
via dup().

When we implement the new device uAPI (only used in the cdev path) there
is no natural way to allow the device itself to be multi-opened in a
secure manner. Without the group FD we cannot prove the security context
of the opener.

Thus, when moving to the new uAPI we block the ability to multi-open
the device. The old group path still allows it.

vfio_device_open() needs to support both the legacy behavior, i.e.
multi-open in the group path, and the new behavior, i.e. single-open in
the cdev path. To tell the two apart, a vfio_group pointer is stored in
struct vfio_device_file. Only the group path sets it; the cdev path never
sets it.
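
The resulting open policy can be sketched in plain userspace C as below.
This is only an illustration under simplifying assumptions: the model_*
types are hypothetical, locking and the first-open work are omitted, and
only the open_count check from this patch is reproduced.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical reduced device: only the open counter matters here. */
struct model_device {
	unsigned int open_count;
};

/* Hypothetical reduced per-file state. */
struct model_device_file {
	struct model_device *device;
	const void *group; /* non-NULL only when opened via the group path */
};

static int model_device_open(struct model_device_file *df)
{
	struct model_device *device = df->device;

	assert(device != NULL);

	/* Only the group path may open an already opened device. */
	if (device->open_count != 0 && !df->group)
		return -EINVAL;

	device->open_count++;
	return 0;
}
```

A second open through the group path succeeds, while a second open
without a group pointer (the cdev path) fails with -EINVAL.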

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
 drivers/vfio/group.c     | 2 ++
 drivers/vfio/vfio.h      | 2 ++
 drivers/vfio/vfio_main.c | 7 +++++++
 3 files changed, 11 insertions(+)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 63aafd59278d..f067a6a7bbd2 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -254,6 +254,8 @@ static struct file *vfio_device_open_file(struct vfio_device *device)
 		goto err_out;
 	}
 
+	df->group = device->group;
+
 	ret = vfio_device_group_open(df);
 	if (ret)
 		goto err_free;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index e1b8763e443e..11397cc95e0b 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -18,6 +18,8 @@ struct vfio_container;
 
 struct vfio_device_file {
 	struct vfio_device *device;
+	struct vfio_group *group;
+
 	bool access_granted;
 	spinlock_t kvm_ref_lock; /* protect kvm field */
 	struct kvm *kvm;
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index cf9994a65df3..aa31aae33407 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -477,6 +477,13 @@ int vfio_device_open(struct vfio_device_file *df)
 
 	lockdep_assert_held(&device->dev_set->lock);
 
+	/*
+	 * Only group path supports multiple device open. cdev path
+	 * doesn't have a secure way for it.
+	 */
+	if (device->open_count != 0 && !df->group)
+		return -EINVAL;
+
 	device->open_count++;
 	if (device->open_count == 1) {
 		ret = vfio_device_first_open(df);
-- 
2.34.1



* [Intel-gfx] [PATCH v6 16/24] vfio: Make vfio_device_first_open() to cover the noiommu mode in cdev path
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (14 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 15/24] vfio: Make vfio_device_open() single open for device cdev path Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  8:30   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 17/24] vfio-iommufd: Make vfio_iommufd_bind() selectively return devid Yi Liu
                   ` (9 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

vfio_device_first_open() now covers the below two cases:

1) user uses iommufd (e.g. the group path in iommufd compat mode);
2) user uses container (e.g. the group path in legacy mode);

The above two paths have their own noiommu mode support accordingly.

The cdev path also uses iommufd, so when the user provides a valid
iommufd, this helper already supports it. In noiommu mode, however, the
cdev path provides a NULL iommufd, so the helper needs to cover that case
too. As there is nothing special to do for the cdev path in noiommu mode,
it can be covered by simply differentiating it from the container case:
if the user is using neither iommufd nor a container, it is the noiommu
mode.
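
The three-way decision can be summarized with the small userspace model
below. It is a sketch only: the enum, the model_* types and the
has_container flag are hypothetical simplifications, and the NULL group
is treated here as "no container" rather than as a warning condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_backend {
	MODEL_IOMMUFD,   /* group compat mode or cdev path with iommufd */
	MODEL_CONTAINER, /* group path in legacy mode */
	MODEL_NOIOMMU,   /* cdev path without iommufd */
};

/* Hypothetical reduced group: only "is a container attached" matters. */
struct model_group {
	bool has_container;
};

struct model_device_file {
	const void *iommufd;             /* NULL when no iommufd was given */
	const struct model_group *group; /* NULL in the cdev path */
};

static enum model_backend model_pick_backend(const struct model_device_file *df)
{
	assert(df != NULL);

	if (df->iommufd)
		return MODEL_IOMMUFD;
	if (df->group && df->group->has_container)
		return MODEL_CONTAINER;
	/* Neither iommufd nor container: nothing to bind or attach. */
	return MODEL_NOIOMMU;
}
```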

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/group.c     | 15 +++++++++++++++
 drivers/vfio/vfio.h      |  1 +
 drivers/vfio/vfio_main.c | 21 +++++++++++++++++----
 3 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index f067a6a7bbd2..51c027134814 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -780,6 +780,21 @@ void vfio_device_group_unregister(struct vfio_device *device)
 	mutex_unlock(&device->group->device_lock);
 }
 
+/*
+ * This shall be used without group lock as group and group->container
+ * should be fixed before group is set to df->group.
+ */
+bool vfio_device_group_uses_container(struct vfio_device_file *df)
+{
+	/*
+	 * Use the df->group instead of the df->device->group as no
+	 * lock is acquired here.
+	 */
+	if (WARN_ON(!df->group))
+		return false;
+	return READ_ONCE(df->group->container);
+}
+
 int vfio_device_group_use_iommu(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 11397cc95e0b..615ffd58562b 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -95,6 +95,7 @@ int vfio_device_set_group(struct vfio_device *device,
 void vfio_device_remove_group(struct vfio_device *device);
 void vfio_device_group_register(struct vfio_device *device);
 void vfio_device_group_unregister(struct vfio_device *device);
+bool vfio_device_group_uses_container(struct vfio_device_file *df);
 int vfio_device_group_use_iommu(struct vfio_device *device);
 void vfio_device_group_unuse_iommu(struct vfio_device *device);
 void vfio_device_group_close(struct vfio_device_file *df);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index aa31aae33407..8c73df1a400e 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -423,16 +423,29 @@ static int vfio_device_first_open(struct vfio_device_file *df)
 {
 	struct vfio_device *device = df->device;
 	struct iommufd_ctx *iommufd = df->iommufd;
-	int ret;
+	int ret = 0;
 
 	lockdep_assert_held(&device->dev_set->lock);
 
 	if (!try_module_get(device->dev->driver->owner))
 		return -ENODEV;
 
+	/*
+	 * The handling here depends on what the user is using.
+	 *
+	 * If user uses iommufd in the group compat mode or the
+	 * cdev path, call vfio_iommufd_bind().
+	 *
+	 * If user uses container in the group legacy mode, call
+	 * vfio_device_group_use_iommu().
+	 *
+	 * If user doesn't use iommufd nor container, this is
+	 * the noiommu mode in the cdev path, nothing needs
+	 * to be done here just go ahead to open device.
+	 */
 	if (iommufd)
 		ret = vfio_iommufd_bind(device, iommufd);
-	else
+	else if (vfio_device_group_uses_container(df))
 		ret = vfio_device_group_use_iommu(device);
 	if (ret)
 		goto err_module_put;
@@ -447,7 +460,7 @@ static int vfio_device_first_open(struct vfio_device_file *df)
 err_unuse_iommu:
 	if (iommufd)
 		vfio_iommufd_unbind(device);
-	else
+	else if (vfio_device_group_uses_container(df))
 		vfio_device_group_unuse_iommu(device);
 err_module_put:
 	module_put(device->dev->driver->owner);
@@ -465,7 +478,7 @@ static void vfio_device_last_close(struct vfio_device_file *df)
 		device->ops->close_device(device);
 	if (iommufd)
 		vfio_iommufd_unbind(device);
-	else
+	else if (vfio_device_group_uses_container(df))
 		vfio_device_group_unuse_iommu(device);
 	module_put(device->dev->driver->owner);
 }
-- 
2.34.1



* [Intel-gfx] [PATCH v6 17/24] vfio-iommufd: Make vfio_iommufd_bind() selectively return devid
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (15 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 16/24] vfio: Make vfio_device_first_open() to cover the noiommu mode in " Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  8:31   ` Tian, Kevin
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 18/24] vfio-iommufd: Add detach_ioas support for physical VFIO devices Yi Liu
                   ` (8 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

bind_iommufd() generates an ID to represent this bond, and userspace
needs that ID for further use. The devid is stored in vfio_device_file to
avoid passing a devid pointer in multiple places.
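
The "selectively return" part is the usual optional out-parameter
pattern. A minimal userspace sketch follows; model_driver_bind() and the
fixed ID 42 are made up for illustration and do not correspond to any
real driver op.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_INVALID_ID 0

/* Hypothetical driver bind op: always succeeds and reports a fixed ID. */
static int model_driver_bind(uint32_t *out_device_id)
{
	assert(out_device_id != NULL);
	*out_device_id = 42;
	return 0;
}

/* The caller may pass NULL when it has no use for the device ID. */
static int model_bind(uint32_t *devid)
{
	uint32_t device_id = MODEL_INVALID_ID;
	int ret;

	ret = model_driver_bind(&device_id);
	if (ret)
		return ret;

	if (devid)
		*devid = device_id;

	return 0;
}
```

A caller that does not care about the ID passes NULL; a caller that does
passes a pointer to where the ID should be stored.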

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/iommufd.c   | 13 ++++++++++---
 drivers/vfio/vfio.h      |  6 ++++--
 drivers/vfio/vfio_main.c |  8 +++++---
 include/linux/iommufd.h  |  2 ++
 4 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index 8c518f8bd39a..b2cdb6b2b37f 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -28,7 +28,8 @@ static int vfio_iommufd_device_probe_comapt_noiommu(struct vfio_device *vdev,
 	return 0;
 }
 
-int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
+int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx,
+		      u32 *devid)
 {
 	u32 device_id;
 	int ret;
@@ -44,8 +45,14 @@ int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
 	if (WARN_ON(!vdev->ops->bind_iommufd))
 		return -ENODEV;
 
-	/* The legacy path has no way to return the device id */
-	return vdev->ops->bind_iommufd(vdev, ictx, &device_id);
+	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
+	if (ret)
+		return ret;
+
+	if (devid)
+		*devid = device_id;
+
+	return 0;
 }
 
 int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 615ffd58562b..98cee2f765e9 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -24,6 +24,7 @@ struct vfio_device_file {
 	spinlock_t kvm_ref_lock; /* protect kvm field */
 	struct kvm *kvm;
 	struct iommufd_ctx *iommufd; /* protected by struct vfio_device_set::lock */
+	u32 devid; /* only valid when iommufd is valid */
 };
 
 void vfio_device_put_registration(struct vfio_device *device);
@@ -236,13 +237,14 @@ static inline void vfio_container_cleanup(void)
 #endif
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-int vfio_iommufd_bind(struct vfio_device *device, struct iommufd_ctx *ictx);
+int vfio_iommufd_bind(struct vfio_device *device, struct iommufd_ctx *ictx,
+		      u32 *devid);
 void vfio_iommufd_unbind(struct vfio_device *device);
 int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
 				    struct iommufd_ctx *ictx);
 #else
 static inline int vfio_iommufd_bind(struct vfio_device *device,
-				    struct iommufd_ctx *ictx)
+				    struct iommufd_ctx *ictx, u32 *devid)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 8c73df1a400e..a66ca138059b 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -444,7 +444,7 @@ static int vfio_device_first_open(struct vfio_device_file *df)
 	 * to be done here just go ahead to open device.
 	 */
 	if (iommufd)
-		ret = vfio_iommufd_bind(device, iommufd);
+		ret = vfio_iommufd_bind(device, iommufd, &df->devid);
 	else if (vfio_device_group_uses_container(df))
 		ret = vfio_device_group_use_iommu(device);
 	if (ret)
@@ -476,10 +476,12 @@ static void vfio_device_last_close(struct vfio_device_file *df)
 
 	if (device->ops->close_device)
 		device->ops->close_device(device);
-	if (iommufd)
+	if (iommufd) {
 		vfio_iommufd_unbind(device);
-	else if (vfio_device_group_uses_container(df))
+		df->devid = IOMMUFD_INVALID_ID;
+	} else if (vfio_device_group_uses_container(df)) {
 		vfio_device_group_unuse_iommu(device);
+	}
 	module_put(device->dev->driver->owner);
 }
 
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 7a0d7f2c4237..48b9bfab9891 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -10,6 +10,8 @@
 #include <linux/errno.h>
 #include <linux/err.h>
 
+#define IOMMUFD_INVALID_ID 0
+
 struct device;
 struct iommufd_device;
 struct page;
-- 
2.34.1



* [Intel-gfx] [PATCH v6 18/24] vfio-iommufd: Add detach_ioas support for physical VFIO devices
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (16 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 17/24] vfio-iommufd: Make vfio_iommufd_bind() selectively return devid Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated " Yi Liu
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This prepares for adding the DETACH ioctl for physical VFIO devices.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 Documentation/driver-api/vfio.rst             |  8 +++++---
 drivers/vfio/fsl-mc/vfio_fsl_mc.c             |  1 +
 drivers/vfio/iommufd.c                        | 20 +++++++++++++++++++
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    |  2 ++
 drivers/vfio/pci/mlx5/main.c                  |  1 +
 drivers/vfio/pci/vfio_pci.c                   |  1 +
 drivers/vfio/platform/vfio_amba.c             |  1 +
 drivers/vfio/platform/vfio_platform.c         |  1 +
 drivers/vfio/vfio_main.c                      |  3 ++-
 include/linux/vfio.h                          |  8 +++++++-
 10 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index 50b690f7f663..44527420f20d 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -279,6 +279,7 @@ similar to a file operations structure::
 					struct iommufd_ctx *ictx, u32 *out_device_id);
 		void	(*unbind_iommufd)(struct vfio_device *vdev);
 		int	(*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
+		void	(*detach_ioas)(struct vfio_device *vdev);
 		int	(*open_device)(struct vfio_device *vdev);
 		void	(*close_device)(struct vfio_device *vdev);
 		ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
@@ -315,9 +316,10 @@ container_of().
 	- The [un]bind_iommufd callbacks are issued when the device is bound to
 	  and unbound from iommufd.
 
-	- The attach_ioas callback is issued when the device is attached to an
-	  IOAS managed by the bound iommufd. The attached IOAS is automatically
-	  detached when the device is unbound from iommufd.
+	- The [de]attach_ioas callback is issued when the device is attached to
+	  and detached from an IOAS managed by the bound iommufd. However, the
+	  attached IOAS can also be automatically detached when the device is
+	  unbound from iommufd.
 
 	- The read/write/mmap callbacks implement the device region access defined
 	  by the device's own VFIO_DEVICE_GET_REGION_INFO ioctl.
diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
index c89a047a4cd8..d540cf683d93 100644
--- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
+++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
@@ -594,6 +594,7 @@ static const struct vfio_device_ops vfio_fsl_mc_ops = {
 	.bind_iommufd	= vfio_iommufd_physical_bind,
 	.unbind_iommufd	= vfio_iommufd_physical_unbind,
 	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 };
 
 static struct fsl_mc_driver vfio_fsl_mc_driver = {
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index b2cdb6b2b37f..c06494e322f9 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -139,6 +139,14 @@ int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
 {
 	int rc;
 
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_device))
+		return -EINVAL;
+
+	if (vdev->iommufd_attached)
+		return -EBUSY;
+
 	rc = iommufd_device_attach(vdev->iommufd_device, pt_id);
 	if (rc)
 		return rc;
@@ -147,6 +155,18 @@ int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_physical_attach_ioas);
 
+void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev)
+{
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_device || !vdev->iommufd_attached))
+		return;
+
+	iommufd_device_detach(vdev->iommufd_device);
+	vdev->iommufd_attached = false;
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_physical_detach_ioas);
+
 /*
  * The emulated standard ops mean that vfio_device is going to use the
  * "mdev path" and will call vfio_pin_pages()/vfio_dma_rw(). Drivers using this
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index a117eaf21c14..b2f9778c8366 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -1373,6 +1373,7 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = {
 	.bind_iommufd = vfio_iommufd_physical_bind,
 	.unbind_iommufd = vfio_iommufd_physical_unbind,
 	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+	.detach_ioas = vfio_iommufd_physical_detach_ioas,
 };
 
 static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
@@ -1391,6 +1392,7 @@ static const struct vfio_device_ops hisi_acc_vfio_pci_ops = {
 	.bind_iommufd = vfio_iommufd_physical_bind,
 	.unbind_iommufd = vfio_iommufd_physical_unbind,
 	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+	.detach_ioas = vfio_iommufd_physical_detach_ioas,
 };
 
 static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index e897537a9e8a..6fc3410989eb 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -1326,6 +1326,7 @@ static const struct vfio_device_ops mlx5vf_pci_ops = {
 	.bind_iommufd = vfio_iommufd_physical_bind,
 	.unbind_iommufd = vfio_iommufd_physical_unbind,
 	.attach_ioas = vfio_iommufd_physical_attach_ioas,
+	.detach_ioas = vfio_iommufd_physical_detach_ioas,
 };
 
 static int mlx5vf_pci_probe(struct pci_dev *pdev,
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 29091ee2e984..cb5b7f865d58 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -141,6 +141,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.bind_iommufd	= vfio_iommufd_physical_bind,
 	.unbind_iommufd	= vfio_iommufd_physical_unbind,
 	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/platform/vfio_amba.c b/drivers/vfio/platform/vfio_amba.c
index 83fe54015595..6464b3939ebc 100644
--- a/drivers/vfio/platform/vfio_amba.c
+++ b/drivers/vfio/platform/vfio_amba.c
@@ -119,6 +119,7 @@ static const struct vfio_device_ops vfio_amba_ops = {
 	.bind_iommufd	= vfio_iommufd_physical_bind,
 	.unbind_iommufd	= vfio_iommufd_physical_unbind,
 	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 };
 
 static const struct amba_id pl330_ids[] = {
diff --git a/drivers/vfio/platform/vfio_platform.c b/drivers/vfio/platform/vfio_platform.c
index 22a1efca32a8..8cf22fa65baa 100644
--- a/drivers/vfio/platform/vfio_platform.c
+++ b/drivers/vfio/platform/vfio_platform.c
@@ -108,6 +108,7 @@ static const struct vfio_device_ops vfio_platform_ops = {
 	.bind_iommufd	= vfio_iommufd_physical_bind,
 	.unbind_iommufd	= vfio_iommufd_physical_unbind,
 	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 };
 
 static struct platform_driver vfio_platform_driver = {
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index a66ca138059b..da5cc4c5c63f 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -258,7 +258,8 @@ static int __vfio_register_dev(struct vfio_device *device,
 	if (WARN_ON(IS_ENABLED(CONFIG_IOMMUFD) &&
 		    (!device->ops->bind_iommufd ||
 		     !device->ops->unbind_iommufd ||
-		     !device->ops->attach_ioas)))
+		     !device->ops->attach_ioas ||
+		     !device->ops->detach_ioas)))
 		return -EINVAL;
 
 	/*
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index fc14f8430a10..8982e48a30e2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -72,7 +72,9 @@ struct vfio_device {
  * @bind_iommufd: Called when binding the device to an iommufd
  * @unbind_iommufd: Opposite of bind_iommufd
  * @attach_ioas: Called when attaching device to an IOAS/HWPT managed by the
- *		 bound iommufd. Undo in unbind_iommufd.
+ *		 bound iommufd. Undo in unbind_iommufd if @detach_ioas is not
+ *		 called.
+ * @detach_ioas: Opposite of attach_ioas
  * @open_device: Called when the first file descriptor is opened for this device
  * @close_device: Opposite of open_device
  * @read: Perform read(2) on device file descriptor
@@ -96,6 +98,7 @@ struct vfio_device_ops {
 				struct iommufd_ctx *ictx, u32 *out_device_id);
 	void	(*unbind_iommufd)(struct vfio_device *vdev);
 	int	(*attach_ioas)(struct vfio_device *vdev, u32 *pt_id);
+	void	(*detach_ioas)(struct vfio_device *vdev);
 	int	(*open_device)(struct vfio_device *vdev);
 	void	(*close_device)(struct vfio_device *vdev);
 	ssize_t	(*read)(struct vfio_device *vdev, char __user *buf,
@@ -118,6 +121,7 @@ int vfio_iommufd_physical_bind(struct vfio_device *vdev,
 void vfio_iommufd_physical_unbind(struct vfio_device *vdev);
 struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev);
 int vfio_iommufd_physical_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
+void vfio_iommufd_physical_detach_ioas(struct vfio_device *vdev);
 int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
@@ -132,6 +136,8 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 	((struct iommufd_ctx * (*)(struct vfio_device *vdev)) NULL)
 #define vfio_iommufd_physical_attach_ioas \
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
+#define vfio_iommufd_physical_detach_ioas \
+	((void (*)(struct vfio_device *vdev)) NULL)
 #define vfio_iommufd_emulated_bind                                      \
 	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
 		  u32 *out_device_id)) NULL)
-- 
2.34.1



* [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated VFIO devices
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (17 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 18/24] vfio-iommufd: Add detach_ioas support for physical VFIO devices Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10 23:42   ` Nicolin Chen
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device Yi Liu
                   ` (6 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This prepares for adding the DETACH ioctl for emulated VFIO devices.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/gpu/drm/i915/gvt/kvmgt.c  |  1 +
 drivers/s390/cio/vfio_ccw_ops.c   |  1 +
 drivers/s390/crypto/vfio_ap_ops.c |  1 +
 drivers/vfio/iommufd.c            | 14 +++++++++++++-
 include/linux/vfio.h              |  3 +++
 5 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index de675d799c7d..9cd9e9da60dd 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1474,6 +1474,7 @@ static const struct vfio_device_ops intel_vgpu_dev_ops = {
 	.bind_iommufd	= vfio_iommufd_emulated_bind,
 	.unbind_iommufd = vfio_iommufd_emulated_unbind,
 	.attach_ioas	= vfio_iommufd_emulated_attach_ioas,
+	.detach_ioas	= vfio_iommufd_emulated_detach_ioas,
 };
 
 static int intel_vgpu_probe(struct mdev_device *mdev)
diff --git a/drivers/s390/cio/vfio_ccw_ops.c b/drivers/s390/cio/vfio_ccw_ops.c
index 5b53b94f13c7..cba4971618ff 100644
--- a/drivers/s390/cio/vfio_ccw_ops.c
+++ b/drivers/s390/cio/vfio_ccw_ops.c
@@ -632,6 +632,7 @@ static const struct vfio_device_ops vfio_ccw_dev_ops = {
 	.bind_iommufd = vfio_iommufd_emulated_bind,
 	.unbind_iommufd = vfio_iommufd_emulated_unbind,
 	.attach_ioas = vfio_iommufd_emulated_attach_ioas,
+	.detach_ioas = vfio_iommufd_emulated_detach_ioas,
 };
 
 struct mdev_driver vfio_ccw_mdev_driver = {
diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
index 72e10abb103a..9902e62e7a17 100644
--- a/drivers/s390/crypto/vfio_ap_ops.c
+++ b/drivers/s390/crypto/vfio_ap_ops.c
@@ -1844,6 +1844,7 @@ static const struct vfio_device_ops vfio_ap_matrix_dev_ops = {
 	.bind_iommufd = vfio_iommufd_emulated_bind,
 	.unbind_iommufd = vfio_iommufd_emulated_unbind,
 	.attach_ioas = vfio_iommufd_emulated_attach_ioas,
+	.detach_ioas = vfio_iommufd_emulated_detach_ioas,
 };
 
 static struct mdev_driver vfio_ap_matrix_driver = {
diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
index c06494e322f9..8a9457d0a33c 100644
--- a/drivers/vfio/iommufd.c
+++ b/drivers/vfio/iommufd.c
@@ -218,8 +218,20 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
 {
 	lockdep_assert_held(&vdev->dev_set->lock);
 
-	if (!vdev->iommufd_access)
+	if (WARN_ON(!vdev->iommufd_access))
 		return -ENOENT;
 	return iommufd_access_set_ioas(vdev->iommufd_access, *pt_id);
 }
 EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_attach_ioas);
+
+void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev)
+{
+	lockdep_assert_held(&vdev->dev_set->lock);
+
+	if (WARN_ON(!vdev->iommufd_access))
+		return;
+
+	iommufd_access_destroy(vdev->iommufd_access);
+	vdev->iommufd_access = NULL;
+}
+EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_detach_ioas);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 8982e48a30e2..5a1c5b6f1a63 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -126,6 +126,7 @@ int vfio_iommufd_emulated_bind(struct vfio_device *vdev,
 			       struct iommufd_ctx *ictx, u32 *out_device_id);
 void vfio_iommufd_emulated_unbind(struct vfio_device *vdev);
 int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
+void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev);
 #else
 #define vfio_iommufd_physical_bind                                      \
 	((int (*)(struct vfio_device *vdev, struct iommufd_ctx *ictx,   \
@@ -145,6 +146,8 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id);
 	((void (*)(struct vfio_device *vdev)) NULL)
 #define vfio_iommufd_emulated_attach_ioas \
 	((int (*)(struct vfio_device *vdev, u32 *pt_id)) NULL)
+#define vfio_iommufd_emulated_detach_ioas \
+	((void (*)(struct vfio_device *vdev)) NULL)
 #endif
 
 /**
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (18 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated " Yi Liu
@ 2023-03-08 13:28 ` Yi Liu
  2023-03-10  8:48   ` Tian, Kevin
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD Yi Liu
                   ` (5 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:28 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This allows the user to directly open a vfio device without using the
legacy container/group interface, as a prerequisite for supporting new
iommu features like nested translation.

The device fd opened in this manner cannot access the device, because the
fops open() does not open the device until a successful BIND_IOMMUFD,
which will be added in the next patch.

With this patch, devices registered to the vfio core have both a group
and a device interface created.

- group interface : /dev/vfio/$groupID
- device interface: /dev/vfio/devices/vfioX  (X is the minor number and
					      is unique across devices)

Given a vfio device the user can identify the matching vfioX by checking
the sysfs path of the device. Take PCI device (0000:6a:01.0) for example,
/sys/bus/pci/devices/0000\:6a\:01.0/vfio-dev/vfio0/dev contains the
major:minor of the matching vfioX.

Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
that the major:minor matches.
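
The lookup-and-verify step described above could be sketched from userspace
roughly as follows. This is only an illustration, not part of the patch: the
helper names parse_major_minor() and open_vfio_cdev_checked() are
hypothetical, and the paths in the comments are the examples from this
commit message.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

/* Hypothetical helper: parse a sysfs "major:minor" string such as "511:0\n". */
static int parse_major_minor(const char *s, unsigned int *maj, unsigned int *min)
{
	return sscanf(s, "%u:%u", maj, min) == 2 ? 0 : -1;
}

/*
 * Hypothetical helper: open a vfio device cdev and check it against the
 * sysfs "dev" attribute of the same device.
 *
 * sysfs_dev_path: e.g. /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/vfio0/dev
 * cdev_path:      e.g. /dev/vfio/devices/vfio0
 *
 * Returns an open fd on success, -1 on any failure.
 */
static int open_vfio_cdev_checked(const char *sysfs_dev_path,
				  const char *cdev_path)
{
	char buf[32];
	unsigned int maj, min;
	struct stat st;
	FILE *f;
	int fd;

	/* Read the expected major:minor from sysfs. */
	f = fopen(sysfs_dev_path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	if (parse_major_minor(buf, &maj, &min))
		return -1;

	fd = open(cdev_path, O_RDWR);
	if (fd < 0)
		return -1;

	/* The opened node must be a char device with the same major:minor. */
	if (fstat(fd, &st) || !S_ISCHR(st.st_mode) ||
	    major(st.st_rdev) != maj || minor(st.st_rdev) != min) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The fd returned here is not yet usable for device access; per this series
that requires a successful BIND_IOMMUFD first.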

The vfio_device cdev logic in this patch:
*) __vfio_register_dev() path ends up doing cdev_device_add() for each
   vfio_device if VFIO_DEVICE_CDEV is configured.
*) vfio_unregister_group_dev() path does cdev_device_del().

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/Kconfig       | 11 +++++++
 drivers/vfio/Makefile      |  1 +
 drivers/vfio/device_cdev.c | 62 ++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h        | 46 ++++++++++++++++++++++++++++
 drivers/vfio/vfio_main.c   | 34 ++++++++++++++++-----
 include/linux/vfio.h       |  4 +++
 6 files changed, 151 insertions(+), 7 deletions(-)
 create mode 100644 drivers/vfio/device_cdev.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 89e06c981e43..e2105b4dac2d 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -12,6 +12,17 @@ menuconfig VFIO
 	  If you don't know what to do here, say N.
 
 if VFIO
+config VFIO_DEVICE_CDEV
+	bool "Support for the VFIO cdev /dev/vfio/devices/vfioX"
+	depends on IOMMUFD
+	help
+	  The VFIO device cdev is another way for userspace to get device
+	  access. Userspace gets a device fd by opening the device cdev
+	  /dev/vfio/devices/vfioX, and then binds the device fd to an
+	  iommufd to set up a secure DMA context for device access.
+
+	  If you don't know what to do here, say N.
+
 config VFIO_CONTAINER
 	bool "Support for the VFIO container /dev/vfio/vfio"
 	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 70e7dcb302ef..245394aeb94b 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_VFIO) += vfio.o
 vfio-y += vfio_main.o \
 	  group.o \
 	  iova_bitmap.o
+vfio-$(CONFIG_VFIO_DEVICE_CDEV) += device_cdev.o
 vfio-$(CONFIG_IOMMUFD) += iommufd.o
 vfio-$(CONFIG_VFIO_CONTAINER) += container.o
 vfio-$(CONFIG_VFIO_VIRQFD) += virqfd.o
diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
new file mode 100644
index 000000000000..1c640016a824
--- /dev/null
+++ b/drivers/vfio/device_cdev.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023 Intel Corporation.
+ */
+#include <linux/vfio.h>
+
+#include "vfio.h"
+
+static dev_t device_devt;
+
+void vfio_init_device_cdev(struct vfio_device *device)
+{
+	device->device.devt = MKDEV(MAJOR(device_devt), device->index);
+	cdev_init(&device->cdev, &vfio_device_fops);
+	device->cdev.owner = THIS_MODULE;
+}
+
+/*
+ * device access via the fd opened by this function is blocked until
+ * .open_device() is called successfully during BIND_IOMMUFD.
+ */
+int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device = container_of(inode->i_cdev,
+						  struct vfio_device, cdev);
+	struct vfio_device_file *df;
+	int ret;
+
+	if (!vfio_device_try_get_registration(device))
+		return -ENODEV;
+
+	df = vfio_allocate_device_file(device);
+	if (IS_ERR(df)) {
+		ret = PTR_ERR(df);
+		goto err_put_registration;
+	}
+
+	filep->private_data = df;
+
+	return 0;
+
+err_put_registration:
+	vfio_device_put_registration(device);
+	return ret;
+}
+
+static char *vfio_device_devnode(const struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
+}
+
+int vfio_cdev_init(struct class *device_class)
+{
+	device_class->devnode = vfio_device_devnode;
+	return alloc_chrdev_region(&device_devt, 0,
+				   MINORMASK + 1, "vfio-dev");
+}
+
+void vfio_cdev_cleanup(void)
+{
+	unregister_chrdev_region(device_devt, MINORMASK + 1);
+}
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 98cee2f765e9..3f359f04b754 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -260,6 +260,52 @@ static inline int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
 }
 #endif
 
+#if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
+static inline int vfio_device_add(struct vfio_device *device)
+{
+	return cdev_device_add(&device->cdev, &device->device);
+}
+
+static inline void vfio_device_del(struct vfio_device *device)
+{
+	cdev_device_del(&device->cdev, &device->device);
+}
+
+void vfio_init_device_cdev(struct vfio_device *device);
+int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep);
+int vfio_cdev_init(struct class *device_class);
+void vfio_cdev_cleanup(void);
+#else
+static inline int vfio_device_add(struct vfio_device *device)
+{
+	return device_add(&device->device);
+}
+
+static inline void vfio_device_del(struct vfio_device *device)
+{
+	device_del(&device->device);
+}
+
+static inline void vfio_init_device_cdev(struct vfio_device *device)
+{
+}
+
+static inline int vfio_device_fops_cdev_open(struct inode *inode,
+					     struct file *filep)
+{
+	return 0;
+}
+
+static inline int vfio_cdev_init(struct class *device_class)
+{
+	return 0;
+}
+
+static inline void vfio_cdev_cleanup(void)
+{
+}
+#endif /* CONFIG_VFIO_DEVICE_CDEV */
+
 #if IS_ENABLED(CONFIG_VFIO_VIRQFD)
 int __init vfio_virqfd_init(void);
 void vfio_virqfd_exit(void);
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index da5cc4c5c63f..b0c2a7544524 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -242,6 +242,7 @@ static int vfio_init_device(struct vfio_device *device, struct device *dev,
 	device->device.release = vfio_device_release;
 	device->device.class = vfio.device_class;
 	device->device.parent = device->dev;
+	vfio_init_device_cdev(device);
 	return 0;
 
 out_uninit:
@@ -277,7 +278,7 @@ static int __vfio_register_dev(struct vfio_device *device,
 	if (ret)
 		return ret;
 
-	ret = device_add(&device->device);
+	ret = vfio_device_add(device);
 	if (ret)
 		goto err_out;
 
@@ -317,6 +318,20 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	bool interrupted = false;
 	long rc;
 
+	/*
+	 * Placing it before vfio_device_put_registration() to prevent
+	 * new registration refcount increment by VFIO_GROUP_GET_DEVICE_FD
+	 * during the unregister time.
+	 */
+	vfio_device_group_unregister(device);
+
+	/*
+	 * Balances vfio_device_add() in the register path. Placing it before
+	 * vfio_device_put_registration() to prevent new registration refcount
+	 * increment by the device cdev open during the unregister time.
+	 */
+	vfio_device_del(device);
+
 	vfio_device_put_registration(device);
 	rc = try_wait_for_completion(&device->comp);
 	while (rc <= 0) {
@@ -340,11 +355,6 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 		}
 	}
 
-	vfio_device_group_unregister(device);
-
-	/* Balances device_add in register path */
-	device_del(&device->device);
-
 	/* Balances vfio_device_set_group in register path */
 	vfio_device_remove_group(device);
 }
@@ -563,7 +573,8 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
-	vfio_device_group_close(df);
+	if (df->group)
+		vfio_device_group_close(df);
 
 	vfio_device_put_registration(device);
 
@@ -1212,6 +1223,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 
 const struct file_operations vfio_device_fops = {
 	.owner		= THIS_MODULE,
+	.open		= vfio_device_fops_cdev_open,
 	.release	= vfio_device_fops_release,
 	.read		= vfio_device_fops_read,
 	.write		= vfio_device_fops_write,
@@ -1603,9 +1615,16 @@ static int __init vfio_init(void)
 		goto err_dev_class;
 	}
 
+	ret = vfio_cdev_init(vfio.device_class);
+	if (ret)
+		goto err_alloc_dev_chrdev;
+
 	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
 	return 0;
 
+err_alloc_dev_chrdev:
+	class_destroy(vfio.device_class);
+	vfio.device_class = NULL;
 err_dev_class:
 	vfio_virqfd_exit();
 err_virqfd:
@@ -1616,6 +1635,7 @@ static int __init vfio_init(void)
 static void __exit vfio_cleanup(void)
 {
 	ida_destroy(&vfio.device_ida);
+	vfio_cdev_cleanup();
 	class_destroy(vfio.device_class);
 	vfio.device_class = NULL;
 	vfio_virqfd_exit();
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 5a1c5b6f1a63..34adfcb5b7bd 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -13,6 +13,7 @@
 #include <linux/mm.h>
 #include <linux/workqueue.h>
 #include <linux/poll.h>
+#include <linux/cdev.h>
 #include <uapi/linux/vfio.h>
 #include <linux/iova_bitmap.h>
 
@@ -51,6 +52,9 @@ struct vfio_device {
 	/* Members below here are private, not for driver use */
 	unsigned int index;
 	struct device device;	/* device.kref covers object life circle */
+#if IS_ENABLED(CONFIG_VFIO_DEVICE_CDEV)
+	struct cdev cdev;
+#endif
 	refcount_t refcount;	/* user count on registered device*/
 	unsigned int open_count;
 	struct completion comp;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (19 preceding siblings ...)
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device Yi Liu
@ 2023-03-08 13:29 ` Yi Liu
  2023-03-10  9:01   ` Tian, Kevin
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 22/24] vfio: Add VFIO_DEVICE_AT[DE]TACH_IOMMUFD_PT Yi Liu
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:29 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This adds an ioctl for userspace to bind a device cdev fd to an iommufd.

    VFIO_DEVICE_BIND_IOMMUFD: bind device to an iommufd, hence gain DMA
			      control provided by the iommufd. open_device
			      op is called after bind_iommufd op.
			      VFIO no iommu mode is indicated by passing
			      a negative iommufd value.
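
A userspace call to this ioctl might look like the sketch below. It is
illustrative only: the struct and the request number are local mirrors of
what this v6 patch proposes (the names bind_iommufd_v6, BIND_IOMMUFD_V6,
bind_args_init() and bind_device() are hypothetical), and the layout could
differ in a later revision or in a released uapi header.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Local mirror of struct vfio_device_bind_iommufd as proposed in this patch. */
struct bind_iommufd_v6 {
	uint32_t argsz;
	uint32_t flags;
	uint64_t dev_cookie __attribute__((aligned(8)));
	int32_t  iommufd;
	uint32_t out_devid;
};

/* _IO(VFIO_TYPE, VFIO_BASE + 19); VFIO_TYPE is ';' and VFIO_BASE is 100. */
#define BIND_IOMMUFD_V6 _IO(';', 100 + 19)

/* Fill the argument struct; flags must be zero or the kernel returns -EINVAL. */
static void bind_args_init(struct bind_iommufd_v6 *b, int iommufd,
			   uint64_t cookie)
{
	memset(b, 0, sizeof(*b));
	b->argsz = sizeof(*b);
	b->dev_cookie = cookie;
	b->iommufd = iommufd;	/* a negative value requests noiommu mode */
}

/* Returns 0 and stores the device id on success, -errno on failure. */
static int bind_device(int device_fd, int iommufd, uint64_t cookie,
		       uint32_t *out_devid)
{
	struct bind_iommufd_v6 b;

	bind_args_init(&b, iommufd, cookie);
	if (ioctl(device_fd, BIND_IOMMUFD_V6, &b))
		return -errno;
	*out_devid = b.out_devid;
	return 0;
}
```

Here device_fd would be the fd obtained from /dev/vfio/devices/vfioX and
iommufd an fd obtained by opening the iommufd character device.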

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/device_cdev.c | 166 +++++++++++++++++++++++++++++++++++++
 drivers/vfio/group.c       |  15 ++++
 drivers/vfio/vfio.h        |  15 ++++
 drivers/vfio/vfio_main.c   |  29 ++++++-
 include/uapi/linux/vfio.h  |  37 +++++++++
 5 files changed, 258 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index 1c640016a824..568cc9da16c7 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2023 Intel Corporation.
  */
 #include <linux/vfio.h>
+#include <linux/iommufd.h>
 
 #include "vfio.h"
 
@@ -44,6 +45,171 @@ int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep)
 	return ret;
 }
 
+static void vfio_device_get_kvm_safe(struct vfio_device_file *df)
+{
+	spin_lock(&df->kvm_ref_lock);
+	if (df->kvm)
+		_vfio_device_get_kvm_safe(df->device, df->kvm);
+	spin_unlock(&df->kvm_ref_lock);
+}
+
+void vfio_device_cdev_close(struct vfio_device_file *df)
+{
+	struct vfio_device *device = df->device;
+
+	/*
+	 * The df->access_granted writer also holds dev_set->lock, so
+	 * this read does not need smp_load_acquire() to pair with the
+	 * smp_store_release() in the caller of vfio_device_open().
+	 */
+	if (!df->access_granted)
+		return;
+
+	mutex_lock(&device->dev_set->lock);
+	vfio_device_close(df);
+	vfio_device_put_kvm(device);
+	if (df->iommufd)
+		iommufd_ctx_put(df->iommufd);
+	mutex_unlock(&device->dev_set->lock);
+	vfio_device_unblock_group(device);
+}
+
+static int vfio_device_cdev_probe_noiommu(struct vfio_device *device)
+{
+	struct iommu_group *iommu_group;
+	int ret = 0;
+
+	if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
+		return -EINVAL;
+
+	if (!capable(CAP_SYS_RAWIO))
+		return -EPERM;
+
+	iommu_group = iommu_group_get(device->dev);
+	if (!iommu_group)
+		return 0;
+
+	/*
+	 * We cannot support noiommu mode for devices that are protected
+	 * by IOMMU.  So check the iommu_group, if it is a no-iommu group
+	 * created by VFIO, we support. If not, we refuse.
+	 */
+	if (!vfio_group_find_noiommu_group_from_iommu(iommu_group))
+		ret = -EINVAL;
+	iommu_group_put(iommu_group);
+	return ret;
+}
+
+static struct iommufd_ctx *vfio_get_iommufd_from_fd(int fd)
+{
+	struct fd f;
+	struct iommufd_ctx *iommufd;
+
+	f = fdget(fd);
+	if (!f.file)
+		return ERR_PTR(-EBADF);
+
+	iommufd = iommufd_ctx_from_file(f.file);
+
+	fdput(f);
+	return iommufd;
+}
+
+long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
+				    struct vfio_device_bind_iommufd __user *arg)
+{
+	struct vfio_device *device = df->device;
+	struct vfio_device_bind_iommufd bind;
+	struct iommufd_ctx *iommufd = NULL;
+	unsigned long minsz;
+	int ret;
+
+	static_assert(__same_type(arg->out_devid, bind.out_devid));
+
+	minsz = offsetofend(struct vfio_device_bind_iommufd, out_devid);
+
+	if (copy_from_user(&bind, arg, minsz))
+		return -EFAULT;
+
+	if (bind.argsz < minsz || bind.flags)
+		return -EINVAL;
+
+	if (!device->ops->bind_iommufd)
+		return -ENODEV;
+
+	ret = vfio_device_block_group(device);
+	if (ret)
+		return ret;
+
+	mutex_lock(&device->dev_set->lock);
+	/* If already got access, should fail it. */
+	if (df->access_granted) {
+		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* iommufd < 0 means noiommu mode */
+	if (bind.iommufd < 0) {
+		ret = vfio_device_cdev_probe_noiommu(device);
+		if (ret)
+			goto out_unlock;
+	} else {
+		iommufd = vfio_get_iommufd_from_fd(bind.iommufd);
+		if (IS_ERR(iommufd)) {
+			ret = PTR_ERR(iommufd);
+			goto out_unlock;
+		}
+	}
+
+	/*
+	 * Before the device open, get the KVM pointer currently
+	 * associated with the device file (if there is one) and obtain
+	 * a reference.  This reference is held until the device is closed.
+	 * Save the pointer in the device for use by drivers.
+	 */
+	vfio_device_get_kvm_safe(df);
+
+	df->iommufd = iommufd;
+	ret = vfio_device_open(df);
+	if (ret)
+		goto out_put_kvm;
+
+	if (df->iommufd)
+		bind.out_devid = df->devid;
+	else
+		bind.out_devid = IOMMUFD_INVALID_ID;
+
+	ret = copy_to_user(&arg->out_devid, &bind.out_devid,
+			   sizeof(bind.out_devid)) ? -EFAULT : 0;
+	if (ret)
+		goto out_close_device;
+
+	if (bind.iommufd < 0)
+		dev_warn(device->dev, "device is bound to vfio-noiommu by user "
+			 "(%s:%d)\n", current->comm, task_pid_nr(current));
+
+	/*
+	 * Paired with smp_load_acquire() in vfio_device_fops::ioctl/
+	 * read/write/mmap
+	 */
+	smp_store_release(&df->access_granted, true);
+	mutex_unlock(&device->dev_set->lock);
+
+	return 0;
+
+out_close_device:
+	vfio_device_close(df);
+out_put_kvm:
+	df->iommufd = NULL;
+	vfio_device_put_kvm(device);
+	if (iommufd)
+		iommufd_ctx_put(iommufd);
+out_unlock:
+	mutex_unlock(&device->dev_set->lock);
+	vfio_device_unblock_group(device);
+	return ret;
+}
+
 static char *vfio_device_devnode(const struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
diff --git a/drivers/vfio/group.c b/drivers/vfio/group.c
index 51c027134814..fc49f2459b1a 100644
--- a/drivers/vfio/group.c
+++ b/drivers/vfio/group.c
@@ -701,6 +701,21 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 	return group;
 }
 
+struct vfio_group *
+vfio_group_find_noiommu_group_from_iommu(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group;
+	bool found = false;
+
+	mutex_lock(&vfio.group_lock);
+	group = vfio_group_find_from_iommu(iommu_group);
+	if (group && group->type == VFIO_NO_IOMMU)
+		found = true;
+	mutex_unlock(&vfio.group_lock);
+
+	return found ? group : NULL;
+}
+
 int vfio_device_set_group(struct vfio_device *device,
 			  enum vfio_group_type type)
 {
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 3f359f04b754..5df737b24102 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -91,6 +91,8 @@ struct vfio_group {
 
 int vfio_device_block_group(struct vfio_device *device);
 void vfio_device_unblock_group(struct vfio_device *device);
+struct vfio_group *
+vfio_group_find_noiommu_group_from_iommu(struct iommu_group *iommu_group);
 int vfio_device_set_group(struct vfio_device *device,
 			  enum vfio_group_type type);
 void vfio_device_remove_group(struct vfio_device *device);
@@ -273,6 +275,9 @@ static inline void vfio_device_del(struct vfio_device *device)
 
 void vfio_init_device_cdev(struct vfio_device *device);
 int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep);
+void vfio_device_cdev_close(struct vfio_device_file *df);
+long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
+				    struct vfio_device_bind_iommufd __user *arg);
 int vfio_cdev_init(struct class *device_class);
 void vfio_cdev_cleanup(void);
 #else
@@ -296,6 +301,16 @@ static inline int vfio_device_fops_cdev_open(struct inode *inode,
 	return 0;
 }
 
+static inline void vfio_device_cdev_close(struct vfio_device_file *df)
+{
+}
+
+static inline long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
+						  struct vfio_device_bind_iommufd __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int vfio_cdev_init(struct class *device_class)
 {
 	return 0;
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index b0c2a7544524..08bb1705d02d 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -575,6 +575,8 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 
 	if (df->group)
 		vfio_device_group_close(df);
+	else
+		vfio_device_cdev_close(df);
 
 	vfio_device_put_registration(device);
 
@@ -1148,7 +1150,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 	struct vfio_device *device = df->device;
 	int ret;
 
-	/* Paired with smp_store_release() in vfio_device_group_open() */
+	if (cmd == VFIO_DEVICE_BIND_IOMMUFD)
+		return vfio_device_ioctl_bind_iommufd(df, (void __user *)arg);
+
+	/*
+	 * Paired with smp_store_release() in the caller of
+	 * vfio_device_open(). e.g. vfio_device_group_open()
+	 * and vfio_device_ioctl_bind_iommufd()
+	 */
 	if (!smp_load_acquire(&df->access_granted))
 		return -EINVAL;
 
@@ -1179,7 +1188,11 @@ static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
-	/* Paired with smp_store_release() in vfio_device_group_open() */
+	/*
+	 * Paired with smp_store_release() in the caller of
+	 * vfio_device_open(). e.g. vfio_device_group_open()
+	 * and vfio_device_ioctl_bind_iommufd()
+	 */
 	if (!smp_load_acquire(&df->access_granted))
 		return -EINVAL;
 
@@ -1196,7 +1209,11 @@ static ssize_t vfio_device_fops_write(struct file *filep,
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
-	/* Paired with smp_store_release() in vfio_device_group_open() */
+	/*
+	 * Paired with smp_store_release() in the caller of
+	 * vfio_device_open(). e.g. vfio_device_group_open()
+	 * and vfio_device_ioctl_bind_iommufd()
+	 */
 	if (!smp_load_acquire(&df->access_granted))
 		return -EINVAL;
 
@@ -1211,7 +1228,11 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	struct vfio_device_file *df = filep->private_data;
 	struct vfio_device *device = df->device;
 
-	/* Paired with smp_store_release() in vfio_device_group_open() */
+	/*
+	 * Paired with smp_store_release() in the caller of
+	 * vfio_device_open(). e.g. vfio_device_group_open()
+	 * and vfio_device_ioctl_bind_iommufd()
+	 */
 	if (!smp_load_acquire(&df->access_granted))
 		return -EINVAL;
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 382d95455f89..a53afe349a34 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -194,6 +194,43 @@ struct vfio_group_status {
 
 /* --------------- IOCTLs for DEVICE file descriptors --------------- */
 
+/*
+ * VFIO_DEVICE_BIND_IOMMUFD - _IOR(VFIO_TYPE, VFIO_BASE + 19,
+ *				   struct vfio_device_bind_iommufd)
+ *
+ * Bind a vfio_device to the specified iommufd.
+ *
+ * The user should provide a device cookie when calling this ioctl. The
+ * cookie is carried only in event e.g. I/O fault reported to userspace
+ * via iommufd. The user should use devid returned by this ioctl to mark
+ * the target device in other ioctls (e.g. iommu hardware information query
+ * via iommufd, etc.).
+ *
+ * User is not allowed to access the device before the binding operation
+ * is completed.
+ *
+ * Unbind is automatically conducted when device fd is closed.
+ *
+ * @argsz:	 user filled size of this data.
+ * @flags:	 reserved for future extension.
+ * @dev_cookie:	 a per device cookie provided by userspace.
+ * @iommufd:	 iommufd to bind. a negative value means noiommu.
+ * @out_devid:	 the device id generated by this bind. This field is valid
+ *		as long as the input @iommufd is valid. Otherwise, it is
+ *		meaningless.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_bind_iommufd {
+	__u32		argsz;
+	__u32		flags;
+	__aligned_u64	dev_cookie;
+	__s32		iommufd;
+	__u32		out_devid;
+};
+
+#define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [Intel-gfx] [PATCH v6 22/24] vfio: Add VFIO_DEVICE_AT[DE]TACH_IOMMUFD_PT
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (20 preceding siblings ...)
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD Yi Liu
@ 2023-03-08 13:29 ` Yi Liu
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 23/24] vfio: Compile group optionally Yi Liu
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:29 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

This adds ioctls for userspace to attach a device cdev fd to, and detach
it from, an IOAS/hw_pagetable managed by iommufd.

    VFIO_DEVICE_ATTACH_IOMMUFD_PT: attach vfio device to IOAS, hw_pagetable
				   managed by iommufd. Attach can be
				   undone by VFIO_DEVICE_DETACH_IOMMUFD_PT
				   or device fd close.
    VFIO_DEVICE_DETACH_IOMMUFD_PT: detach vfio device from the current attached
				   IOAS or hw_pagetable managed by iommufd.
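
An attach call from userspace might be sketched as below. This is a rough
illustration under stated assumptions: the struct fields (argsz, flags,
pt_id) are taken from how vfio_ioctl_device_attach() in this patch uses
them, their widths are assumed to be 32 bits, and the request number assumes
the same _IO() encoding as the BIND define at VFIO_BASE + 20. The names
attach_pt_v6, ATTACH_IOMMUFD_PT_V6 and attach_device() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Assumed local mirror of struct vfio_device_attach_iommufd_pt. */
struct attach_pt_v6 {
	uint32_t argsz;
	uint32_t flags;
	uint32_t pt_id;
};

/* Assumed encoding: _IO(VFIO_TYPE, VFIO_BASE + 20), with ';' and 100. */
#define ATTACH_IOMMUFD_PT_V6 _IO(';', 100 + 20)

/*
 * Attach the bound device to the IOAS or hwpt named by *pt_id.
 * The kernel copies pt_id back on success, so it is treated as in/out.
 * Returns 0 on success, -errno on failure.
 */
static int attach_device(int device_fd, uint32_t *pt_id)
{
	struct attach_pt_v6 a = {
		.argsz = sizeof(a),
		.flags = 0,	/* must be zero, otherwise -EINVAL */
		.pt_id = *pt_id,
	};

	if (ioctl(device_fd, ATTACH_IOMMUFD_PT_V6, &a))
		return -errno;
	*pt_id = a.pt_id;
	return 0;
}
```

Per the handler in this patch, the call is expected to fail with -EOPNOTSUPP
on a device bound in noiommu mode and with -EINVAL on a group-opened fd.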

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
---
 drivers/vfio/device_cdev.c | 85 ++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio.h        | 16 +++++++
 drivers/vfio/vfio_main.c   |  8 ++++
 include/uapi/linux/vfio.h  | 52 +++++++++++++++++++++++
 4 files changed, 161 insertions(+)

diff --git a/drivers/vfio/device_cdev.c b/drivers/vfio/device_cdev.c
index 568cc9da16c7..77e0f34f6bc7 100644
--- a/drivers/vfio/device_cdev.c
+++ b/drivers/vfio/device_cdev.c
@@ -210,6 +210,91 @@ long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
 	return ret;
 }
 
+int vfio_ioctl_device_attach(struct vfio_device_file *df,
+			     struct vfio_device_attach_iommufd_pt __user *arg)
+{
+	struct vfio_device *device = df->device;
+	struct vfio_device_attach_iommufd_pt attach;
+	unsigned long minsz;
+	int ret;
+
+	static_assert(__same_type(arg->pt_id, attach.pt_id));
+
+	minsz = offsetofend(struct vfio_device_attach_iommufd_pt, pt_id);
+
+	if (copy_from_user(&attach, arg, minsz))
+		return -EFAULT;
+
+	if (attach.argsz < minsz || attach.flags)
+		return -EINVAL;
+
+	if (!device->ops->bind_iommufd)
+		return -ENODEV;
+
+	/* ATTACH only allowed for cdev fds */
+	if (df->group)
+		return -EINVAL;
+
+	mutex_lock(&device->dev_set->lock);
+	/* noiommufd mode doesn't allow attach */
+	if (!df->iommufd) {
+		ret = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	ret = device->ops->attach_ioas(device, &attach.pt_id);
+	if (ret)
+		goto out_unlock;
+
+	ret = copy_to_user(&arg->pt_id, &attach.pt_id,
+			   sizeof(attach.pt_id)) ? -EFAULT : 0;
+	if (ret)
+		goto out_detach;
+	mutex_unlock(&device->dev_set->lock);
+
+	return 0;
+
+out_detach:
+	device->ops->detach_ioas(device);
+out_unlock:
+	mutex_unlock(&device->dev_set->lock);
+	return ret;
+}
+
+int vfio_ioctl_device_detach(struct vfio_device_file *df,
+			     struct vfio_device_detach_iommufd_pt __user *arg)
+{
+	struct vfio_device *device = df->device;
+	struct vfio_device_detach_iommufd_pt detach;
+	unsigned long minsz;
+
+	minsz = offsetofend(struct vfio_device_detach_iommufd_pt, flags);
+
+	if (copy_from_user(&detach, arg, minsz))
+		return -EFAULT;
+
+	if (detach.argsz < minsz || detach.flags)
+		return -EINVAL;
+
+	if (!device->ops->bind_iommufd)
+		return -ENODEV;
+
+	/* DETACH only allowed for cdev fds */
+	if (df->group)
+		return -EINVAL;
+
+	mutex_lock(&device->dev_set->lock);
+	/* noiommufd mode doesn't support detach */
+	if (!df->iommufd) {
+		mutex_unlock(&device->dev_set->lock);
+		return -EOPNOTSUPP;
+	}
+	device->ops->detach_ioas(device);
+	mutex_unlock(&device->dev_set->lock);
+
+	return 0;
+}
+
 static char *vfio_device_devnode(const struct device *dev, umode_t *mode)
 {
 	return kasprintf(GFP_KERNEL, "vfio/devices/%s", dev_name(dev));
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 5df737b24102..8b70e2af4ece 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -278,6 +278,10 @@ int vfio_device_fops_cdev_open(struct inode *inode, struct file *filep);
 void vfio_device_cdev_close(struct vfio_device_file *df);
 long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
 				    struct vfio_device_bind_iommufd __user *arg);
+int vfio_ioctl_device_attach(struct vfio_device_file *df,
+			     struct vfio_device_attach_iommufd_pt __user *arg);
+int vfio_ioctl_device_detach(struct vfio_device_file *df,
+			     struct vfio_device_detach_iommufd_pt __user *arg);
 int vfio_cdev_init(struct class *device_class);
 void vfio_cdev_cleanup(void);
 #else
@@ -311,6 +315,18 @@ static inline long vfio_device_ioctl_bind_iommufd(struct vfio_device_file *df,
 	return -EOPNOTSUPP;
 }
 
+static inline int vfio_ioctl_device_attach(struct vfio_device_file *df,
+					   struct vfio_device_attach_iommufd_pt __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int vfio_ioctl_device_detach(struct vfio_device_file *df,
+					   struct vfio_device_detach_iommufd_pt __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int vfio_cdev_init(struct class *device_class)
 {
 	return 0;
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 08bb1705d02d..e5f4738692ac 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1170,6 +1170,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 		ret = vfio_ioctl_device_feature(device, (void __user *)arg);
 		break;
 
+	case VFIO_DEVICE_ATTACH_IOMMUFD_PT:
+		ret = vfio_ioctl_device_attach(df, (void __user *)arg);
+		break;
+
+	case VFIO_DEVICE_DETACH_IOMMUFD_PT:
+		ret = vfio_ioctl_device_detach(df, (void __user *)arg);
+		break;
+
 	default:
 		if (unlikely(!device->ops->ioctl))
 			ret = -EINVAL;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index a53afe349a34..692156a708bb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -231,6 +231,58 @@ struct vfio_device_bind_iommufd {
 
 #define VFIO_DEVICE_BIND_IOMMUFD	_IO(VFIO_TYPE, VFIO_BASE + 19)
 
+/*
+ * VFIO_DEVICE_ATTACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 20,
+ *					struct vfio_device_attach_iommufd_pt)
+ *
+ * Attach a vfio device to an iommufd address space specified by IOAS
+ * id or hw_pagetable (hwpt) id.
+ *
+ * Available only after a device has been bound to iommufd via
+ * VFIO_DEVICE_BIND_IOMMUFD.
+ *
+ * Undo by VFIO_DEVICE_DETACH_IOMMUFD_PT or device fd close.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	must be 0.
+ * @pt_id:	Input the target id which can represent an ioas or a hwpt
+ *		allocated via iommufd subsystem.
+ *		Output the attached hwpt id which could be the specified
+ *		hwpt itself or a hwpt automatically created for the
+ *		specified ioas by kernel during the attachment.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_attach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+	__u32	pt_id;
+};
+
+#define VFIO_DEVICE_ATTACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 20)
+
+/*
+ * VFIO_DEVICE_DETACH_IOMMUFD_PT - _IOW(VFIO_TYPE, VFIO_BASE + 21,
+ *					struct vfio_device_detach_iommufd_pt)
+ *
+ * Detach a vfio device from the iommufd address space it has been
+ * attached to. After detach, the device should be in a blocking DMA state.
+ *
+ * Available only after a device has been bound to iommufd via
+ * VFIO_DEVICE_BIND_IOMMUFD.
+ *
+ * @argsz:	user filled size of this data.
+ * @flags:	must be 0.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_detach_iommufd_pt {
+	__u32	argsz;
+	__u32	flags;
+};
+
+#define VFIO_DEVICE_DETACH_IOMMUFD_PT		_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /**
  * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
  *						struct vfio_device_info)
-- 
2.34.1
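
For readers following the argument handling in vfio_ioctl_device_detach(),
the argsz/flags checks can be modelled in plain userspace C. This is only a
sketch under stated assumptions: struct detach_pt_model is a local copy of
the layout proposed in this patch, OFFSETOFEND() is a userspace stand-in for
the kernel's offsetofend(), and detach_check_args() is a hypothetical helper
in which a short buffer models a failing copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the layout proposed in this patch (not a uapi header). */
struct detach_pt_model {
	uint32_t argsz;
	uint32_t flags;
};

/* Userspace stand-in for the kernel's offsetofend() helper. */
#define OFFSETOFEND(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/*
 * Hypothetical helper mirroring the checks at the top of
 * vfio_ioctl_device_detach(). 'buf' and 'len' stand in for user memory;
 * a buffer shorter than minsz models copy_from_user() failing.
 */
static int detach_check_args(const void *buf, size_t len)
{
	struct detach_pt_model detach;
	size_t minsz = OFFSETOFEND(struct detach_pt_model, flags);

	assert(buf != NULL);

	if (len < minsz)
		return -EFAULT;
	memcpy(&detach, buf, minsz);

	/* argsz must cover the known fields and no flags are defined yet. */
	if (detach.argsz < minsz || detach.flags)
		return -EINVAL;

	return 0;
}
```

The same pattern (copy only minsz bytes, then reject a short argsz or any
unknown flag) is what lets the structure grow in later revisions without
breaking older userspace.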



* [Intel-gfx] [PATCH v6 23/24] vfio: Compile group optionally
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (21 preceding siblings ...)
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 22/24] vfio: Add VFIO_DEVICE_AT[DE]TACH_IOMMUFD_PT Yi Liu
@ 2023-03-08 13:29 ` Yi Liu
  2023-03-10  9:03   ` Tian, Kevin
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 24/24] docs: vfio: Add vfio device cdev description Yi Liu
                   ` (2 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:29 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

Group code is not needed for the vfio device cdev, so with the vfio device
cdev introduced, the group infrastructure can be compiled out when only
cdev is needed.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/Kconfig  | 16 +++++++-
 drivers/vfio/Makefile |  2 +-
 drivers/vfio/vfio.h   | 94 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h  | 18 ++++++++-
 4 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index e2105b4dac2d..0942a19601a2 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -4,7 +4,9 @@ menuconfig VFIO
 	select IOMMU_API
 	depends on IOMMUFD || !IOMMUFD
 	select INTERVAL_TREE
-	select VFIO_CONTAINER if IOMMUFD=n
+	select VFIO_GROUP if SPAPR_TCE_IOMMU || !IOMMUFD
+	select VFIO_DEVICE_CDEV if !VFIO_GROUP
+	select VFIO_CONTAINER if IOMMUFD=n && VFIO_GROUP
 	help
 	  VFIO provides a framework for secure userspace device drivers.
 	  See Documentation/driver-api/vfio.rst for more details.
@@ -15,6 +17,7 @@ if VFIO
 config VFIO_DEVICE_CDEV
 	bool "Support for the VFIO cdev /dev/vfio/devices/vfioX"
 	depends on IOMMUFD
+	default !VFIO_GROUP
 	help
 	  The VFIO device cdev is another way for userspace to get device
 	  access. Userspace gets device fd by opening device cdev under
@@ -23,9 +26,20 @@ config VFIO_DEVICE_CDEV
 
 	  If you don't know what to do here, say N.
 
+config VFIO_GROUP
+	bool "Support for the VFIO group /dev/vfio/$group_id"
+	default y
+	help
+	   VFIO group support provides the traditional model for accessing
+	   devices through VFIO and is used by the majority of userspace
+	   applications and drivers making use of VFIO.
+
+	   If you don't know what to do here, say Y.
+
 config VFIO_CONTAINER
 	bool "Support for the VFIO container /dev/vfio/vfio"
 	select VFIO_IOMMU_TYPE1 if MMU && (X86 || S390 || ARM || ARM64)
+	depends on VFIO_GROUP
 	default y
 	help
 	  The VFIO container is the classic interface to VFIO for establishing
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 245394aeb94b..57c3515af606 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -2,9 +2,9 @@
 obj-$(CONFIG_VFIO) += vfio.o
 
 vfio-y += vfio_main.o \
-	  group.o \
 	  iova_bitmap.o
 vfio-$(CONFIG_VFIO_DEVICE_CDEV) += device_cdev.o
+vfio-$(CONFIG_VFIO_GROUP) += group.o
 vfio-$(CONFIG_IOMMUFD) += iommufd.o
 vfio-$(CONFIG_VFIO_CONTAINER) += container.o
 vfio-$(CONFIG_VFIO_VIRQFD) += virqfd.o
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 8b70e2af4ece..d5dfacf11265 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -60,6 +60,7 @@ enum vfio_group_type {
 	VFIO_NO_IOMMU,
 };
 
+#if IS_ENABLED(CONFIG_VFIO_GROUP)
 struct vfio_group {
 	struct device 			dev;
 	struct cdev			cdev;
@@ -115,6 +116,99 @@ static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
 	return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
 	       vdev->group->type == VFIO_NO_IOMMU;
 }
+#else
+struct vfio_group;
+
+static inline int vfio_device_block_group(struct vfio_device *device)
+{
+	return 0;
+}
+
+static inline void vfio_device_unblock_group(struct vfio_device *device)
+{
+}
+
+static inline struct vfio_group *
+vfio_group_find_noiommu_group_from_iommu(struct iommu_group *iommu_group)
+{
+	return NULL;
+}
+
+static inline int vfio_device_set_group(struct vfio_device *device,
+					enum vfio_group_type type)
+{
+	return 0;
+}
+
+static inline void vfio_device_remove_group(struct vfio_device *device)
+{
+}
+
+static inline void vfio_device_group_register(struct vfio_device *device)
+{
+}
+
+static inline void vfio_device_group_unregister(struct vfio_device *device)
+{
+}
+
+static inline bool vfio_device_group_uses_container(struct vfio_device_file *df)
+{
+	return false;
+}
+
+static inline int vfio_device_group_use_iommu(struct vfio_device *device)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void vfio_device_group_unuse_iommu(struct vfio_device *device)
+{
+}
+
+static inline void vfio_device_group_close(struct vfio_device_file *df)
+{
+}
+
+static inline struct vfio_group *vfio_group_from_file(struct file *file)
+{
+	return NULL;
+}
+
+static inline bool vfio_group_enforced_coherent(struct vfio_group *group)
+{
+	return true;
+}
+
+static inline void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
+{
+}
+
+static inline bool vfio_group_has_dev(struct vfio_group *group,
+				      struct vfio_device *device)
+{
+	return false;
+}
+
+static inline bool vfio_device_has_container(struct vfio_device *device)
+{
+	return false;
+}
+
+static inline int __init vfio_group_init(void)
+{
+	return 0;
+}
+
+static inline void vfio_group_cleanup(void)
+{
+}
+
+static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
+{
+	return false;
+}
+#endif /* CONFIG_VFIO_GROUP */
 
 #if IS_ENABLED(CONFIG_VFIO_CONTAINER)
 /**
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 34adfcb5b7bd..e7fc1de35acf 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -43,7 +43,11 @@ struct vfio_device {
 	 */
 	const struct vfio_migration_ops *mig_ops;
 	const struct vfio_log_ops *log_ops;
+#if IS_ENABLED(CONFIG_VFIO_GROUP)
 	struct vfio_group *group;
+	struct list_head group_next;
+	struct list_head iommu_entry;
+#endif
 	struct vfio_device_set *dev_set;
 	struct list_head dev_set_list;
 	unsigned int migration_flags;
@@ -58,8 +62,6 @@ struct vfio_device {
 	refcount_t refcount;	/* user count on registered device*/
 	unsigned int open_count;
 	struct completion comp;
-	struct list_head group_next;
-	struct list_head iommu_entry;
 	struct iommufd_access *iommufd_access;
 	void (*put_kvm)(struct kvm *kvm);
 #if IS_ENABLED(CONFIG_IOMMUFD)
@@ -259,8 +261,20 @@ int vfio_mig_get_next_state(struct vfio_device *device,
 /*
  * External user API
  */
+#if IS_ENABLED(CONFIG_VFIO_GROUP)
 struct iommu_group *vfio_file_iommu_group(struct file *file);
 bool vfio_file_is_group(struct file *file);
+#else
+static inline struct iommu_group *vfio_file_iommu_group(struct file *file)
+{
+	return NULL;
+}
+
+static inline bool vfio_file_is_group(struct file *file)
+{
+	return false;
+}
+#endif
 bool vfio_file_is_valid(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
-- 
2.34.1
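
The stub pattern used in drivers/vfio/vfio.h above can be illustrated
outside the kernel. The following is a minimal sketch, not the patch's code:
MODEL_VFIO_GROUP stands in for CONFIG_VFIO_GROUP, and every model_* name is
hypothetical. It shows how callers stay free of #ifdefs because the
disabled configuration still provides inline functions with the same
signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Define MODEL_VFIO_GROUP to model CONFIG_VFIO_GROUP=y; leaving it
 * undefined models a cdev-only build. */
/* #define MODEL_VFIO_GROUP 1 */

enum model_group_type { MODEL_IOMMU, MODEL_NO_IOMMU };

struct model_group;

struct model_device {
#ifdef MODEL_VFIO_GROUP
	struct model_group *group;	/* only present with group support */
#endif
	unsigned int open_count;
};

#ifdef MODEL_VFIO_GROUP
struct model_group {
	enum model_group_type type;
};

static inline bool model_device_is_noiommu(const struct model_device *vdev)
{
	return vdev->group->type == MODEL_NO_IOMMU;
}

static inline int model_group_init(void)
{
	return 0;	/* real group setup would happen here */
}
#else
/* Stubs: same signatures, fixed answers, no reference to the group. */
static inline bool model_device_is_noiommu(const struct model_device *vdev)
{
	(void)vdev;
	return false;
}

static inline int model_group_init(void)
{
	return 0;
}
#endif

/* Caller code is identical in both configurations. */
static int model_core_init(struct model_device *vdev)
{
	int ret;

	assert(vdev != NULL);

	ret = model_group_init();
	if (ret)
		return ret;

	return model_device_is_noiommu(vdev) ? 1 : 0;
}
```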



* [Intel-gfx] [PATCH v6 24/24] docs: vfio: Add vfio device cdev description
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (22 preceding siblings ...)
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 23/24] vfio: Compile group optionally Yi Liu
@ 2023-03-08 13:29 ` Yi Liu
  2023-03-14 11:02 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support Patchwork
  2023-03-24 10:39 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support (rev2) Patchwork
  25 siblings, 0 replies; 103+ messages in thread
From: Yi Liu @ 2023-03-08 13:29 UTC (permalink / raw)
  To: alex.williamson, jgg, kevin.tian
  Cc: linux-s390, yi.l.liu, yi.y.sun, mjrosato, kvm, intel-gvt-dev,
	joro, cohuck, xudong.hao, peterx, yan.y.zhao, eric.auger,
	terrence.xu, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, intel-gfx, chao.p.peng, lulu,
	robin.murphy, jasowang

Add notes for userspace applications on how to use the device cdev.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 Documentation/driver-api/vfio.rst | 125 ++++++++++++++++++++++++++++++
 1 file changed, 125 insertions(+)

diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
index 44527420f20d..227940e5224f 100644
--- a/Documentation/driver-api/vfio.rst
+++ b/Documentation/driver-api/vfio.rst
@@ -239,6 +239,123 @@ group and can access them as follows::
 	/* Gratuitous device reset and go... */
 	ioctl(device, VFIO_DEVICE_RESET);
 
+IOMMUFD and vfio_iommu_type1
+----------------------------
+
+IOMMUFD is the new user API to manage I/O page tables from userspace.
+It is intended to be the portal for delivering advanced userspace DMA
+features (nested translation [5], PASID [6], etc.) while remaining
+backward compatible with the vfio_iommu_type1 driver. Eventually
+vfio_iommu_type1 will be deprecated.
+
+With the backward compatibility, no change is required for legacy VFIO
+drivers or applications to connect a VFIO device to IOMMUFD.
+
+	When CONFIG_IOMMUFD_VFIO_CONTAINER=n, VFIO container still provides
+	/dev/vfio/vfio which connects to vfio_iommu_type1. To disable VFIO
+	container and vfio_iommu_type1, the administrator could symlink
+	/dev/vfio/vfio to /dev/iommu to enable VFIO container emulation
+	in IOMMUFD.
+
+	When CONFIG_IOMMUFD_VFIO_CONTAINER=y, IOMMUFD directly provides
+	/dev/vfio/vfio while the VFIO container and vfio_iommu_type1 are
+	explicitly disabled.
+
+VFIO Device cdev
+----------------
+
+Traditionally the user acquires a device fd via VFIO_GROUP_GET_DEVICE_FD
+in a VFIO group.
+
+With CONFIG_VFIO_DEVICE_CDEV=y the user can now acquire a device fd
+by directly opening a character device /dev/vfio/devices/vfioX where
+"X" is the number allocated uniquely by VFIO for registered devices.
+
+The cdev only works with IOMMUFD. Both VFIO drivers and applications
+must adapt to the new cdev security model which requires using
+VFIO_DEVICE_BIND_IOMMUFD to claim DMA ownership before starting to
+actually use the device. Once bind succeeds then a VFIO device can
+be fully accessed by the user.
+
+VFIO device cdev doesn't rely on VFIO group/container/iommu drivers.
+Hence those modules can be fully compiled out in an environment
+where no legacy VFIO application exists.
+
+SPAPR does not support IOMMUFD yet, so it cannot support the device
+cdev either.
+
+Device cdev Example
+-------------------
+
+Assume the user wants to access PCI device 0000:6a:01.0::
+
+	$ ls /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/
+	vfio0
+
+This device is therefore represented as vfio0. The user can verify
+its existence::
+
+	$ ls -l /dev/vfio/devices/vfio0
+	crw------- 1 root root 511, 0 Feb 16 01:22 /dev/vfio/devices/vfio0
+	$ cat /sys/bus/pci/devices/0000:6a:01.0/vfio-dev/vfio0/dev
+	511:0
+	$ ls -l /dev/char/511\:0
+	lrwxrwxrwx 1 root root 21 Feb 16 01:22 /dev/char/511:0 -> ../vfio/devices/vfio0
+
+Then provide the user with access to the device if unprivileged
+operation is desired::
+
+	$ chown user:user /dev/vfio/devices/vfio0
+
+Finally the user can get the cdev fd with::
+
+	cdev_fd = open("/dev/vfio/devices/vfio0", O_RDWR);
+
+An opened cdev_fd doesn't give the user any permission to access the
+device other than binding the cdev_fd to an iommufd. After that point
+the device is fully accessible, including attaching it to an
+IOMMUFD IOAS/HWPT to enable userspace DMA::
+
+	struct vfio_device_bind_iommufd bind = {
+		.argsz = sizeof(bind),
+		.flags = 0,
+	};
+	struct iommu_ioas_alloc alloc_data  = {
+		.size = sizeof(alloc_data),
+		.flags = 0,
+	};
+	struct vfio_device_attach_iommufd_pt attach_data = {
+		.argsz = sizeof(attach_data),
+		.flags = 0,
+	};
+	struct iommu_ioas_map map = {
+		.size = sizeof(map),
+		.flags = IOMMU_IOAS_MAP_READABLE |
+			 IOMMU_IOAS_MAP_WRITEABLE |
+			 IOMMU_IOAS_MAP_FIXED_IOVA,
+		.__reserved = 0,
+	};
+
+	iommufd = open("/dev/iommu", O_RDWR);
+
+	bind.iommufd = iommufd; // negative iommufd means vfio-noiommu mode
+	ioctl(cdev_fd, VFIO_DEVICE_BIND_IOMMUFD, &bind);
+
+	ioctl(iommufd, IOMMU_IOAS_ALLOC, &alloc_data);
+	attach_data.pt_id = alloc_data.out_ioas_id;
+	ioctl(cdev_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data);
+
+	/* Allocate some space and setup a DMA mapping */
+	map.user_va = (int64_t)mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+				    MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	map.iova = 0; /* 1MB starting at 0x0 from device view */
+	map.length = 1024 * 1024;
+	map.ioas_id = alloc_data.out_ioas_id;
+
+	ioctl(iommufd, IOMMU_IOAS_MAP, &map);
+
+	/* Other device operations as stated in "VFIO Usage Example" */
+
 VFIO User API
 -------------------------------------------------------------------------------
 
@@ -566,3 +683,11 @@ This implementation has some specifics:
 				\-0d.1
 
 	00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
+
+.. [5] Nested translation is an IOMMU feature which supports two stage
+   address translations. This improves the address translation efficiency
+   in IOMMU virtualization.
+
+.. [6] PASID stands for Process Address Space ID, introduced by PCI
+   Express. It is a prerequisite for Shared Virtual Addressing (SVA)
+   and Scalable I/O Virtualization (Scalable IOV).
-- 
2.34.1



* Re: [Intel-gfx] [PATCH v6 07/24] vfio: Block device access via device fd until device is opened
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 07/24] vfio: Block device access via device fd until device is opened Yi Liu
@ 2023-03-10  4:50   ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  4:50 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
>
> @@ -1114,6 +1114,10 @@ static long vfio_device_fops_unl_ioctl(struct file
> *filep,
>  	struct vfio_device *device = df->device;
>  	int ret;
> 
> +	/* Paired with smp_store_release() in vfio_device_group_open() */
> +	if (!smp_load_acquire(&df->access_granted))
> +		return -EINVAL;
> +

/* Paired with smp_store_release() right after vfio_device_open() is called */
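
The pairing that this comment describes can be sketched in userspace with
C11 atomics. This is an analogy, not the kernel code: memory_order_release
and memory_order_acquire stand in for smp_store_release() and
smp_load_acquire(), and struct df_model with its device_state field is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct df_model {
	int device_state;		/* plain data set up by the open path */
	atomic_bool access_granted;	/* published flag */
};

/* Open path: finish all setup first, then publish with release order. */
static void df_model_open(struct df_model *df)
{
	assert(df != NULL);

	df->device_state = 42;	/* stands in for state set up during open */
	atomic_store_explicit(&df->access_granted, true,
			      memory_order_release);
}

/*
 * ioctl path: the acquire load pairs with the release store above, so a
 * reader that observes access_granted == true also observes device_state.
 */
static int df_model_ioctl(struct df_model *df, int *state_out)
{
	assert(df != NULL && state_out != NULL);

	if (!atomic_load_explicit(&df->access_granted, memory_order_acquire))
		return -EINVAL;

	*state_out = df->device_state;
	return 0;
}
```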


* Re: [Intel-gfx] [PATCH v6 09/24] vfio/pci: Only need to check opened devices in the dev_set for hot reset
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 09/24] vfio/pci: Only need to check opened devices in the dev_set for hot reset Yi Liu
@ 2023-03-10  5:00   ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  5:00 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
>
> @@ -2429,10 +2429,23 @@ static int vfio_pci_dev_set_hot_reset(struct
> vfio_device_set *dev_set,
> 
>  	list_for_each_entry(cur_vma, &dev_set->device_list,
> vdev.dev_set_list) {
>  		/*
> -		 * Test whether all the affected devices are contained by the
> +		 * Test whether all the affected devices can be reset by the
> +		 * user.  The affected devices may already been opened or
> not
> +		 * yet.
> +		 *
> +		 * For the devices not opened yet, user can reset them as it
> +		 * reason is that the hot reset is done under the protection
> +		 * of the dev_set->lock, and device open is also under this
> +		 * lock.  During the hot reset, such devices can not be opened
> +		 * by other users.
> +		 *
> +		 * For the devices that have been opened, needs to check the
> +		 * ownership.  If the user provides a set of group fds, test
> +		 * whether all the opened affected devices are contained by
> the
>  		 * set of groups provided by the user.
>  		 */

		 * Test whether all the affected devices can be reset by the
		 * user.
		 *
		 * Resetting an unused device (not opened) is safe, because
		 * dev_set->lock is held in hot reset path so this device
		 * cannot race being opened by another user simultaneously.
		 *
		 * Otherwise all opened devices in the dev_set must be
		 * contained by the set of groups provided by the user.

the rest looks good:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


* Re: [Intel-gfx] [PATCH v6 10/24] vfio/pci: Rename the helpers and data in hot reset path to accept device fd
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 10/24] vfio/pci: Rename the helpers and data in hot reset path to accept device fd Yi Liu
@ 2023-03-10  5:01   ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  5:01 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> No function change is intended, just to make the helpers and structures
> to be prepared to accept device fds as proof of device ownership.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 40 ++++++++++++++++----------------
>  1 file changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f13b093557a9..265a0058436c 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -177,10 +177,10 @@ static void vfio_pci_probe_mmaps(struct
> vfio_pci_core_device *vdev)
>  	}
>  }
> 
> -struct vfio_pci_group_info;
> +struct vfio_pci_user_file_info;
>  static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> -				      struct vfio_pci_group_info *groups);
> +				      struct vfio_pci_user_file_info *user_info);

I'd just remove 'user' here and in other places.


* Re: [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
@ 2023-03-10  5:08   ` Tian, Kevin
  2023-03-10  5:38     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  5:08 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> @@ -1319,8 +1319,14 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> vfio_pci_core_device *vdev,
>  			break;
>  		}
> 
> -		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_is_group(file)) {
> +		/*
> +		 * For vfio group FD, sanitize the file is enough.
> +		 * For vfio device FD, needs to ensure it has got the
> +		 * access to device, otherwise it cannot be used as
> +		 * proof of device ownership.
> +		 */
> +		if (!vfio_file_is_valid(file) ||
> +		    (!vfio_file_is_group(file)
> && !vfio_file_has_device_access(file))) {
>  			fput(file);
>  			ret = -EINVAL;
>  			break;

IMHO it's clearer to just check whether it's a valid vfio group/device fd
here.

then further restrictions are checked inside vfio_file_has_dev() when
it's called by vfio_dev_in_user_fds().

if fd is group then check whether device belongs to group.

if fd is device then check whether device allows access.

i.e.

> 
> +/**
> + * vfio_file_has_device_access - True if the file has opened device
> + * @file: VFIO device file
> + */
> +bool vfio_file_has_device_access(struct file *file)
> +{
> +	struct vfio_device_file *df;
> +
> +	if (vfio_group_from_file(file) ||
> +	    !vfio_device_from_file(file))
> +		return false;
> +
> +	df = file->private_data;
> +
> +	return READ_ONCE(df->access_granted);
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_has_device_access);
> +
> +/**
> + * vfio_file_has_dev - True if the VFIO file is a handle for device
> + * @file: VFIO file to check
> + * @device: Device that must be part of the file
> + *
> + * Returns true if given file has permission to manipulate the given device.
> + */
> +bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> +{
> +	struct vfio_group *group;
> +	struct vfio_device *vdev;
> +
> +	group = vfio_group_from_file(file);
> +	if (group)
> +		return vfio_group_has_dev(group, device);
> +
> +	vdev = vfio_device_from_file(file);
> +	if (device)
> +		return vdev == device;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_has_dev);
> +

merge above two into one vfio_file_has_dev().
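
One possible shape of the merged helper, sketched as a self-contained
userspace model rather than kernel code. All model_* types and names are
hypothetical stand-ins (the real helper would use vfio_group_from_file()
and vfio_device_from_file()); the point is only that the group check and
the access_granted check can live in one function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_device {
	int id;
};

struct model_group {
	struct model_device **devices;
	size_t ndevices;
};

struct model_device_file {
	struct model_device *device;
	bool access_granted;
};

/* In this model a file is a group file, a device file, or neither. */
enum model_file_kind { MODEL_FILE_OTHER, MODEL_FILE_GROUP, MODEL_FILE_DEVICE };

struct model_file {
	enum model_file_kind kind;
	void *private_data;
};

static bool model_group_has_dev(const struct model_group *group,
				const struct model_device *device)
{
	for (size_t i = 0; i < group->ndevices; i++)
		if (group->devices[i] == device)
			return true;
	return false;
}

/* Single entry point: true if 'file' proves ownership of 'device'. */
static bool model_file_has_dev(const struct model_file *file,
			       const struct model_device *device)
{
	assert(file != NULL);

	/* Group fd: the device must belong to the group. */
	if (file->kind == MODEL_FILE_GROUP)
		return model_group_has_dev(file->private_data, device);

	/* Device fd: it must refer to this device and have been granted access. */
	if (file->kind == MODEL_FILE_DEVICE) {
		const struct model_device_file *df = file->private_data;

		return df->access_granted && df->device == device;
	}

	return false;
}
```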


* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
@ 2023-03-10  5:31   ` Tian, Kevin
  2023-03-10  6:04     ` Liu, Yi L
  2023-03-15 22:53   ` Alex Williamson
  1 sibling, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  5:31 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, Liu, Yi L, yi.y.sun, mjrosato, kvm, jasowang,
	robin.murphy, joro, cohuck, Hao, Xudong, peterx, eric.auger, Xu,
	Terrence, nicolinc, shameerali.kolothum.thodi,
	suravee.suthikulpanit, chao.p.peng, Zhao, Yan Y, lulu,
	intel-gvt-dev, intel-gfx

> From: Yi Liu
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> This is another method to issue PCI hot reset for the users that bounds
> device to a positive iommufd value. In such case, iommufd is a proof of
> device ownership. By passing a zero-length fd array, user indicates kernel
> to do ownership check with the bound iommufd. All the opened devices
> within
> the affected dev_set should have been bound to the same iommufd. This is
> simpler and faster as user does not need to pass a set of fds and kernel
> no need to search the device within the given fds.
> 
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

I think you also need a s-o-b from Jason since he wrote most of the
code here.

> +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev)
> +{
> +	/* Only serve for physical device */
> +	if (!vdev->iommufd_device)
> +		return NULL;

pointless comment

> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
>   * The ownership can be proved by:
>   *   - An array of group fds
>   *   - An array of device fds
> + *   - A zero-length array
> + *
> + * In the last case all affected devices which are opened by this user
> + * must have been bound to a same iommufd_ctx.  This approach is only
> + * available for devices bound to positive iommufd.

As we chatted before I still think the last sentence is pointless. If a device
is bound to negative iommufd value (i.e. noiommu) it doesn't have a
valid iommufd_ctx so won't meet "bound to a same iommufd_ctx". 
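
To make the point concrete, here is a small userspace model of the
zero-length-array ownership rule. It is a sketch under assumptions, not the
patch's implementation: hr_ctx, hr_device and hr_dev_set_owned_by() are
hypothetical names, and a NULL ctx models a device that is either unbound
or bound in noiommu mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct hr_ctx {
	int id;
};

struct hr_device {
	unsigned int open_count;	/* 0 means not opened */
	struct hr_ctx *ctx;		/* NULL: noiommu or not bound */
};

/*
 * True if every opened device in the set is bound to the caller's ctx.
 * Unopened devices are skipped; a device without a ctx can never match.
 */
static bool hr_dev_set_owned_by(const struct hr_device *devs, size_t n,
				const struct hr_ctx *ctx)
{
	assert(devs != NULL || n == 0);

	if (!ctx)
		return false;

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].open_count)
			continue;
		if (devs[i].ctx != ctx)
			return false;
	}

	return true;
}
```

In this model the noiommu case needs no special wording: a device with no
ctx simply fails the "same ctx" comparison.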


* Re: [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl
  2023-03-10  5:08   ` Tian, Kevin
@ 2023-03-10  5:38     ` Liu, Yi L
  0 siblings, 0 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-10  5:38 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 1:08 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, March 8, 2023 9:29 PM
> >
> > @@ -1319,8 +1319,14 @@ static int vfio_pci_ioctl_pci_hot_reset(struct
> > vfio_pci_core_device *vdev,
> >  			break;
> >  		}
> >
> > -		/* Ensure the FD is a vfio group FD.*/
> > -		if (!vfio_file_is_group(file)) {
> > +		/*
> > +		 * For vfio group FD, sanitize the file is enough.
> > +		 * For vfio device FD, needs to ensure it has got the
> > +		 * access to device, otherwise it cannot be used as
> > +		 * proof of device ownership.
> > +		 */
> > +		if (!vfio_file_is_valid(file) ||
> > +		    (!vfio_file_is_group(file)
> > && !vfio_file_has_device_access(file))) {
> >  			fput(file);
> >  			ret = -EINVAL;
> >  			break;
> 
> IMHO it's clearer to just check whether it's a valid vfio group/device fd
> here.
> 
> then further restrictions are checked inside vfio_file_has_dev() when
> it's called by vfio_dev_in_user_fds().

I see. That leaves a window in which a device file has not yet opened the
device at the time of this check, but opens it before
vfio_pci_dev_set_hot_reset() takes dev_set->lock. That is not a problem,
though, so I can change it. Then vfio_file_is_group() can be removed.

> 
> if fd is group then check whether device belongs to group.
> 
> if fd is device then check whether device allows access.
> 
> i.e.
> 
> >
> > +/**
> > + * vfio_file_has_device_access - True if the file has opened device
> > + * @file: VFIO device file
> > + */
> > +bool vfio_file_has_device_access(struct file *file)
> > +{
> > +	struct vfio_device_file *df;
> > +
> > +	if (vfio_group_from_file(file) ||
> > +	    !vfio_device_from_file(file))
> > +		return false;
> > +
> > +	df = file->private_data;
> > +
> > +	return READ_ONCE(df->access_granted);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_file_has_device_access);
> > +
> > +/**
> > + * vfio_file_has_dev - True if the VFIO file is a handle for device
> > + * @file: VFIO file to check
> > + * @device: Device that must be part of the file
> > + *
> > + * Returns true if given file has permission to manipulate the given device.
> > + */
> > +bool vfio_file_has_dev(struct file *file, struct vfio_device *device)
> > +{
> > +	struct vfio_group *group;
> > +	struct vfio_device *vdev;
> > +
> > +	group = vfio_group_from_file(file);
> > +	if (group)
> > +		return vfio_group_has_dev(group, device);
> > +
> > +	vdev = vfio_device_from_file(file);
> > +	if (device)
> > +		return vdev == device;
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_file_has_dev);
> > +
> 
> merge above two into one vfio_file_has_dev().


* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-10  5:31   ` Tian, Kevin
@ 2023-03-10  6:04     ` Liu, Yi L
  2023-03-10  9:08       ` Tian, Kevin
  2023-03-10 17:42       ` Jason Gunthorpe
  0 siblings, 2 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-10  6:04 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, jasowang, robin.murphy,
	joro, cohuck, Hao, Xudong, peterx, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	chao.p.peng, Zhao, Yan Y, lulu, intel-gvt-dev, intel-gfx

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 1:31 PM
> 
> > From: Yi Liu
> > Sent: Wednesday, March 8, 2023 9:29 PM
> >
> > This is another method to issue PCI hot reset for the users that bounds
> > device to a positive iommufd value. In such case, iommufd is a proof of
> > device ownership. By passing a zero-length fd array, user indicates kernel
> > to do ownership check with the bound iommufd. All the opened devices within
> > the affected dev_set should have been bound to the same iommufd. This is
> > simpler and faster as user does not need to pass a set of fds and kernel
> > no need to search the device within the given fds.
> >
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> 
> I think you also need a s-o-b from Jason since he wrote most of the
> code here.

Yes, it is. I'll add it if no objection from Jason.

> > +struct iommufd_ctx *vfio_iommufd_physical_ctx(struct vfio_device *vdev)
> > +{
> > +	/* Only serve for physical device */
> > +	if (!vdev->iommufd_device)
> > +		return NULL;
> 
> pointless comment

Will remove it.

> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
> >   * The ownership can be proved by:
> >   *   - An array of group fds
> >   *   - An array of device fds
> > + *   - A zero-length array
> > + *
> > + * In the last case all affected devices which are opened by this user
> > + * must have been bound to a same iommufd_ctx.  This approach is only
> > + * available for devices bound to positive iommufd.
> 
> As we chatted before I still think the last sentence is pointless. If a device
> is bound to negative iommufd value (i.e. noiommu) it doesn't have a
> valid iommufd_ctx so won't meet "bound to a same iommufd_ctx".

Yes, it is. But iommufd_ctx is more a kernel thing, userspace may just
know whether it has bound a positive iommufd or a negative iommufd
to the device. So positive iommufd may be more straightforward to
userspace programmers. 😊 If it's really redundant, I can remove it
as well.
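
To make the rule concrete, here is a minimal userspace sketch of the ownership check implied by the zero-length array. It is not kernel code: struct model_dev and dev_set_owned_by() are hypothetical names, and a NULL context stands in for "not bound to a valid iommufd" (which covers the noiommu case).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in the affected dev_set. */
struct model_dev {
	bool opened;        /* opened by a user */
	const void *ictx;   /* bound iommufd context; NULL if none (e.g. noiommu) */
};

/*
 * Returns true if every opened device in the set is bound to @ictx.
 * Devices that are not opened are skipped.
 */
static bool dev_set_owned_by(const struct model_dev *devs, size_t n,
			     const void *ictx)
{
	size_t i;

	assert(devs || n == 0);

	/* Without a valid context there is nothing to prove ownership with. */
	if (!ictx)
		return false;

	for (i = 0; i < n; i++) {
		if (devs[i].opened && devs[i].ictx != ictx)
			return false;
	}
	return true;
}
```

In this model a device without a valid context can never satisfy the check, which is the point made about the last sentence being implied by the one before it.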

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind()
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind() Yi Liu
@ 2023-03-10  8:08   ` Tian, Kevin
  2023-03-10  8:22     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  8:08 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> @@ -177,7 +177,7 @@ static int vfio_device_group_open(struct vfio_device_file *df)
>  	mutex_lock(&device->group->group_lock);
>  	if (!vfio_group_has_iommu(device->group)) {
>  		ret = -EINVAL;
> -		goto out_unlock;
> +		goto err_unlock;
>  	}

My impression: out_xxx means 'go do xxx', while err_xxx means
'go handle error xxx', though in many places the two are mixed
and both end up meaning 'do xxx'.

either way I don't see a need of changing it.
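
The two label conventions can be sketched side by side. This is an illustration only, assuming hypothetical demo_* helpers and a counter in place of a real mutex; none of these names exist in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a mutex: counts lock depth. */
static int demo_lock_depth;

static void demo_lock(void)   { demo_lock_depth++; }
static void demo_unlock(void) { demo_lock_depth--; }

/* out_unlock names the action: shared by the success and failure paths. */
static int demo_open_out(bool has_iommu)
{
	int ret = 0;

	demo_lock();
	if (!has_iommu) {
		ret = -EINVAL;
		goto out_unlock;
	}
	/* ... successful work under the lock falls through ... */
out_unlock:
	demo_unlock();
	assert(demo_lock_depth == 0);
	return ret;
}

/* err_bind names the failure: reached only when the bind step fails. */
static int demo_open_err(bool bind_fails)
{
	int ret;

	demo_lock();
	ret = bind_fails ? -ENODEV : 0;
	if (ret)
		goto err_bind;
	demo_unlock();
	return 0;

err_bind:
	/* cleanup specific to the bind failure would go here */
	demo_unlock();
	return ret;
}
```

In demo_open_out() the label is reached on every path, so naming it after the action fits; in demo_open_err() the label is reached only on failure.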

> -int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
> +static int vfio_iommufd_device_probe_comapt_noiommu(struct vfio_device *vdev,
> +						    struct iommufd_ctx *ictx)

s/comapt/compat/

btw it's clearer to move this check into vfio_device_group_open().

if noiommu then pass NULL to vfio_device_open(), same as the cdev path.

> +
> +int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
> +{
>  	u32 device_id;
>  	int ret;
> 
>  	lockdep_assert_held(&vdev->dev_set->lock);
> 
>  	if (vfio_device_is_noiommu(vdev)) {
> -		if (!capable(CAP_SYS_RAWIO))
> -			return -EPERM;
> -
> -		/*
> -		 * Require no compat ioas to be assigned to proceed. The basic
> -		 * statement is that the user cannot have done something that
> -		 * implies they expected translation to exist
> -		 */
> -		if (!iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id))
> -			return -EPERM;
> -		return 0;
> +		ret = vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx);
> +		if (ret)
> +			return ret;
>  	}
> 
>  	if (WARN_ON(!vdev->ops->bind_iommufd))
>  		return -ENODEV;
> 
> -	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> -	if (ret)
> -		return ret;
> +	/* The legacy path has no way to return the device id */
> +	return vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> +}
> 
> -	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> -	if (ret)
> -		goto err_unbind;
> -	ret = vdev->ops->attach_ioas(vdev, &ioas_id);
> -	if (ret)
> -		goto err_unbind;

after noiommu check and attach_ioas are moved out then this
entire function can be removed now. Just call the ops in
vfio_device_first_open().

> +int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
> +				    struct iommufd_ctx *ictx)
> +{
> +	u32 ioas_id;
> +	int ret;
> +
> +	lockdep_assert_held(&vdev->dev_set->lock);
> 
>  	/*
> -	 * The legacy path has no way to return the device id or the selected
> -	 * pt_id
> +	 * If the driver doesn't provide this op then it means the device does
> +	 * not do DMA at all. So nothing to do.
>  	 */
> -	return 0;
> +	if (WARN_ON(!vdev->ops->bind_iommufd))
> +		return -ENODEV;
> 
> -err_unbind:
> -	if (vdev->ops->unbind_iommufd)
> -		vdev->ops->unbind_iommufd(vdev);
> -	return ret;
> +	if (vfio_device_is_noiommu(vdev)) {
> +		if (WARN_ON(vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx)))
> +			return -EINVAL;
> +		return 0;
> +	}

No need. Let's call the following directly from vfio_device_group_open().
Then the noiommu check is not done twice in one function.

> +
> +	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> +	if (ret)
> +		return ret;
> +
> +	/* The legacy path has no way to return the selected pt_id */
> +	return vdev->ops->attach_ioas(vdev, &ioas_id);
>  }
> 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind()
  2023-03-10  8:08   ` Tian, Kevin
@ 2023-03-10  8:22     ` Liu, Yi L
  2023-03-10  9:10       ` Tian, Kevin
  2023-03-11 10:24       ` Liu, Yi L
  0 siblings, 2 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-10  8:22 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 4:08 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, March 8, 2023 9:29 PM
> >
> > @@ -177,7 +177,7 @@ static int vfio_device_group_open(struct vfio_device_file *df)
> >  	mutex_lock(&device->group->group_lock);
> >  	if (!vfio_group_has_iommu(device->group)) {
> >  		ret = -EINVAL;
> > -		goto out_unlock;
> > +		goto err_unlock;
> >  	}
> 
> My impression - out_xxx means go to do xxx while err_xxx means
> go to do something for error xxx, though in many places the two
> are mixed to both meaning 'do xxx'.
>
> either way I don't see a need of changing it.

Ok. I'm fine with either way.

> > -int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
> > +static int vfio_iommufd_device_probe_comapt_noiommu(struct vfio_device *vdev,
> > +						    struct iommufd_ctx *ictx)
> 
> s/comapt/compat/
> 
> btw it's clearer to move this check into vfio_device_group_open().
> 
> if noiommu then pass NULL to vfio_device_open(), same as the cdev path.

Right.

> > +
> > +int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx *ictx)
> > +{
> >  	u32 device_id;
> >  	int ret;
> >
> >  	lockdep_assert_held(&vdev->dev_set->lock);
> >
> >  	if (vfio_device_is_noiommu(vdev)) {
> > -		if (!capable(CAP_SYS_RAWIO))
> > -			return -EPERM;
> > -
> > -		/*
> > -		 * Require no compat ioas to be assigned to proceed. The basic
> > -		 * statement is that the user cannot have done something that
> > -		 * implies they expected translation to exist
> > -		 */
> > -		if (!iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id))
> > -			return -EPERM;
> > -		return 0;
> > +		ret = vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx);
> > +		if (ret)
> > +			return ret;
> >  	}
> >
> >  	if (WARN_ON(!vdev->ops->bind_iommufd))
> >  		return -ENODEV;
> >
> > -	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> > -	if (ret)
> > -		return ret;
> > +	/* The legacy path has no way to return the device id */
> > +	return vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> > +}
> >
> > -	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> > -	if (ret)
> > -		goto err_unbind;
> > -	ret = vdev->ops->attach_ioas(vdev, &ioas_id);
> > -	if (ret)
> > -		goto err_unbind;
> 
> after noiommu check and attach_ioas are moved out then this
> entire function can be removed now. Just call the ops in
> vfio_device_first_open().

Yes. and also no vfio_iommufd_unbind().

> 
> > +int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
> > +				    struct iommufd_ctx *ictx)
> > +{
> > +	u32 ioas_id;
> > +	int ret;
> > +
> > +	lockdep_assert_held(&vdev->dev_set->lock);
> >
> >  	/*
> > -	 * The legacy path has no way to return the device id or the selected
> > -	 * pt_id
> > +	 * If the driver doesn't provide this op then it means the device does
> > +	 * not do DMA at all. So nothing to do.
> >  	 */
> > -	return 0;
> > +	if (WARN_ON(!vdev->ops->bind_iommufd))
> > +		return -ENODEV;
> >
> > -err_unbind:
> > -	if (vdev->ops->unbind_iommufd)
> > -		vdev->ops->unbind_iommufd(vdev);
> > -	return ret;
> > +	if (vfio_device_is_noiommu(vdev)) {
> > +		if (WARN_ON(vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx)))
> > +			return -EINVAL;
> > +		return 0;
> > +	}
> 
> no need. let's directly call following from vfio_device_group_open().
> In that case no need to do noiommu check twice in one function.

OK. Maybe still keep vfio_iommufd_attach_compat_ioas() but
only call it when not in noiommu mode. vfio_device_group_open()
can call probe_noiommu() first and keep a bool to mark noiommu.
Jason remarked that it's better to keep the
iommufd_vfio_compat_ioas_get_id() call in iommufd.c.

> 
> > +
> > +	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/* The legacy path has no way to return the selected pt_id */
> > +	return vdev->ops->attach_ioas(vdev, &ioas_id);
> >  }
> >

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 16/24] vfio: Make vfio_device_first_open() to cover the noiommu mode in cdev path
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 16/24] vfio: Make vfio_device_first_open() to cover the noiommu mode in " Yi Liu
@ 2023-03-10  8:30   ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  8:30 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> +/*
> + * This shall be used without group lock as group and group->container
> + * should be fixed before group is set to df->group.
> + */

/* No group lock since df->group and df->group->container cannot change */

> +bool vfio_device_group_uses_container(struct vfio_device_file *df)
> +{
> +	/*
> +	 * Use the df->group instead of the df->device->group as no
> +	 * lock is acquired here.
> +	 */

remove

> 
> +	/*
> +	 * The handling here depends on what the user is using.
> +	 *
> +	 * If user uses iommufd in the group compat mode or the
> +	 * cdev path, call vfio_iommufd_bind().
> +	 *
> +	 * If user uses container in the group legacy mode, call
> +	 * vfio_device_group_use_iommu().

this is what code does.

> +	 *
> +	 * If user doesn't use iommufd nor container, this is
> +	 * the noiommufd mode in the cdev path, nothing needs
> +	 * to be done here just go ahead to open device.
> +	 */
>  	if (iommufd)
>  		ret = vfio_iommufd_bind(device, iommufd);
> -	else
> +	else if (vfio_device_group_uses_container(df))
>  		ret = vfio_device_group_use_iommu(device);

With the earlier suggestion, iommufd=NULL always means
noiommu in this function.

Then the comment could be as simple as:

/*
 * If neither iommufd nor container is used, the device is in
 * noiommu mode, so just go ahead and open it.
 */
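
The three-way choice discussed for the first-open path can be modelled as below. This is a rough sketch under stated assumptions: struct demo_file and demo_first_open_path() are hypothetical, and the ordering (iommufd checked first, then container, otherwise noiommu) follows the if/else-if shape of the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_open_path {
	DEMO_PATH_IOMMUFD,    /* bind the device to the iommufd */
	DEMO_PATH_CONTAINER,  /* legacy group container path */
	DEMO_PATH_NOIOMMU,    /* nothing to set up, just open the device */
};

/* Hypothetical model of what the device file knows at first open. */
struct demo_file {
	const void *iommufd;  /* NULL when no iommufd was supplied */
	bool uses_container;  /* group is attached to a legacy container */
};

static enum demo_open_path demo_first_open_path(const struct demo_file *df)
{
	assert(df);

	if (df->iommufd)
		return DEMO_PATH_IOMMUFD;
	if (df->uses_container)
		return DEMO_PATH_CONTAINER;
	/* Neither iommufd nor container: noiommu mode. */
	return DEMO_PATH_NOIOMMU;
}
```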

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 17/24] vfio-iommufd: Make vfio_iommufd_bind() selectively return devid
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 17/24] vfio-iommufd: Make vfio_iommufd_bind() selectively return devid Yi Liu
@ 2023-03-10  8:31   ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  8:31 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> bind_iommufd() will generate an ID to represent this bond, it is needed
> by userspace for further usage. devid is stored in vfio_device_file to
> avoid passing devid pointer in multiple places.

after removing vfio_iommufd_bind() then the caller can directly get
the id when calling .bind_iommufd().

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device Yi Liu
@ 2023-03-10  8:48   ` Tian, Kevin
  2023-03-10  9:59     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  8:48 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> +	/*
> +	 * Placing it before vfio_device_put_registration() to prevent
> +	 * new registration refcount increment by VFIO_GROUP_GET_DEVICE_FD
> +	 * during the unregister time.
> +	 */
> +	vfio_device_group_unregister(device);
> +
> +	/*
> +	 * Balances vfio_device_add() in the register path. Placing it before
> +	 * vfio_device_put_registration() to prevent new registration refcount
> +	 * increment by the device cdev open during the unregister time.
> +	 */
> +	vfio_device_del(device);
> +

What about below?

	/*
	 * Cleanup to pair with the register path. Must be done
	 * before vfio_device_put_registration() to avoid racing with
	 * a new registration.
	 */
	vfio_device_group_unregister(device);
	vfio_device_del(device);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD Yi Liu
@ 2023-03-10  9:01   ` Tian, Kevin
  2023-03-10  9:58     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  9:01 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> +
> +static int vfio_device_cdev_probe_noiommu(struct vfio_device *device)
> +{
> +	struct iommu_group *iommu_group;
> +	int ret = 0;
> +
> +	if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> +		return -EINVAL;
> +
> +	if (!capable(CAP_SYS_RAWIO))
> +		return -EPERM;
> +
> +	iommu_group = iommu_group_get(device->dev);
> +	if (!iommu_group)
> +		return 0;
> +
> +	/*
> +	 * We cannot support noiommu mode for devices that are protected
> +	 * by IOMMU.  So check the iommu_group, if it is a no-iommu group
> +	 * created by VFIO, we support. If not, we refuse.
> +	 */
> +	if (!vfio_group_find_noiommu_group_from_iommu(iommu_group))
> +		ret = -EINVAL;
> +	iommu_group_put(iommu_group);
> +	return ret;

can check whether group->name == "vfio-noiommu"?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 23/24] vfio: Compile group optionally
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 23/24] vfio: Compile group optionally Yi Liu
@ 2023-03-10  9:03   ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  9:03 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 8, 2023 9:29 PM
> 
> group code is not needed for vfio device cdev, so with vfio device cdev
> introduced, the group infrastructures can be compiled out if only cdev
> is needed.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-10  6:04     ` Liu, Yi L
@ 2023-03-10  9:08       ` Tian, Kevin
  2023-03-10 17:42       ` Jason Gunthorpe
  1 sibling, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  9:08 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, jasowang, robin.murphy,
	joro, cohuck, Hao, Xudong, peterx, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	chao.p.peng, Zhao, Yan Y, lulu, intel-gvt-dev, intel-gfx

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 10, 2023 2:04 PM
> > > + *
> > > + * In the last case all affected devices which are opened by this user
> > > + * must have been bound to a same iommufd_ctx.  This approach is only
> > > + * available for devices bound to positive iommufd.
> >
> > As we chatted before I still think the last sentence is pointless. If a device
> > is bound to negative iommufd value (i.e. noiommu) it doesn't have a
> > valid iommufd_ctx so won't meet "bound to a same iommufd_ctx".
> 
> Yes, it is. But iommufd_ctx is more a kernel thing, userspace may just
> know whether it has bound a positive iommufd or a negative iommufd
> to the device. So positive iommufd may be more straightforward to
> userspace programmers. 😊 If it's really redundant, I can remove it
> as well.
> 

s/iommufd_ctx/iommufd/.

negative value is not a fd. Just a uAPI format to mark noiommu.
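
A small sketch of that convention, assuming a hypothetical struct demo_bind_arg that only mirrors the signed field: a negative value is a marker and is never handed to an fd lookup. Treating zero as a usable fd is an assumption of this model; the thread itself only speaks of "positive" values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the signed iommufd field in the bind argument. */
struct demo_bind_arg {
	int32_t iommufd;
};

/* A negative value is only a marker for noiommu, not a file descriptor. */
static bool demo_bind_is_noiommu(const struct demo_bind_arg *arg)
{
	assert(arg);
	return arg->iommufd < 0;
}

/* Returns the fd to look up, or -1 when there is no fd at all. */
static int demo_bind_fd(const struct demo_bind_arg *arg)
{
	return demo_bind_is_noiommu(arg) ? -1 : arg->iommufd;
}
```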

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind()
  2023-03-10  8:22     ` Liu, Yi L
@ 2023-03-10  9:10       ` Tian, Kevin
  2023-03-11 10:24       ` Liu, Yi L
  1 sibling, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10  9:10 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 10, 2023 4:22 PM
> 
> >
> > > +int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
> > > +				    struct iommufd_ctx *ictx)
> > > +{
> > > +	u32 ioas_id;
> > > +	int ret;
> > > +
> > > +	lockdep_assert_held(&vdev->dev_set->lock);
> > >
> > >  	/*
> > > -	 * The legacy path has no way to return the device id or the selected
> > > -	 * pt_id
> > > +	 * If the driver doesn't provide this op then it means the device does
> > > +	 * not do DMA at all. So nothing to do.
> > >  	 */
> > > -	return 0;
> > > +	if (WARN_ON(!vdev->ops->bind_iommufd))
> > > +		return -ENODEV;
> > >
> > > -err_unbind:
> > > -	if (vdev->ops->unbind_iommufd)
> > > -		vdev->ops->unbind_iommufd(vdev);
> > > -	return ret;
> > > +	if (vfio_device_is_noiommu(vdev)) {
> > > +		if (WARN_ON(vfio_iommufd_device_probe_comapt_noiommu(vdev, ictx)))
> > > +			return -EINVAL;
> > > +		return 0;
> > > +	}
> >
> > no need. let's directly call following from vfio_device_group_open().
> > In that case no need to do noiommu check twice in one function.
> 
> Ok. maybe still have vfio_iommufd_attach_compat_ioas() but
> only call it if it's not noiommu mode. vfio_device_group_open()
> can call probe_noiommu() first and has a bool to mark noiommu.
> Jason had a remark that it's better to keep the
> iommufd_vfio_compat_ioas_get_id() in iommufd.c
> 

Probably that remark doesn't hold now if we agree to remove
vfio_iommufd_bind() and let vfio_device_group_open() directly
call .bind_iommufd().

also group.c already calls other compat API:

                if (IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
                    group->type == VFIO_NO_IOMMU)
                        ret = iommufd_vfio_compat_set_no_iommu(iommufd);
                else
                        ret = iommufd_vfio_compat_ioas_create(iommufd);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-10  9:01   ` Tian, Kevin
@ 2023-03-10  9:58     ` Liu, Yi L
  2023-03-10 10:06       ` Tian, Kevin
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-10  9:58 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 5:02 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, March 8, 2023 9:29 PM
> > +
> > +static int vfio_device_cdev_probe_noiommu(struct vfio_device *device)
> > +{
> > +	struct iommu_group *iommu_group;
> > +	int ret = 0;
> > +
> > +	if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > +		return -EINVAL;
> > +
> > +	if (!capable(CAP_SYS_RAWIO))
> > +		return -EPERM;
> > +
> > +	iommu_group = iommu_group_get(device->dev);
> > +	if (!iommu_group)
> > +		return 0;
> > +
> > +	/*
> > +	 * We cannot support noiommu mode for devices that are protected
> > +	 * by IOMMU.  So check the iommu_group, if it is a no-iommu group
> > +	 * created by VFIO, we support. If not, we refuse.
> > +	 */
> > +	if (!vfio_group_find_noiommu_group_from_iommu(iommu_group))
> > +		ret = -EINVAL;
> > +	iommu_group_put(iommu_group);
> > +	return ret;
> 
> can check whether group->name == "vfio-noiommu"?

But VFIO names it to be "vfio-noiommu" for both VFIO_EMULATED_IOMMU
and VFIO_NO_IOMMU. And we don't support no-iommu mode for emulated
devices since VFIO_MAP/UNMAP, pin_page(), dma_rw() won't work in the
no-iommu mode.
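
To illustrate why the name alone cannot tell the two apart, here is a small model. It relies on the statement above that both group types carry the "vfio-noiommu" name; the demo_* types and functions are hypothetical and only mimic the distinction.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of the three VFIO group types. */
enum demo_group_type {
	DEMO_IOMMU,
	DEMO_EMULATED_IOMMU,
	DEMO_NO_IOMMU,
};

struct demo_group {
	enum demo_group_type type;
	const char *name;
};

/* Name-based check: cannot separate emulated from no-iommu groups. */
static bool demo_noiommu_by_name(const struct demo_group *g)
{
	assert(g && g->name);
	return strcmp(g->name, "vfio-noiommu") == 0;
}

/* Type-based check: accepts only real no-iommu groups. */
static bool demo_noiommu_by_type(const struct demo_group *g)
{
	assert(g);
	return g->type == DEMO_NO_IOMMU;
}
```

For an emulated group the name-based check would wrongly allow noiommu mode, while the type-based check refuses it.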

So maybe something like below in drivers/vfio/vfio.h. It can be used
to replace the code from iommu_group_get() to
vfio_group_find_noiommu_group_from_iommu() in my patch.

#if IS_ENABLED(CONFIG_VFIO_GROUP)
static inline bool vfio_device_group_allow_noiommu(struct vfio_device *device)
{
	lockdep_assert_held(&device->dev_set->lock);

	return device->group->type == VFIO_NO_IOMMU;
}
#else
static inline bool vfio_device_group_allow_noiommu(struct vfio_device *device)
{
	struct iommu_group *iommu_group;

	lockdep_assert_held(&device->dev_set->lock);

	iommu_group = iommu_group_get(device->dev);
	if (iommu_group)
		iommu_group_put(iommu_group);

	return !iommu_group;
}
#endif

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device
  2023-03-10  8:48   ` Tian, Kevin
@ 2023-03-10  9:59     ` Liu, Yi L
  0 siblings, 0 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-10  9:59 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 4:49 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Wednesday, March 8, 2023 9:29 PM
> >
> > +	/*
> > +	 * Placing it before vfio_device_put_registration() to prevent
> > +	 * new registration refcount increment by VFIO_GROUP_GET_DEVICE_FD
> > +	 * during the unregister time.
> > +	 */
> > +	vfio_device_group_unregister(device);
> > +
> > +	/*
> > +	 * Balances vfio_device_add() in the register path. Placing it before
> > +	 * vfio_device_put_registration() to prevent new registration refcount
> > +	 * increment by the device cdev open during the unregister time.
> > +	 */
> > +	vfio_device_del(device);
> > +
> 
> What about below?
> 
> 	/*
> 	 * Cleanup to pair with the register path. Must be done
> 	 * before vfio_device_put_registration() to avoid racing with
> 	 * a new registration.
> 	 */
> 	vfio_device_group_unregister(device);
> 	vfio_device_del(device);

"new registration" is a bit confusing. Maybe "new registration refcount
increment by userspace".

Regards,
Yi Liu 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-10  9:58     ` Liu, Yi L
@ 2023-03-10 10:06       ` Tian, Kevin
  2023-03-15  4:40         ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-10 10:06 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 10, 2023 5:58 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, March 10, 2023 5:02 PM
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Wednesday, March 8, 2023 9:29 PM
> > > +
> > > +static int vfio_device_cdev_probe_noiommu(struct vfio_device *device)
> > > +{
> > > +	struct iommu_group *iommu_group;
> > > +	int ret = 0;
> > > +
> > > +	if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > +		return -EINVAL;
> > > +
> > > +	if (!capable(CAP_SYS_RAWIO))
> > > +		return -EPERM;
> > > +
> > > +	iommu_group = iommu_group_get(device->dev);
> > > +	if (!iommu_group)
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * We cannot support noiommu mode for devices that are protected
> > > +	 * by IOMMU.  So check the iommu_group, if it is a no-iommu group
> > > +	 * created by VFIO, we support. If not, we refuse.
> > > +	 */
> > > +	if (!vfio_group_find_noiommu_group_from_iommu(iommu_group))
> > > +		ret = -EINVAL;
> > > +	iommu_group_put(iommu_group);
> > > +	return ret;
> >
> > can check whether group->name == "vfio-noiommu"?
> 
> But VFIO names it to be "vfio-noiommu" for both VFIO_EMULATED_IOMMU
> and VFIO_NO_IOMMU. And we don't support no-iommu mode for emulated
> devices since VFIO_MAP/UNMAP, pin_page(), dma_rw() won't work in the
> no-iommu mode.

correct.

> 
> So maybe something like below in drivers/vfio/vfio.h. It can be used
> to replace the code from iommu_group_get() to
> vfio_group_find_noiommu_group_from_iommu() in my patch.
> 
> #if IS_ENABLED(CONFIG_VFIO_GROUP)
> static inline bool vfio_device_group_allow_noiommu(struct vfio_device *device)
> {
> 	lockdep_assert_held(&device->dev_set->lock);
> 
> 	return device->group->type == VFIO_NO_IOMMU;
> }
> #else
> static inline bool vfio_device_group_allow_noiommu(struct vfio_device *device)
> {
> 	struct iommu_group *iommu_group;
> 
> 	lockdep_assert_held(&device->dev_set->lock);
> 
> 	iommu_group = iommu_group_get(device->dev);
> 	if (iommu_group)
> 		iommu_group_put(iommu_group);
> 
> 	return !iommu_group;
> }
> #endif

this makes sense.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-10  6:04     ` Liu, Yi L
  2023-03-10  9:08       ` Tian, Kevin
@ 2023-03-10 17:42       ` Jason Gunthorpe
  1 sibling, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-10 17:42 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: kvm, jasowang, Hao, Xudong, peterx, Xu, Terrence, chao.p.peng,
	linux-s390, mjrosato, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Fri, Mar 10, 2023 at 06:04:02AM +0000, Liu, Yi L wrote:
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, March 10, 2023 1:31 PM
> > 
> > > From: Yi Liu
> > > Sent: Wednesday, March 8, 2023 9:29 PM
> > >
> > > This is another method to issue PCI hot reset for the users that bounds
> > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > > to do ownership check with the bound iommufd. All the opened devices within
> > > > the affected dev_set should have been bound to the same iommufd. This is
> > > simpler and faster as user does not need to pass a set of fds and kernel
> > > no need to search the device within the given fds.
> > >
> > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > 
> > I think you also need a s-o-b from Jason since he wrote most of the
> > code here.
> 
> Yes, it is. I'll add it if no objection from Jason.

Go ahead

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated VFIO devices
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated " Yi Liu
@ 2023-03-10 23:42   ` Nicolin Chen
  2023-03-15  6:15     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Nicolin Chen @ 2023-03-10 23:42 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, joro, jgg, yan.y.zhao, intel-gfx,
	eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, Mar 08, 2023 at 05:28:58AM -0800, Yi Liu wrote:
> this prepares for adding DETACH ioctl for emulated VFIO devices.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Tested-by: Terrence Xu <terrence.xu@intel.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c  |  1 +
>  drivers/s390/cio/vfio_ccw_ops.c   |  1 +
>  drivers/s390/crypto/vfio_ap_ops.c |  1 +
>  drivers/vfio/iommufd.c            | 14 +++++++++++++-
>  include/linux/vfio.h              |  3 +++
>  5 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index de675d799c7d..9cd9e9da60dd 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1474,6 +1474,7 @@ static const struct vfio_device_ops intel_vgpu_dev_ops = {
>         .bind_iommufd   = vfio_iommufd_emulated_bind,
>         .unbind_iommufd = vfio_iommufd_emulated_unbind,
>         .attach_ioas    = vfio_iommufd_emulated_attach_ioas,
> +       .detach_ioas    = vfio_iommufd_emulated_detach_ioas,
>  };
> 
>  static int intel_vgpu_probe(struct mdev_device *mdev)
> diff --git a/drivers/s390/cio/vfio_ccw_ops.c b/drivers/s390/cio/vfio_ccw_ops.c
> index 5b53b94f13c7..cba4971618ff 100644
> --- a/drivers/s390/cio/vfio_ccw_ops.c
> +++ b/drivers/s390/cio/vfio_ccw_ops.c
> @@ -632,6 +632,7 @@ static const struct vfio_device_ops vfio_ccw_dev_ops = {
>         .bind_iommufd = vfio_iommufd_emulated_bind,
>         .unbind_iommufd = vfio_iommufd_emulated_unbind,
>         .attach_ioas = vfio_iommufd_emulated_attach_ioas,
> +       .detach_ioas = vfio_iommufd_emulated_detach_ioas,
>  };
> 
>  struct mdev_driver vfio_ccw_mdev_driver = {
> diff --git a/drivers/s390/crypto/vfio_ap_ops.c b/drivers/s390/crypto/vfio_ap_ops.c
> index 72e10abb103a..9902e62e7a17 100644
> --- a/drivers/s390/crypto/vfio_ap_ops.c
> +++ b/drivers/s390/crypto/vfio_ap_ops.c
> @@ -1844,6 +1844,7 @@ static const struct vfio_device_ops vfio_ap_matrix_dev_ops = {
>         .bind_iommufd = vfio_iommufd_emulated_bind,
>         .unbind_iommufd = vfio_iommufd_emulated_unbind,
>         .attach_ioas = vfio_iommufd_emulated_attach_ioas,
> +       .detach_ioas = vfio_iommufd_emulated_detach_ioas,
>  };
> 
>  static struct mdev_driver vfio_ap_matrix_driver = {
> diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> index c06494e322f9..8a9457d0a33c 100644
> --- a/drivers/vfio/iommufd.c
> +++ b/drivers/vfio/iommufd.c
> @@ -218,8 +218,20 @@ int vfio_iommufd_emulated_attach_ioas(struct vfio_device *vdev, u32 *pt_id)
>  {
>         lockdep_assert_held(&vdev->dev_set->lock);
> 
> -       if (!vdev->iommufd_access)
> +       if (WARN_ON(!vdev->iommufd_access))
>                 return -ENOENT;
>         return iommufd_access_set_ioas(vdev->iommufd_access, *pt_id);
>  }
>  EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_attach_ioas);
> +
> +void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev)
> +{
> +       lockdep_assert_held(&vdev->dev_set->lock);
> +
> +       if (WARN_ON(!vdev->iommufd_access))
> +               return;
> +
[...]
> +       iommufd_access_destroy(vdev->iommufd_access);
> +       vdev->iommufd_access = NULL;

After moving access allocation/destroy to bind/unbind, here it
should be:
	iommufd_access_set_ioas(vdev->iommufd_access, 0);
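
To illustrate the lifecycle this implies, here is a minimal userspace
sketch, not kernel code. The mock_* names and struct layouts are made up
for illustration; only the convention that an IOAS id of 0 means "nothing
attached" is taken from the call above. The access is allocated in bind and
freed in unbind, so detach only clears the IOAS and leaves the access in
place for a later re-attach:

```c
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the kernel's iommufd types. */
struct mock_access {
	unsigned int ioas_id;	/* 0 means "no IOAS attached" */
};

struct mock_vdev {
	struct mock_access *access;
};

/* bind: the access object is allocated here */
static int mock_bind(struct mock_vdev *vdev)
{
	vdev->access = calloc(1, sizeof(*vdev->access));
	return vdev->access ? 0 : -1;
}

/* attach: only records which IOAS the existing access uses */
static int mock_attach_ioas(struct mock_vdev *vdev, unsigned int ioas_id)
{
	if (!vdev->access)
		return -1;
	vdev->access->ioas_id = ioas_id;
	return 0;
}

/* detach: clears the IOAS but keeps the access object alive */
static void mock_detach_ioas(struct mock_vdev *vdev)
{
	if (!vdev->access)
		return;
	vdev->access->ioas_id = 0;
}

/* unbind: the access object is destroyed here, mirroring bind */
static void mock_unbind(struct mock_vdev *vdev)
{
	free(vdev->access);
	vdev->access = NULL;
}
```

With this split, attach/detach/attach works repeatedly between one bind
and one unbind, which is presumably what the DETACH ioctl needs.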

Thanks
Nic


* Re: [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind()
  2023-03-10  8:22     ` Liu, Yi L
  2023-03-10  9:10       ` Tian, Kevin
@ 2023-03-11 10:24       ` Liu, Yi L
  2023-03-13  2:06         ` Tian, Kevin
  1 sibling, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-11 10:24 UTC (permalink / raw)
  To: Liu, Yi L, Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

Hi Kevin,

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 10, 2023 4:22 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, March 10, 2023 4:08 PM
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Wednesday, March 8, 2023 9:29 PM
> > >
> > > @@ -177,7 +177,7 @@ static int vfio_device_group_open(struct
> > > vfio_device_file *df)
> > >  	mutex_lock(&device->group->group_lock);
> > >  	if (!vfio_group_has_iommu(device->group)) {
> > >  		ret = -EINVAL;
> > > -		goto out_unlock;
> > > +		goto err_unlock;
> > >  	}
> >
> > My impression - out_xxx means go to do xxx while err_xxx means
> > go to do something for error xxx, though in many places the two
> > are mixed to both meaning 'do xxx'.
> >
> > either way I don't see a need of changing it.
> 
> Ok. I'm fine with either way.
> 
> > > -int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx
> > *ictx)
> > > +static int vfio_iommufd_device_probe_comapt_noiommu(struct
> > vfio_device
> > > *vdev,
> > > +						    struct iommufd_ctx *ictx)
> >
> > s/comapt/compat/
> >
> > btw it's clearer to move this check into vfio_device_group_open().
> >
> > if noiommu then pass NULL to vfio_device_open(), same as the cdev path.
> 
> Right.
> 
> > > +
> > > +int vfio_iommufd_bind(struct vfio_device *vdev, struct iommufd_ctx
> > *ictx)
> > > +{
> > >  	u32 device_id;
> > >  	int ret;
> > >
> > >  	lockdep_assert_held(&vdev->dev_set->lock);
> > >
> > >  	if (vfio_device_is_noiommu(vdev)) {
> > > -		if (!capable(CAP_SYS_RAWIO))
> > > -			return -EPERM;
> > > -
> > > -		/*
> > > -		 * Require no compat ioas to be assigned to proceed. The
> > > basic
> > > -		 * statement is that the user cannot have done something
> > > that
> > > -		 * implies they expected translation to exist
> > > -		 */
> > > -		if (!iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id))
> > > -			return -EPERM;
> > > -		return 0;
> > > +		ret = vfio_iommufd_device_probe_comapt_noiommu(vdev,
> > > ictx);
> > > +		if (ret)
> > > +			return ret;
> > >  	}
> > >
> > >  	if (WARN_ON(!vdev->ops->bind_iommufd))
> > >  		return -ENODEV;
> > >
> > > -	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> > > -	if (ret)
> > > -		return ret;
> > > +	/* The legacy path has no way to return the device id */
> > > +	return vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> > > +}
> > >
> > > -	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> > > -	if (ret)
> > > -		goto err_unbind;
> > > -	ret = vdev->ops->attach_ioas(vdev, &ioas_id);
> > > -	if (ret)
> > > -		goto err_unbind;
> >
> > after noiommu check and attach_ioas are moved out then this
> > entire function can be removed now. Just call the ops in
> > vfio_device_first_open().
> 
> Yes. and also no vfio_iommufd_unbind().

It still seems necessary to keep this wrapper. The .bind_iommufd callback
is NULL if CONFIG_IOMMUFD==n. If we call ops->bind_iommufd directly in
vfio_device_first_open() of vfio_main.c, buggy code that passes a valid
iommufd pointer would trigger a kernel panic from the NULL pointer
dereference. Ideally, if CONFIG_IOMMUFD==n, vfio_device_first_open()
should never receive a valid iommufd pointer and hence would never call
ops->bind_iommufd at all, so such a bug arguably deserves a panic. With a
wrapper, however, such code just fails with -EOPNOTSUPP.
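
A minimal userspace sketch of the difference, under stated assumptions:
the mock_* types and functions are hypothetical, and the explicit NULL
check stands in for whatever form the real wrapper takes when
CONFIG_IOMMUFD==n. The point is only that the wrapper turns a would-be
NULL function pointer call into an errno:

```c
#include <errno.h>
#include <stddef.h>

struct mock_vdev;

/* Hypothetical ops table; the callback may be left NULL. */
struct mock_ops {
	int (*bind_iommufd)(struct mock_vdev *vdev, void *ictx);
};

struct mock_vdev {
	const struct mock_ops *ops;
};

/*
 * Wrapper in the spirit of vfio_iommufd_bind(): a caller that wrongly
 * passes an iommufd context gets an error back instead of jumping
 * through a NULL function pointer.
 */
static int mock_iommufd_bind(struct mock_vdev *vdev, void *ictx)
{
	if (!vdev->ops->bind_iommufd)
		return -EOPNOTSUPP;
	return vdev->ops->bind_iommufd(vdev, ictx);
}

/* Trivial callback used to show the normal, non-NULL case. */
static int mock_bind_ok(struct mock_vdev *vdev, void *ictx)
{
	(void)vdev;
	(void)ictx;
	return 0;
}
```

Calling vdev->ops->bind_iommufd(vdev, ictx) directly with the callback
left NULL would crash instead of returning.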

> >
> > > +int vfio_iommufd_attach_compat_ioas(struct vfio_device *vdev,
> > > +				    struct iommufd_ctx *ictx)
> > > +{
> > > +	u32 ioas_id;
> > > +	int ret;
> > > +
> > > +	lockdep_assert_held(&vdev->dev_set->lock);
> > >
> > >  	/*
> > > -	 * The legacy path has no way to return the device id or the selected
> > > -	 * pt_id
> > > +	 * If the driver doesn't provide this op then it means the device does
> > > +	 * not do DMA at all. So nothing to do.
> > >  	 */
> > > -	return 0;
> > > +	if (WARN_ON(!vdev->ops->bind_iommufd))
> > > +		return -ENODEV;
> > >
> > > -err_unbind:
> > > -	if (vdev->ops->unbind_iommufd)
> > > -		vdev->ops->unbind_iommufd(vdev);
> > > -	return ret;
> > > +	if (vfio_device_is_noiommu(vdev)) {
> > > +		if
> > > (WARN_ON(vfio_iommufd_device_probe_comapt_noiommu(vdev,
> ictx)))
> > > +			return -EINVAL;
> > > +		return 0;
> > > +	}
> >
> > no need. let's directly call following from vfio_device_group_open().
> > In that case no need to do noiommu check twice in one function.
> 
> Ok. maybe still have vfio_iommufd_attach_compat_ioas() but
> only call it if it's not noiommu mode. vfio_device_group_open()
> can call probe_noiommu() first and has a bool to mark noiommu.
> Jason had a remark that it's better to keep the
> iommufd_vfio_compat_ioas_get_id() in iommufd.c

The same applies as with .bind_iommufd(): if we move the compat ioas
attach code to group.c, buggy code that passes a valid iommufd pointer
may trigger a kernel panic.

> >
> > > +
> > > +	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	/* The legacy path has no way to return the selected pt_id */
> > > +	return vdev->ops->attach_ioas(vdev, &ioas_id);
> > >  }
> > >

Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind()
  2023-03-11 10:24       ` Liu, Yi L
@ 2023-03-13  2:06         ` Tian, Kevin
  0 siblings, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-13  2:06 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Saturday, March 11, 2023 6:24 PM
> > > >
> > > > -	ret = vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> > > > -	if (ret)
> > > > -		return ret;
> > > > +	/* The legacy path has no way to return the device id */
> > > > +	return vdev->ops->bind_iommufd(vdev, ictx, &device_id);
> > > > +}
> > > >
> > > > -	ret = iommufd_vfio_compat_ioas_get_id(ictx, &ioas_id);
> > > > -	if (ret)
> > > > -		goto err_unbind;
> > > > -	ret = vdev->ops->attach_ioas(vdev, &ioas_id);
> > > > -	if (ret)
> > > > -		goto err_unbind;
> > >
> > > after noiommu check and attach_ioas are moved out then this
> > > entire function can be removed now. Just call the ops in
> > > vfio_device_first_open().
> >
> > Yes. and also no vfio_iommufd_unbind().
> 
> It still seems necessary to keep this wrapper. The .bind_iommufd callback
> is NULL if CONFIG_IOMMUFD==n. If we call ops->bind_iommufd directly in
> vfio_device_first_open() of vfio_main.c, buggy code that passes a valid
> iommufd pointer would trigger a kernel panic from the NULL pointer
> dereference. Ideally, if CONFIG_IOMMUFD==n, vfio_device_first_open()
> should never receive a valid iommufd pointer and hence would never call
> ops->bind_iommufd at all, so such a bug arguably deserves a panic. With a
> wrapper, however, such code just fails with -EOPNOTSUPP.
> 

ok, let's keep this wrapper then. I didn't realize it's NULL if
CONFIG_IOMMUFD==n.


* [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (23 preceding siblings ...)
  2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 24/24] docs: vfio: Add vfio device cdev description Yi Liu
@ 2023-03-14 11:02 ` Patchwork
  2023-03-24 10:39 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support (rev2) Patchwork
  25 siblings, 0 replies; 103+ messages in thread
From: Patchwork @ 2023-03-14 11:02 UTC (permalink / raw)
  To: Liu, Yi L; +Cc: intel-gfx

== Series Details ==

Series: cover-letter: Add vfio_device cdev for iommufd support
URL   : https://patchwork.freedesktop.org/series/114850/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/114850/revisions/1/mbox/ not applied
Applying: vfio: Allocate per device file structure
Applying: vfio: Refine vfio file kAPIs for KVM
Applying: vfio: Accept vfio device file in the KVM facing kAPI
Applying: kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device fd
Applying: kvm/vfio: Accept vfio device file from userspace
error: sha1 information is lacking or useless (Documentation/virt/kvm/devices/vfio.rst).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0005 kvm/vfio: Accept vfio device file from userspace
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".




* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-10 10:06       ` Tian, Kevin
@ 2023-03-15  4:40         ` Liu, Yi L
  2023-03-15  6:57           ` Tian, Kevin
  2023-03-20 14:09           ` Jason Gunthorpe
  0 siblings, 2 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-15  4:40 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Friday, March 10, 2023 6:07 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Friday, March 10, 2023 5:58 PM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Friday, March 10, 2023 5:02 PM
> > >
> > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > Sent: Wednesday, March 8, 2023 9:29 PM
> > > > +
> > > > +static int vfio_device_cdev_probe_noiommu(struct vfio_device
> *device)
> > > > +{
> > > > +	struct iommu_group *iommu_group;
> > > > +	int ret = 0;
> > > > +
> > > > +	if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > > +		return -EINVAL;
> > > > +
> > > > +	if (!capable(CAP_SYS_RAWIO))
> > > > +		return -EPERM;
> > > > +
> > > > +	iommu_group = iommu_group_get(device->dev);
> > > > +	if (!iommu_group)
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * We cannot support noiommu mode for devices that are
> > > protected
> > > > +	 * by IOMMU.  So check the iommu_group, if it is a no-iommu group
> > > > +	 * created by VFIO, we support. If not, we refuse.
> > > > +	 */
> > > > +	if
> > > (!vfio_group_find_noiommu_group_from_iommu(iommu_group))
> > > > +		ret = -EINVAL;
> > > > +	iommu_group_put(iommu_group);
> > > > +	return ret;
> > >
> > > can check whether group->name == "vfio-noiommu"?
> >
> > But VFIO names it "vfio-noiommu" for both VFIO_EMULATED_IOMMU
> > and VFIO_NO_IOMMU. And we don't support no-iommu mode for emulated
> > devices since VFIO_MAP/UNMAP, pin_page(), dma_rw() won't work in the
> > no-iommu mode.
> 
> correct.
> 
> >
> > So maybe something like below in drivers/vfio/vfio.h. It can be used
> > to replace the code from iommu_group_get() to
> > vfio_group_find_noiommu_group_from_iommu() In my patch.
> >
> > #if IS_ENABLED(CONFIG_VFIO_GROUP)
> > static inline bool vfio_device_group_allow_noiommu(struct vfio_device
> > *device)
> > {
> > 	lockdep_assert_held(&device->dev_set->lock);
> >
> > 	return device->group->type == VFIO_NO_IOMMU;
> > }
> > #else
> > static inline bool vfio_device_group_allow_noiommu(struct vfio_device
> > *device)
> > {
> > 	struct iommu_group *iommu_group;
> >
> > 	lockdep_assert_held(&device->dev_set->lock);
> >
> > 	iommu_group = iommu_group_get(device->dev);
> > 	if (iommu_group)
> > 		iommu_group_put(iommu_group);
> >
> > 	return !iommu_group;
> > }
> > #endif
> 
> this makes sense.

Just one more thought. vfio_device_is_noiommu() can already cover the
vfio_device_group_allow_noiommu() above; it just needs to be made to
work when !VFIO_GROUP. In the group code, group->type == VFIO_NO_IOMMU
implies vfio_noiommu==true, so there is no need to check it there. In
the !VFIO_GROUP case it does need to be checked. So the code is as
below. I can use vfio_device_is_noiommu() in the cdev path.

#if IS_ENABLED(CONFIG_VFIO_GROUP)
static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
{
        return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
               vdev->group->type == VFIO_NO_IOMMU;
}
#else
static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
{
        struct iommu_group *iommu_group;

        if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
                return false;

        iommu_group = iommu_group_get(vdev->dev);
        if (iommu_group)
                iommu_group_put(iommu_group);

        return !iommu_group;
}
#endif


* Re: [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated VFIO devices
  2023-03-10 23:42   ` Nicolin Chen
@ 2023-03-15  6:15     ` Liu, Yi L
  2023-03-15  6:25       ` Nicolin Chen
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-15  6:15 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, jgg, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, March 11, 2023 7:43 AM
> On Wed, Mar 08, 2023 at 05:28:58AM -0800, Yi Liu wrote:
> > this prepares for adding DETACH ioctl for emulated VFIO devices.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> > Tested-by: Terrence Xu <terrence.xu@intel.com>
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> > Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > ---
> >  drivers/gpu/drm/i915/gvt/kvmgt.c  |  1 +
> >  drivers/s390/cio/vfio_ccw_ops.c   |  1 +
> >  drivers/s390/crypto/vfio_ap_ops.c |  1 +
> >  drivers/vfio/iommufd.c            | 14 +++++++++++++-
> >  include/linux/vfio.h              |  3 +++
> >  5 files changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> > index de675d799c7d..9cd9e9da60dd 100644
> > --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> > +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> > @@ -1474,6 +1474,7 @@ static const struct vfio_device_ops
> intel_vgpu_dev_ops = {
> >         .bind_iommufd   = vfio_iommufd_emulated_bind,
> >         .unbind_iommufd = vfio_iommufd_emulated_unbind,
> >         .attach_ioas    = vfio_iommufd_emulated_attach_ioas,
> > +       .detach_ioas    = vfio_iommufd_emulated_detach_ioas,
> >  };
> >
> >  static int intel_vgpu_probe(struct mdev_device *mdev)
> > diff --git a/drivers/s390/cio/vfio_ccw_ops.c
> b/drivers/s390/cio/vfio_ccw_ops.c
> > index 5b53b94f13c7..cba4971618ff 100644
> > --- a/drivers/s390/cio/vfio_ccw_ops.c
> > +++ b/drivers/s390/cio/vfio_ccw_ops.c
> > @@ -632,6 +632,7 @@ static const struct vfio_device_ops
> vfio_ccw_dev_ops = {
> >         .bind_iommufd = vfio_iommufd_emulated_bind,
> >         .unbind_iommufd = vfio_iommufd_emulated_unbind,
> >         .attach_ioas = vfio_iommufd_emulated_attach_ioas,
> > +       .detach_ioas = vfio_iommufd_emulated_detach_ioas,
> >  };
> >
> >  struct mdev_driver vfio_ccw_mdev_driver = {
> > diff --git a/drivers/s390/crypto/vfio_ap_ops.c
> b/drivers/s390/crypto/vfio_ap_ops.c
> > index 72e10abb103a..9902e62e7a17 100644
> > --- a/drivers/s390/crypto/vfio_ap_ops.c
> > +++ b/drivers/s390/crypto/vfio_ap_ops.c
> > @@ -1844,6 +1844,7 @@ static const struct vfio_device_ops
> vfio_ap_matrix_dev_ops = {
> >         .bind_iommufd = vfio_iommufd_emulated_bind,
> >         .unbind_iommufd = vfio_iommufd_emulated_unbind,
> >         .attach_ioas = vfio_iommufd_emulated_attach_ioas,
> > +       .detach_ioas = vfio_iommufd_emulated_detach_ioas,
> >  };
> >
> >  static struct mdev_driver vfio_ap_matrix_driver = {
> > diff --git a/drivers/vfio/iommufd.c b/drivers/vfio/iommufd.c
> > index c06494e322f9..8a9457d0a33c 100644
> > --- a/drivers/vfio/iommufd.c
> > +++ b/drivers/vfio/iommufd.c
> > @@ -218,8 +218,20 @@ int vfio_iommufd_emulated_attach_ioas(struct
> vfio_device *vdev, u32 *pt_id)
> >  {
> >         lockdep_assert_held(&vdev->dev_set->lock);
> >
> > -       if (!vdev->iommufd_access)
> > +       if (WARN_ON(!vdev->iommufd_access))
> >                 return -ENOENT;
> >         return iommufd_access_set_ioas(vdev->iommufd_access, *pt_id);
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_iommufd_emulated_attach_ioas);
> > +
> > +void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev)
> > +{
> > +       lockdep_assert_held(&vdev->dev_set->lock);
> > +
> > +       if (WARN_ON(!vdev->iommufd_access))
> > +               return;
> > +
> [...]
> > +       iommufd_access_destroy(vdev->iommufd_access);
> > +       vdev->iommufd_access = NULL;
> 
> After moving access allocation/destroy to bind/unbind, here it
> should be:
> 	iommufd_access_set_ioas(vdev->iommufd_access, 0);

You are right.

Regards,
Yi Liu


* Re: [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated VFIO devices
  2023-03-15  6:15     ` Liu, Yi L
@ 2023-03-15  6:25       ` Nicolin Chen
  0 siblings, 0 replies; 103+ messages in thread
From: Nicolin Chen @ 2023-03-15  6:25 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, jgg, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, Mar 15, 2023 at 06:15:02AM +0000, Liu, Yi L wrote:

> > > +void vfio_iommufd_emulated_detach_ioas(struct vfio_device *vdev)
> > > +{
> > > +       lockdep_assert_held(&vdev->dev_set->lock);
> > > +
> > > +       if (WARN_ON(!vdev->iommufd_access))
> > > +               return;
> > > +
> > [...]
> > > +       iommufd_access_destroy(vdev->iommufd_access);
> > > +       vdev->iommufd_access = NULL;
> >
> > After moving access allocation/destroy to bind/unbind, here it
> > should be:
> >       iommufd_access_set_ioas(vdev->iommufd_access, 0);
> 
> You are right.

Yet... iommufd_access_set_ioas is getting reworked by my patch. In
another thread, Jason suggested adding an iommufd_access_detach API,
and I am trying to finalize it with Jason/Kevin.
https://lore.kernel.org/kvm/BN9PR11MB5276738DC59AC1B4A66AB3C38CBF9@BN9PR11MB5276.namprd11.prod.outlook.com/

Nic


* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-15  4:40         ` Liu, Yi L
@ 2023-03-15  6:57           ` Tian, Kevin
  2023-03-20 14:09           ` Jason Gunthorpe
  1 sibling, 0 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-15  6:57 UTC (permalink / raw)
  To: Liu, Yi L, alex.williamson, jgg
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Wednesday, March 15, 2023 12:40 PM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Friday, March 10, 2023 6:07 PM
> >
> > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > Sent: Friday, March 10, 2023 5:58 PM
> > >
> > > > From: Tian, Kevin <kevin.tian@intel.com>
> > > > Sent: Friday, March 10, 2023 5:02 PM
> > > >
> > > > > From: Liu, Yi L <yi.l.liu@intel.com>
> > > > > Sent: Wednesday, March 8, 2023 9:29 PM
> > > > > +
> > > > > +static int vfio_device_cdev_probe_noiommu(struct vfio_device
> > *device)
> > > > > +{
> > > > > +	struct iommu_group *iommu_group;
> > > > > +	int ret = 0;
> > > > > +
> > > > > +	if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU)
> || !vfio_noiommu)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	if (!capable(CAP_SYS_RAWIO))
> > > > > +		return -EPERM;
> > > > > +
> > > > > +	iommu_group = iommu_group_get(device->dev);
> > > > > +	if (!iommu_group)
> > > > > +		return 0;
> > > > > +
> > > > > +	/*
> > > > > +	 * We cannot support noiommu mode for devices that are
> > > > protected
> > > > > +	 * by IOMMU.  So check the iommu_group, if it is a no-iommu
> group
> > > > > +	 * created by VFIO, we support. If not, we refuse.
> > > > > +	 */
> > > > > +	if
> > > > (!vfio_group_find_noiommu_group_from_iommu(iommu_group))
> > > > > +		ret = -EINVAL;
> > > > > +	iommu_group_put(iommu_group);
> > > > > +	return ret;
> > > >
> > > > can check whether group->name == "vfio-noiommu"?
> > >
> > > But VFIO names it "vfio-noiommu" for both VFIO_EMULATED_IOMMU
> > > and VFIO_NO_IOMMU. And we don't support no-iommu mode for emulated
> > > devices since VFIO_MAP/UNMAP, pin_page(), dma_rw() won't work in the
> > > no-iommu mode.
> >
> > correct.
> >
> > >
> > > So maybe something like below in drivers/vfio/vfio.h. It can be used
> > > to replace the code from iommu_group_get() to
> > > vfio_group_find_noiommu_group_from_iommu() In my patch.
> > >
> > > #if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > static inline bool vfio_device_group_allow_noiommu(struct vfio_device
> > > *device)
> > > {
> > > 	lockdep_assert_held(&device->dev_set->lock);
> > >
> > > 	return device->group->type == VFIO_NO_IOMMU;
> > > }
> > > #else
> > > static inline bool vfio_device_group_allow_noiommu(struct vfio_device
> > > *device)
> > > {
> > > 	struct iommu_group *iommu_group;
> > >
> > > 	lockdep_assert_held(&device->dev_set->lock);
> > >
> > > 	iommu_group = iommu_group_get(device->dev);
> > > 	if (iommu_group)
> > > 		iommu_group_put(iommu_group);
> > >
> > > 	return !iommu_group;
> > > }
> > > #endif
> >
> > this makes sense.
> 
> Just one more thought. vfio_device_is_noiommu() can already cover the
> vfio_device_group_allow_noiommu() above; it just needs to be made to
> work when !VFIO_GROUP. In the group code, group->type == VFIO_NO_IOMMU
> implies vfio_noiommu==true, so there is no need to check it there. In
> the !VFIO_GROUP case it does need to be checked. So the code is as
> below. I can use vfio_device_is_noiommu() in the cdev path.
> 
> #if IS_ENABLED(CONFIG_VFIO_GROUP)
> static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> {
>         return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
>                vdev->group->type == VFIO_NO_IOMMU;
> }
> #else
> static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> {
>         struct iommu_group *iommu_group;
> 
>         if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
>                 return false;
> 
>         iommu_group = iommu_group_get(vdev->dev);
>         if (iommu_group)
>                 iommu_group_put(iommu_group);
> 
>         return !iommu_group;
> }
> #endif

works for me.


* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
  2023-03-10  5:31   ` Tian, Kevin
@ 2023-03-15 22:53   ` Alex Williamson
  2023-03-15 23:31     ` Tian, Kevin
  1 sibling, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-15 22:53 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, joro, nicolinc, jgg, yan.y.zhao,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed,  8 Mar 2023 05:28:51 -0800
Yi Liu <yi.l.liu@intel.com> wrote:

> This is another method to issue PCI hot reset for the users that bounds
> device to a positive iommufd value. In such case, iommufd is a proof of
> device ownership. By passing a zero-length fd array, user indicates kernel
> to do ownership check with the bound iommufd. All the opened devices within
> the affected dev_set should have been bound to the same iommufd. This is
> simpler and faster as user does not need to pass a set of fds and kernel
> no need to search the device within the given fds.

Couldn't this same idea apply to containers?

I'm afraid this proposal reduces or eliminates the handshake we have
with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
_INFO ioctl altogether, resulting in drivers that don't understand the
scope of the reset.  Is it worth it?  What do we really gain?

> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index d80141969cd1..382d95455f89 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
>   * The ownership can be proved by:
>   *   - An array of group fds
>   *   - An array of device fds
> + *   - A zero-length array
> + *
> + * In the last case all affected devices which are opened by this user
> + * must have been bound to a same iommufd_ctx.  This approach is only
> + * available for devices bound to positive iommufd.
>   *
>   * Return: 0 on success, -errno on failure.
>   */
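
For reference, the zero-length array rule described in that comment could
be sketched roughly as below. This is a userspace illustration only: the
mock_* names and the flat array standing in for the dev_set are
assumptions, not the code in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one device in the affected dev_set. */
struct mock_dev {
	bool opened;
	const void *iommufd_ctx;	/* NULL if not bound to an iommufd */
};

/*
 * Zero-length fd array case: every *opened* device in the affected set
 * must be bound to the same iommufd context as the calling device.
 * Devices that are not opened are skipped.
 */
static bool mock_hot_reset_owned(const struct mock_dev *devs, size_t n,
				 const void *caller_ctx)
{
	size_t i;

	if (!caller_ctx)
		return false;	/* the caller itself is not bound */

	for (i = 0; i < n; i++) {
		if (!devs[i].opened)
			continue;
		if (devs[i].iommufd_ctx != caller_ctx)
			return false;
	}
	return true;
}
```

Note that nothing in this check requires userspace to have looked at the
_INFO output first.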

There's no introspection that this feature is supported, is that why
containers are not considered?  ie. if the host supports vfio cdevs, it
necessarily must support vfio-pci hot reset w/ a zero-length array?
Thanks,

Alex



* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-15 22:53   ` Alex Williamson
@ 2023-03-15 23:31     ` Tian, Kevin
  2023-03-16  3:54       ` [Intel-gfx] [offlist] " Liu, Yi L
  2023-03-16 18:45       ` Alex Williamson
  0 siblings, 2 replies; 103+ messages in thread
From: Tian, Kevin @ 2023-03-15 23:31 UTC (permalink / raw)
  To: Alex Williamson, Liu, Yi L
  Cc: linux-s390, suravee.suthikulpanit, yi.y.sun, mjrosato, kvm,
	intel-gvt-dev, joro, cohuck, Hao, Xudong, peterx, Zhao, Yan Y,
	eric.auger, Xu, Terrence, nicolinc, shameerali.kolothum.thodi,
	jgg, intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, March 16, 2023 6:53 AM
> 
> On Wed,  8 Mar 2023 05:28:51 -0800
> Yi Liu <yi.l.liu@intel.com> wrote:
> 
> > This is another method to issue PCI hot reset for the users that bounds
> > device to a positive iommufd value. In such case, iommufd is a proof of
> > device ownership. By passing a zero-length fd array, user indicates kernel
> > to do ownership check with the bound iommufd. All the opened devices
> within
> > the affected dev_set should have been bound to the same iommufd. This is
> > simpler and faster as user does not need to pass a set of fds and kernel
> > no need to search the device within the given fds.
> 
> Couldn't this same idea apply to containers?

A user is allowed to create multiple containers. It looks like we don't
have a way today to check whether multiple containers belong to the same
user.

> 
> I'm afraid this proposal reduces or eliminates the handshake we have
> with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> _INFO ioctl altogether, resulting in drivers that don't understand the
> scope of the reset.  Is it worth it?  What do we really gain?

Jason raised the concern of whether GET_PCI_HOT_RESET_INFO is actually
useful today.

It's an interface on an opened device. So the only small difference is
whether the user learns that the device is resettable when calling
GET_INFO or later when actually calling PCI_HOT_RESET.

And with this series we also allow reset of affected devices which are
not opened. Such dynamic state cannot be reflected in the static
GET_INFO; a try-and-fail style is more suitable.


> 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index d80141969cd1..382d95455f89 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
> >   * The ownership can be proved by:
> >   *   - An array of group fds
> >   *   - An array of device fds
> > + *   - A zero-length array
> > + *
> > + * In the last case all affected devices which are opened by this user
> > + * must have been bound to a same iommufd_ctx.  This approach is only
> > + * available for devices bound to positive iommufd.
> >   *
> >   * Return: 0 on success, -errno on failure.
> >   */
> 
> There's no introspection that this feature is supported, is that why
> containers are not considered?  ie. if the host supports vfio cdevs, it
> necessarily must support vfio-pci hot reset w/ a zero-length array?
> Thanks,
> 

Yes. It's more for users who know that iommufd is used.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [Intel-gfx] [offlist] RE: [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-15 23:31     ` Tian, Kevin
@ 2023-03-16  3:54       ` Liu, Yi L
  2023-03-16  6:09         ` [Intel-gfx] " Tian, Kevin
  2023-03-16 18:45       ` Alex Williamson
  1 sibling, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-16  3:54 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: linux-s390, suravee.suthikulpanit, yi.y.sun, mjrosato, kvm,
	intel-gvt-dev, joro, cohuck, Hao, Xudong, peterx, Zhao, Yan Y,
	eric.auger, Xu, Terrence, nicolinc, shameerali.kolothum.thodi,
	jgg, intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, March 16, 2023 7:31 AM
> 
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, March 16, 2023 6:53 AM
> >
> > On Wed,  8 Mar 2023 05:28:51 -0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >
> > > This is another method to issue PCI hot reset for the users that bounds
> > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > to do ownership check with the bound iommufd. All the opened devices within
> > > the affected dev_set should have been bound to the same iommufd. This is
> > > simpler and faster as user does not need to pass a set of fds and kernel
> > > no need to search the device within the given fds.
> >
> > Couldn't this same idea apply to containers?
> 
> User is allowed to create multiple containers. Looks we don't have a way
> to check whether multiple containers belong to the same user today.

Hi Kevin,

This reminds me. In the compat mode, container fd is actually iommufd.
If the compat mode passes a zero-length array to do reset, it is possible
that the opened devices in this affected dev_set may be set to different
containers (a.k.a. iommufd_ctx). This would break what we defined in
uapi. So a better description is users that use cdev can use this zero-length
approach. And also, in kernel, we need to check if this approach is abused
by the compat mode.

> >
> > I'm afraid this proposal reduces or eliminates the handshake we have
> > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> > _INFO ioctl altogether, resulting in drivers that don't understand the
> > scope of the reset.  Is it worth it?  What do we really gain?
> 
> Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> useful today.
> 
> It's an interface on opened device. So the tiny difference is whether the
> user knows the device is resettable when calling GET_INFO or later when
> actually calling PCI_HOT_RESET.
> 
> and with this series we also allow reset on affected devices which are not
> opened. Such dynamic cannot be reflected in static GET_INFO. More
> suitable a try-and-fail style.

Got the usage of zero-length, 
> 
> >
> > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > index d80141969cd1..382d95455f89 100644
> > > --- a/include/uapi/linux/vfio.h
> > > +++ b/include/uapi/linux/vfio.h
> > > @@ -682,6 +682,11 @@ struct vfio_pci_hot_reset_info {
> > >   * The ownership can be proved by:
> > >   *   - An array of group fds
> > >   *   - An array of device fds
> > > + *   - A zero-length array
> > > + *
> > > + * In the last case all affected devices which are opened by this user
> > > + * must have been bound to a same iommufd_ctx.  This approach is only
> > > + * available for devices bound to positive iommufd.
> > >   *
> > >   * Return: 0 on success, -errno on failure.
> > >   */
> >
> > There's no introspection that this feature is supported, is that why
> > containers are not considered?  ie. if the host supports vfio cdevs, it
> > necessarily must support vfio-pci hot reset w/ a zero-length array?
> > Thanks,
> >
> 
> Yes. It's more for users who know that iommufd is used.

This needs to be more accurate: only users that use cdev.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16  3:54       ` [Intel-gfx] [offlist] " Liu, Yi L
@ 2023-03-16  6:09         ` Tian, Kevin
  2023-03-16  6:28           ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-16  6:09 UTC (permalink / raw)
  To: Liu, Yi L, Alex Williamson
  Cc: linux-s390, suravee.suthikulpanit, yi.y.sun, mjrosato, kvm,
	intel-gvt-dev, joro, cohuck, Hao, Xudong, peterx, Zhao, Yan Y,
	eric.auger, Xu, Terrence, nicolinc, shameerali.kolothum.thodi,
	jgg, intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Thursday, March 16, 2023 11:55 AM
> 
> > From: Tian, Kevin <kevin.tian@intel.com>
> > Sent: Thursday, March 16, 2023 7:31 AM
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, March 16, 2023 6:53 AM
> > >
> > > On Wed,  8 Mar 2023 05:28:51 -0800
> > > Yi Liu <yi.l.liu@intel.com> wrote:
> > >
> > > > This is another method to issue PCI hot reset for the users that bounds
> > > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > > to do ownership check with the bound iommufd. All the opened devices within
> > > > the affected dev_set should have been bound to the same iommufd. This is
> > > > simpler and faster as user does not need to pass a set of fds and kernel
> > > > no need to search the device within the given fds.
> > >
> > > Couldn't this same idea apply to containers?
> >
> > User is allowed to create multiple containers. Looks we don't have a way
> > to check whether multiple containers belong to the same user today.
> 
> Hi Kevin,
> 
> This reminds me. In the compat mode, container fd is actually iommufd.
> If the compat mode passes a zero-length array to do reset, it is possible
> that the opened devices in this affected dev_set may be set to different
> containers (a.k.a. iommufd_ctx). This would break what we defined in
> uapi. So a better description is users that use cdev can use this zero-length
> approach. And also, in kernel, we need to check if this approach is abused
> by the compat mode.
> 

In the normal case a legacy application uses the group fd array and a new
application with cdev uses the zero-length approach.

In the rare case an application which opens /dev/iommu but opts to use it
as a container in compat mode can also use a zero-length array to reset
if all devices are attached to a single container (internally to the same
iommufd_ctx). That still roughly matches the uAPI description.

I'm not sure whether we want to add explicit check to prevent it.

Of course if affected devices span multiple compat iommufd's then
it will fail.
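
As a rough illustration of the rule described above (not the kernel
implementation), the ownership check for a zero-length reset can be modelled
like this. The struct, the function name and the use of -1 for "not bound"
are all assumptions made for the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one device in the affected dev_set. */
struct sketch_dev {
	bool opened;      /* device is currently opened by some user */
	int  iommufd_ctx; /* id of the bound iommufd_ctx, -1 if none (assumed) */
};

/*
 * Model of the zero-length ownership check: every opened device in the
 * affected set must be bound to the caller's iommufd_ctx. Devices that
 * are not opened are skipped, since they have no owner to conflict with.
 */
static bool sketch_zero_len_reset_allowed(const struct sketch_dev *devs,
					  size_t n, int caller_ctx)
{
	assert(devs != NULL || n == 0);

	if (caller_ctx < 0)
		return false; /* caller is not bound to any iommufd */

	for (size_t i = 0; i < n; i++) {
		if (!devs[i].opened)
			continue;
		if (devs[i].iommufd_ctx != caller_ctx)
			return false; /* spans another iommufd_ctx */
	}
	return true;
}
```

With this model, a compat-mode user whose devices all sit in one container
passes, while a dev_set spanning two compat iommufds is rejected.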

The open Alex raised is whether we want to further extend it to
legacy container if all affected devices are in one container. But
I hesitate to do so since iommufd is the future and if an application
can be rewritten to utilize zero-length reset then it probably
should explicitly embrace iommufd instead.

Anyway let's not wait here. Send your v7 and we can have more
focused discussion in your split series about hot reset.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16  6:09         ` [Intel-gfx] " Tian, Kevin
@ 2023-03-16  6:28           ` Liu, Yi L
  2023-03-16  6:49             ` Nicolin Chen
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-16  6:28 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: linux-s390, suravee.suthikulpanit, yi.y.sun, mjrosato, kvm,
	intel-gvt-dev, joro, cohuck, Hao, Xudong, peterx, Zhao, Yan Y,
	eric.auger, Xu, Terrence, nicolinc, shameerali.kolothum.thodi,
	jgg, intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Thursday, March 16, 2023 2:10 PM
> 
> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Thursday, March 16, 2023 11:55 AM
> >
> > > From: Tian, Kevin <kevin.tian@intel.com>
> > > Sent: Thursday, March 16, 2023 7:31 AM
> > >
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > >
> > > > On Wed,  8 Mar 2023 05:28:51 -0800
> > > > Yi Liu <yi.l.liu@intel.com> wrote:
> > > >
> > > > > This is another method to issue PCI hot reset for the users that bounds
> > > > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > > > to do ownership check with the bound iommufd. All the opened devices within
> > > > > the affected dev_set should have been bound to the same iommufd. This is
> > > > > simpler and faster as user does not need to pass a set of fds and kernel
> > > > > no need to search the device within the given fds.
> > > >
> > > > Couldn't this same idea apply to containers?
> > >
> > > User is allowed to create multiple containers. Looks we don't have a way
> > > to check whether multiple containers belong to the same user today.
> >
> > Hi Kevin,
> >
> > This reminds me. In the compat mode, container fd is actually iommufd.
> > If the compat mode passes a zero-length array to do reset, it is possible
> > that the opened devices in this affected dev_set may be set to different
> > containers (a.k.a. iommufd_ctx). This would break what we defined in
> > uapi. So a better description is users that use cdev can use this zero-length
> > approach. And also, in kernel, we need to check if this approach is abused
> > by the compat mode.
> >
> 
> In normal case legacy application uses group fd array and new application
> with cdev uses zero-length approach.
> 
> In rare case an application which opens /dev/iommu but opts to use it
> as a container in compat mode can also use zero-length array to reset
> if all devices are attached to a single container (internally to a same
> iommufd_ctx). It's still kind of matching uAPI description.
>
> I'm not sure whether we want to add explicit check to prevent it.
>
> Of course if affected devices span multiple compat iommufd's then
> it will fail.

Yes, this failure matches the uapi description. And it is a rare case;
most likely applications should be explicitly updated to use cdev
and then use this zero-length approach. Before that, legacy
applications do not know about it at all. Even if one uses this approach
by mistake, it will fail in the multiple compat iommufd case.

So maybe no need to limit it.

> The open Alex raised is whether we want to further extend it to
> legacy container if all affected devices are in one container. But
> I hesitate to do so since iommufd is the future and if an application
> can be rewritten to utilize zero-length reset then it probably
> should explicitly embrace iommufd instead.

For this, I agree with you.

> 
> Anyway let's not wait here. Send your v7 and we can have more
> focused discussion in your split series about hot reset.

Sure. Once Nicolin's patch is updated, I can send v7 with the hot
reset series as well.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16  6:28           ` Liu, Yi L
@ 2023-03-16  6:49             ` Nicolin Chen
  2023-03-16 13:22               ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Nicolin Chen @ 2023-03-16  6:49 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, jgg, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Thu, Mar 16, 2023 at 06:28:28AM +0000, Liu, Yi L wrote:

> > Anyway let's not wait here. Send your v7 and we can have more
> > focused discussion in your split series about hot reset.
> 
> Sure. Once Nicolin's patch is updated, I can send v7 with the hot
> reset series as well.

I've updated three commits and pushed here:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-03152023

Please pull the following commit to the emulated series:
  "iommufd: Create access in vfio_iommufd_emulated_bind()"
  https://github.com/nicolinc/iommufd/commit/6467e332584de62d1c4d5daab404a8c8d5a90a2d

Please pull the following commit to the cdev series or a place
that you feel it'd be better -- it's required by the change of
adding vfio_iommufd_emulated_detach_ioas():
  "iommufd/device: Add iommufd_access_detach() API"
  https://github.com/nicolinc/iommufd/commit/86346b5d06100640037cbb4a14bd249476072dec

The other one adding replace() will go with the replace series.

And regarding the new baseline for the replace series and the
nesting series, it'd be nicer to have another one git-merging
your cdev v7 branch on top of Jason's iommufd_hwpt branch. We
could wait for him updating to 6.3-rc2, if that's necessary.

Thanks
Nic

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16  6:49             ` Nicolin Chen
@ 2023-03-16 13:22               ` Liu, Yi L
  2023-03-16 21:27                 ` Nicolin Chen
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-16 13:22 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, jgg, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 16, 2023 2:49 PM
> 
> On Thu, Mar 16, 2023 at 06:28:28AM +0000, Liu, Yi L wrote:
> 
> > > Anyway let's not wait here. Send your v7 and we can have more
> > > focused discussion in your split series about hot reset.
> >
> > Sure. Once Nicolin's patch is updated, I can send v7 with the hot
> > reset series as well.
> 
> I've updated three commits and pushed here:
> https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-03152023
> 
> Please pull the following commit to the emulated series:
>   "iommufd: Create access in vfio_iommufd_emulated_bind()"
> 
> https://github.com/nicolinc/iommufd/commit/6467e332584de62d1c4d5daab404a8c8d5a90a2d
> 
> Please pull the following commit to the cdev series or a place
> that you feel it'd be better -- it's required by the change of
> adding vfio_iommufd_emulated_detach_ioas():
>   "iommufd/device: Add iommufd_access_detach() API"
> 
> https://github.com/nicolinc/iommufd/commit/86346b5d06100640037cbb4a14bd249476072dec

Thanks, I've taken them. v7 was sent out.

> The other one adding replace() will go with the replace series.
> 
> And regarding the new baseline for the replace series and the
> nesting series, it'd be nicer to have another one git-merging
> your cdev v7 branch on top of Jason's iommufd_hwpt branch. We
> could wait for him updating to 6.3-rc2, if that's necessary.

Yes. I cherry-picked his iommufd_hwpt onto 6.3-rc2, then tried a
merge, and then cherry-picked the replace and nesting series from
your above branch. Though the order between cdev and
iommufd_hwpt is not perfect, we may use it as a wip baseline
when we try to address the comments w.r.t. the nesting and
replace series.

https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-03162023

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-15 23:31     ` Tian, Kevin
  2023-03-16  3:54       ` [Intel-gfx] [offlist] " Liu, Yi L
@ 2023-03-16 18:45       ` Alex Williamson
  2023-03-16 23:29         ` Tian, Kevin
  1 sibling, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-16 18:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Wed, 15 Mar 2023 23:31:23 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, March 16, 2023 6:53 AM
> > 
> > On Wed,  8 Mar 2023 05:28:51 -0800
> > Yi Liu <yi.l.liu@intel.com> wrote:
> >   
> > > This is another method to issue PCI hot reset for the users that bounds
> > > device to a positive iommufd value. In such case, iommufd is a proof of
> > > device ownership. By passing a zero-length fd array, user indicates kernel
> > > to do ownership check with the bound iommufd. All the opened devices within
> > > the affected dev_set should have been bound to the same iommufd. This is
> > > simpler and faster as user does not need to pass a set of fds and kernel
> > > no need to search the device within the given fds.  
> > 
> > Couldn't this same idea apply to containers?  
> 
> User is allowed to create multiple containers. Looks we don't have a way
> to check whether multiple containers belong to the same user today.

No, but a common configuration is that all devices are in the same
container, ie. no-vIOMMU, and it's also rather common that when we have
multi-function devices, all functions are within the same IOMMU group
and therefore necessarily require the same address space and therefore
container.

> > I'm afraid this proposal reduces or eliminates the handshake we have
> > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> > _INFO ioctl altogether, resulting in drivers that don't understand the
> > scope of the reset.  Is it worth it?  What do we really gain?  
> 
> Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> useful today.
> 
> It's an interface on opened device. So the tiny difference is whether the
> user knows the device is resettable when calling GET_INFO or later when
> actually calling PCI_HOT_RESET.

No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can
be performed, but equally important the scope of the reset, ie. which
devices are affected by the reset.  If we de-emphasize the INFO
portion, then this easily gets confused as just a variant of
VFIO_DEVICE_RESET, which is explicitly a device-level scope reset.  In
fact, I'd say the interface is not only trying to validate that the
user has sufficient privileges for the reset, but that they explicitly
acknowledge the scope of the reset.

> and with this series we also allow reset on affected devices which are not
> opened. Such dynamic cannot be reflected in static GET_INFO. More
> suitable a try-and-fail style.

Resets have side-effects, obviously, so this isn't the sort of thing we
can simply ask the user to probe for.  I agree that dynamics can
change, the GET_PCI_HOT_RESET_INFO is a point in time, isolated
functions on the same bus can change ownership.  However, in practice,
and in its primary use case with GPUs without isolation, it's
sufficiently static.  So I think this is a mischaracterization.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16 13:22               ` Liu, Yi L
@ 2023-03-16 21:27                 ` Nicolin Chen
  0 siblings, 0 replies; 103+ messages in thread
From: Nicolin Chen @ 2023-03-16 21:27 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, jgg, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Thu, Mar 16, 2023 at 01:22:58PM +0000, Liu, Yi L wrote:

> > And regarding the new baseline for the replace series and the
> > nesting series, it'd be nicer to have another one git-merging
> > your cdev v7 branch on top of Jason's iommufd_hwpt branch. We
> > could wait for him updating to 6.3-rc2, if that's necessary.
> 
> Yes. I cherry-pick his iommufd_hwpt to 6.3-rc2 and then try a
> merge and then cherry-pick the replace and nesting series from
> your above branch. Though the order between cdev and
> iommufd_hwpt not perfect, we may use it as a wip baseline
> when we try to address the comments w.r.t. nesting and
> replace series.
> 
> https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting-03162023

Nice. It looks like you integrated everything from my tree so
it saves me some effort :)

Regarding the order between cdev and iommufd_hwpt, I think it
would be Jason's decision whether to merge his changes prior
to the PR from the VFIO tree, or the other way around.

One way or another, I think the replace v5 and the nesting v2
will be less impacted, unless Jason makes some huge changes
to his branch. Let's use this tree this week to rework both
series, and rebase after he comes back and updates his tree.

Let me know if you need help with the nesting series or so.

Thanks
Nic

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16 18:45       ` Alex Williamson
@ 2023-03-16 23:29         ` Tian, Kevin
  2023-03-17  0:22           ` Alex Williamson
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-16 23:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, jasowang, Hao, Xudong, peterx, Xu, Terrence, chao.p.peng,
	linux-s390, Liu, Yi L, mjrosato, lulu, joro, nicolinc, jgg, Zhao,
	Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Alex Williamson
> Sent: Friday, March 17, 2023 2:46 AM
> 
> On Wed, 15 Mar 2023 23:31:23 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, March 16, 2023 6:53 AM
> > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > scope of the reset.  Is it worth it?  What do we really gain?
> >
> > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > useful today.
> >
> > It's an interface on opened device. So the tiny difference is whether the
> > user knows the device is resettable when calling GET_INFO or later when
> > actually calling PCI_HOT_RESET.
> 
> No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can
> be performed, but equally important the scope of the reset, ie. which
> devices are affected by the reset.  If we de-emphasize the INFO
> portion, then this easily gets confused as just a variant of
> VFIO_DEVICE_RESET, which is explicitly a device-level scope reset.  In
> fact, I'd say the interface is not only trying to validate that the
> user has sufficient privileges for the reset, but that they explicitly
> acknowledge the scope of the reset.
> 

IMHO the usefulness of scope is if it's discoverable by the management
stack which then can try to assign devices with affected reset to a same
user.

But this info is only available after the device is opened. So the mgmt.
stack just assigns devices w/o awareness of reset scope, and nothing
can be changed by the user whether or not it knows the scope.

From this angle I don't see the value of probe-and-reset vs. direct reset
when reset itself also takes the scope into consideration. Probably the
slight difference is that with probe a more informative error message
can be printed out?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-16 23:29         ` Tian, Kevin
@ 2023-03-17  0:22           ` Alex Williamson
  2023-03-17  0:57             ` Tian, Kevin
  0 siblings, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-17  0:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: kvm, jasowang, Hao, Xudong, peterx, Xu, Terrence, chao.p.peng,
	linux-s390, Liu, Yi L, mjrosato, lulu, joro, nicolinc, jgg, Zhao,
	Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Thu, 16 Mar 2023 23:29:21 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Friday, March 17, 2023 2:46 AM
> > 
> > On Wed, 15 Mar 2023 23:31:23 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> > > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > > scope of the reset.  Is it worth it?  What do we really gain?  
> > >
> > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > > useful today.
> > >
> > > It's an interface on opened device. So the tiny difference is whether the
> > > user knows the device is resettable when calling GET_INFO or later when
> > > actually calling PCI_HOT_RESET.  
> > 
> > No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can
> > be performed, but equally important the scope of the reset, ie. which
> > devices are affected by the reset.  If we de-emphasize the INFO
> > portion, then this easily gets confused as just a variant of
> > VFIO_DEVICE_RESET, which is explicitly a device-level scope reset.  In
> > fact, I'd say the interface is not only trying to validate that the
> > user has sufficient privileges for the reset, but that they explicitly
> > acknowledge the scope of the reset.
> >   
> 
> IMHO the usefulness of scope is if it's discoverable by the management
> stack which then can try to assign devices with affected reset to a same
> user.

Disagree, the user needs to know the scope of reset.  Take for instance
two functions of a device configured onto separate buses within a VM.
The VMM needs to know that a hot-reset of one will reset the other.
That's not obvious to the VMM without some understanding of PCI/e
topology and analysis of the host system.  The info ioctl simplifies
that discovery for the VMM and the handshake of passing the affected
groups makes sure that the info ioctl remains relevant.
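
A minimal sketch of that handshake from the VMM side, assuming the VMM has
already retrieved the dependent-device list via the INFO ioctl. The types
and helper below are hypothetical stand-ins (the field names loosely mirror
struct vfio_pci_dependent_device); no ioctl is issued here, only the step
of turning the reported scope into the group fd array for the reset:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one entry reported by the INFO ioctl. */
struct sketch_dependent_dev {
	uint32_t group_id;
	uint16_t segment;
	uint8_t  bus;
	uint8_t  devfn;
};

/* Hypothetical record of an IOMMU group the VMM already holds open. */
struct sketch_owned_group {
	uint32_t group_id;
	int      fd;
};

/*
 * Walk the reported reset scope and collect one fd per affected group.
 * Returns the number of fds written to out, or -1 if an affected group
 * is not owned by the caller or out is too small. Failing here is the
 * point of the handshake: the caller cannot build the array without
 * having looked at every device the reset will touch.
 */
static int sketch_collect_group_fds(const struct sketch_dependent_dev *deps,
				    size_t ndeps,
				    const struct sketch_owned_group *owned,
				    size_t nowned, int *out, size_t out_cap)
{
	size_t nout = 0;

	assert(out != NULL || out_cap == 0);

	for (size_t i = 0; i < ndeps; i++) {
		int fd = -1;
		bool seen = false;

		for (size_t j = 0; j < nowned; j++) {
			if (owned[j].group_id == deps[i].group_id) {
				fd = owned[j].fd;
				break;
			}
		}
		if (fd < 0)
			return -1; /* affected group not owned by caller */

		for (size_t k = 0; k < nout; k++) {
			if (out[k] == fd) {
				seen = true;
				break;
			}
		}
		if (seen)
			continue; /* group already listed once */
		if (nout == out_cap)
			return -1;
		out[nout++] = fd;
	}
	return (int)nout;
}
```

Two functions in the same group collapse to one fd, and a device in a group
the VMM never opened makes the whole collection fail before any reset is
attempted.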

OTOH, I really haven't seen any evidence that the null-array concept
provides significant simplification for userspace, especially without
compromising the user's understanding of the scope of the provided
reset.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-17  0:22           ` Alex Williamson
@ 2023-03-17  0:57             ` Tian, Kevin
  2023-03-17 15:15               ` Alex Williamson
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-17  0:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

> From: Alex Williamson
> Sent: Friday, March 17, 2023 8:23 AM
> 
> On Thu, 16 Mar 2023 23:29:21 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson
> > > Sent: Friday, March 17, 2023 2:46 AM
> > >
> > > On Wed, 15 Mar 2023 23:31:23 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore the
> > > > > _INFO ioctl altogether, resulting in drivers that don't understand the
> > > > > scope of the reset.  Is it worth it?  What do we really gain?
> > > >
> > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > > > useful today.
> > > >
> > > > It's an interface on opened device. So the tiny difference is whether the
> > > > user knows the device is resettable when calling GET_INFO or later when
> > > > actually calling PCI_HOT_RESET.
> > >
> > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can
> > > be performed, but equally important the scope of the reset, ie. which
> > > devices are affected by the reset.  If we de-emphasize the INFO
> > > portion, then this easily gets confused as just a variant of
> > > VFIO_DEVICE_RESET, which is explicitly a device-level scope reset.  In
> > > fact, I'd say the interface is not only trying to validate that the
> > > user has sufficient privileges for the reset, but that they explicitly
> > > acknowledge the scope of the reset.
> > >
> >
> > IMHO the usefulness of scope is if it's discoverable by the management
> > stack which then can try to assign devices with affected reset to a same
> > user.
> 
> Disagree, the user needs to know the scope of reset.  Take for instance
> two functions of a device configured onto separate buses within a VM.
> The VMM needs to know that a hot-reset of one will reset the other.
> That's not obvious to the VMM without some understanding of PCI/e
> topology and analysis of the host system.  The info ioctl simplifies
> that discovery for the VMM and the handshake of passing the affected
> groups makes sure that the info ioctl remains relevant.

If that is the intended usage then I don't see why this proposal will
promote userspace to ignore the _INFO ioctl. It should be always
queried no matter how the reset ioctl itself is designed. The motivation
of calling _INFO is not from the reset ioctl asking for an array of fds.

> 
> OTOH, I really haven't seen any evidence that the null-array concept
> provides significant simplification for userspace, especially without
> compromising the user's understanding of the scope of the provided
> reset.  Thanks,
> 

I'll let Jason comment further after he is back.

The bottom line, if this cannot be converged in short time, is to move
it out of the preparatory series for cdev. There is no reason to block
cdev just for this open. Anyway we'll allow using device fd array
for cdev so there is no function gap. 😊

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-17  0:57             ` Tian, Kevin
@ 2023-03-17 15:15               ` Alex Williamson
  2023-03-20 17:14                 ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-17 15:15 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	jgg, Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Fri, 17 Mar 2023 00:57:23 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Friday, March 17, 2023 8:23 AM
> > 
> > On Thu, 16 Mar 2023 23:29:21 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson
> > > > Sent: Friday, March 17, 2023 2:46 AM
> > > >
> > > > On Wed, 15 Mar 2023 23:31:23 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Alex Williamson <alex.williamson@redhat.com>
> > > > > > Sent: Thursday, March 16, 2023 6:53 AM
> > > > > > I'm afraid this proposal reduces or eliminates the handshake we have
> > > > > > with userspace between VFIO_DEVICE_GET_PCI_HOT_RESET_INFO and
> > > > > > VFIO_DEVICE_PCI_HOT_RESET, which could promote userspace to ignore
> > > > > > the _INFO ioctl altogether, resulting in drivers that don't understand
> > > > > > the scope of the reset.  Is it worth it?  What do we really gain?
> > > > >
> > > > > Jason raised the concern whether GET_PCI_HOT_RESET_INFO is actually
> > > > > useful today.
> > > > >
> > > > > It's an interface on opened device. So the tiny difference is whether the
> > > > > user knows the device is resettable when calling GET_INFO or later when
> > > > > actually calling PCI_HOT_RESET.
> > > >
> > > > No, GET_PCI_HOT_RESET_INFO conveys not only whether a PCI_HOT_RESET can
> > > > be performed, but equally important the scope of the reset, ie. which
> > > > devices are affected by the reset.  If we de-emphasize the INFO
> > > > portion, then this easily gets confused as just a variant of
> > > > VFIO_DEVICE_RESET, which is explicitly a device-level scope reset.  In
> > > > fact, I'd say the interface is not only trying to validate that the
> > > > user has sufficient privileges for the reset, but that they explicitly
> > > > acknowledge the scope of the reset.
> > > >  
> > >
> > > IMHO the usefulness of scope is if it's discoverable by the management
> > > stack which then can try to assign devices with affected reset to a same
> > > user.  
> > 
> > Disagree, the user needs to know the scope of reset.  Take for instance
> > > two functions of a device configured onto separate buses within a VM.
> > The VMM needs to know that a hot-reset of one will reset the other.
> > That's not obvious to the VMM without some understanding of PCI/e
> > topology and analysis of the host system.  The info ioctl simplifies
> > that discovery for the VMM and the handshake of passing the affected
> > groups makes sure that the info ioctl remains relevant.  
> 
> If that is the intended usage then I don't see why this proposal will
> promote userspace to ignore the _INFO ioctl. It should be always
> queried no matter how the reset ioctl itself is designed. The motivation
> of calling _INFO is not from the reset ioctl asking for an array of fds.

The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev)
fds that encompass the set of affected devices reported by the
VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the
last sentence above.

This proposal seems to be based on some confusion about the difference
between VFIO_DEVICE_RESET and VFIO_DEVICE_PCI_HOT_RESET, and therefore
IMO, proliferates that confusion by making the scope argument optional,
which I see as a key difference.  This therefore makes the behavior of
the ioctl less intuitive, easier to get wrong, and I expect we'll see
users unintentionally resetting devices beyond the desired scope if the
group/device fd array is allowed to be empty.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-15  4:40         ` Liu, Yi L
  2023-03-15  6:57           ` Tian, Kevin
@ 2023-03-20 14:09           ` Jason Gunthorpe
  2023-03-20 14:31             ` Yi Liu
  1 sibling, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-20 14:09 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:

> # if IS_ENABLED(CONFIG_VFIO_GROUP)
> static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> {
>         return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
>                vdev->group->type == VFIO_NO_IOMMU;
> }
> #else
> static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> {
>         struct iommu_group *iommu_group;
> 
>         if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
>                 return -EINVAL;
> 
>         iommu_group = iommu_group_get(vdev->dev);
>         if (iommu_group)
>                 iommu_group_put(iommu_group);
>
>         return !iommu_group;

If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
iommu_ctx pointer in the vdev, don't mess with groups

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-20 14:09           ` Jason Gunthorpe
@ 2023-03-20 14:31             ` Yi Liu
  2023-03-20 17:16               ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Yi Liu @ 2023-03-20 14:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On 2023/3/20 22:09, Jason Gunthorpe wrote:
> On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> 
>> # if IS_ENABLED(CONFIG_VFIO_GROUP)
>> static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
>> {
>>          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
>>                 vdev->group->type == VFIO_NO_IOMMU;
>> }
>> #else
>> static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
>> {
>>          struct iommu_group *iommu_group;
>>
>>          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
>>                  return -EINVAL;
>>
>>          iommu_group = iommu_group_get(vdev->dev);
>>          if (iommu_group)
>>                  iommu_group_put(iommu_group);
>>
>>          return !iommu_group;
> 
> If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> iommu_ctx pointer in the vdev, don't mess with groups

yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
vfio_device_open(). But here, we want to use this helper to check if
user can use noiommu mode. This is before calling vfio_device_open().
e.g. if the device is protected by iommu, then user cannot use noiommu
mode on it.

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-17 15:15               ` Alex Williamson
@ 2023-03-20 17:14                 ` Jason Gunthorpe
  2023-03-20 22:52                   ` Alex Williamson
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-20 17:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Fri, Mar 17, 2023 at 09:15:57AM -0600, Alex Williamson wrote:
> > If that is the intended usage then I don't see why this proposal will
> > promote userspace to ignore the _INFO ioctl. It should be always
> > queried no matter how the reset ioctl itself is designed. The motivation
> > of calling _INFO is not from the reset ioctl asking for an array of fds.
> 
> The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev)
> fds that encompass the set of affected devices reported by the
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the
> last sentence above.

There are two things going on - VFIO_DEVICE_PCI_HOT_RESET requires to
prove security that the userspace is not attempting to reset something
that it does not have ownership over. Eg a reset group that spans
multiple iommu groups.

The second is for userspace to discover the reset group so it can
understand what is happening.

IMHO it is perfectly fine for each API to be only concerned with its
own purpose.

VFIO_DEVICE_PCI_HOT_RESET needs to check security, which the
iommufd_ctx check does just fine

VFIO_DEVICE_GET_PCI_HOT_RESET_INFO needs to convey the reset group
span so userspace can do something with this.

I think confusing security and scope and "acknowledgment" is not a
good idea.

The APIs are well defined and userspace can always use them wrong. It
doesn't need to call RESET_INFO even today, it can just trivially pass
every group FD it owns to meet the security check.

It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
check without code marshalling fds, which is why we went this
direction.

Jason
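
The ownership check described above can be pictured with a small userspace model. This is only a sketch, not the kernel code from this series: `struct model_dev`, `ctx_id` and `model_ctx_reset_check()` are hypothetical names, an integer stands in for the iommufd_ctx pointer, and the error values are placeholders. It assumes every device affected by the reset must be bound to the same context as the calling device.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a vfio device; ctx_id < 0 means "not bound". */
struct model_dev {
	int ctx_id;
};

/*
 * Allow the reset only if the caller is bound and every affected device
 * is bound to the same context as the caller. No fd array is needed.
 */
int model_ctx_reset_check(const struct model_dev *caller,
			  const struct model_dev *affected, size_t n_affected)
{
	if (caller->ctx_id < 0)
		return -EINVAL;	/* caller never bound to a context */

	for (size_t i = 0; i < n_affected; i++)
		if (affected[i].ctx_id != caller->ctx_id)
			return -EPERM;	/* reset would touch a foreign device */

	return 0;
}
```

In this model the security proof comes entirely from what the caller has already bound, which is why no fds need to be marshalled at reset time.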

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-20 14:31             ` Yi Liu
@ 2023-03-20 17:16               ` Jason Gunthorpe
  2023-03-21  1:30                 ` Tian, Kevin
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-20 17:16 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Mon, Mar 20, 2023 at 10:31:53PM +0800, Yi Liu wrote:
> On 2023/3/20 22:09, Jason Gunthorpe wrote:
> > On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> > 
> > > # if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > {
> > >          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > >                 vdev->group->type == VFIO_NO_IOMMU;
> > > }
> > > #else
> > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > {
> > >          struct iommu_group *iommu_group;
> > > 
> > >          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > >                  return -EINVAL;
> > > 
> > >          iommu_group = iommu_group_get(vdev->dev);
> > >          if (iommu_group)
> > >                  iommu_group_put(iommu_group);
> > > 
> > >          return !iommu_group;
> > 
> > If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> > iommu_ctx pointer in the vdev, don't mess with groups
> 
> yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
> vfio_device_open(). But here, we want to use this helper to check if
> user can use noiommu mode. This is before calling vfio_device_open().
> e.g. if the device is protected by iommu, then user cannot use noiommu
> mode on it.

Why not allow it?

If the admin has enabled this mode we may as well let it be used.

You explicitly ask for no-iommu mode by passing -1 for the iommufd
parameter. If the module parameter says it is allowed then that is all
you need.

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-20 17:14                 ` Jason Gunthorpe
@ 2023-03-20 22:52                   ` Alex Williamson
  2023-03-20 23:39                     ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-20 22:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao,  Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Mon, 20 Mar 2023 14:14:48 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 17, 2023 at 09:15:57AM -0600, Alex Williamson wrote:
> > > If that is the intended usage then I don't see why this proposal will
> > > promote userspace to ignore the _INFO ioctl. It should be always
> > > queried no matter how the reset ioctl itself is designed. The motivation
> > > of calling _INFO is not from the reset ioctl asking for an array of fds.  
> > 
> > The VFIO_DEVICE_PCI_HOT_RESET ioctl requires a set of group (or cdev)
> > fds that encompass the set of affected devices reported by the
> > VFIO_DEVICE_GET_PCI_HOT_RESET_INFO ioctl, so I don't agree with the
> > last sentence above.  
> 
> There are two things going on - VFIO_DEVICE_PCI_HOT_RESET requires to
> prove security that the userspace is not attempting to reset something
> that it does not have ownership over. Eg a reset group that spans
> multiple iommu groups.
> 
> The second is for userspace to discover the reset group so it can
> understand what is happening.
> 
> IMHO it is perfectly fine for each API to be only concerned with its
> own purpose.
> 
> VFIO_DEVICE_PCI_HOT_RESET needs to check security, which the
> iommufd_ctx check does just fine
> 
> VFIO_DEVICE_GET_PCI_HOT_RESET_INFO needs to convey the reset group
> span so userspace can do something with this.
> 
> I think confusing security and scope and "acknowledgment" is not a
> good idea.
> 
> The APIs are well defined and userspace can always use them wrong. It
> doesn't need to call RESET_INFO even today, it can just trivially pass
> every group FD it owns to meet the security check.

That's not actually true, in order to avoid arbitrarily large buffers
from the user, the ioctl won't accept an array greater than the number
of devices affected by the reset.

> It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
> check without code marshalling fds, which is why we went this
> direction.

I agree that nullifying the arg makes the ioctl easier to use, but my
hesitation is whether it makes it more difficult to use correctly,
which includes resetting devices unexpectedly.

We're talking about something that's a relatively rare event, so I
don't see that time overhead is a factor, nor has the complexity
overhead in the QEMU implementation ever been raised as an issue
previously.

We can always blame the developer for using an interface incorrectly,
but if we make it easier to use incorrectly in order to optimize
something that doesn't need to be optimized, does that make it a good
choice for the uAPI?  Thanks,

Alex
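
As a rough illustration of the two constraints mentioned in this message (the array may not exceed the number of affected devices, and the provided groups must cover every affected device), here is a userspace model. It is a sketch under assumptions, not the vfio-pci implementation: `model_hot_reset_check()` and `model_id_in_set()` are hypothetical, integers stand in for group fds, and the specific error values are placeholders.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if id appears in set[0..n). */
bool model_id_in_set(int id, const int *set, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == id)
			return true;
	return false;
}

/*
 * affected[i] is the group of the i-th device hit by the reset (groups may
 * repeat); provided[] is the group array passed by the user.
 */
int model_hot_reset_check(const int *affected, size_t n_affected,
			  const int *provided, size_t n_provided)
{
	/* Reject empty arrays and arrays larger than the affected device count. */
	if (n_provided == 0 || n_provided > n_affected)
		return -EINVAL;

	/* Every affected device must belong to a group the user provided. */
	for (size_t i = 0; i < n_affected; i++)
		if (!model_id_in_set(affected[i], provided, n_provided))
			return -EPERM;

	return 0;
}
```

With the size cap in place, a caller cannot simply pass every group it owns: it has to learn how many devices are affected, which is what the _INFO ioctl reports.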


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-20 22:52                   ` Alex Williamson
@ 2023-03-20 23:39                     ` Jason Gunthorpe
  2023-03-21 20:31                       ` Alex Williamson
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-20 23:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Mon, Mar 20, 2023 at 04:52:17PM -0600, Alex Williamson wrote:

> > The APIs are well defined and userspace can always use them wrong. It
> > doesn't need to call RESET_INFO even today, it can just trivially pass
> > every group FD it owns to meet the security check.
> 
> That's not actually true, in order to avoid arbitrarily large buffers
> from the user, the ioctl won't accept an array greater than the number
> of devices affected by the reset.

Oh yuk!

> > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
> > check without code marshalling fds, which is why we went this
> > direction.
> 
> I agree that nullifying the arg makes the ioctl easier to use, but my
> hesitation is whether it makes it more difficult to use correctly,
> which includes resetting devices unexpectedly.

I don't think it makes it harder to use correctly. It maybe makes it
easier to misuse, but IMHO not too much.

If the desire was to have an API that explicitly acknowledged the
reset scope then it should have taken in a list of device FDs and
optimally reset all of them or fail EPERM.

What is going to make this hard to use is the _INFO IOCTL, it returns
basically the BDF string, but I think we effectively get rid of this
in the new model. libvirt will know the BDF and open the cdev, then fd
pass the cdev to qemu. Qemu shouldn't also have to know the sysfs
path..

So we really want a new _INFO ioctl to make this easier to use..

> We can always blame the developer for using an interface incorrectly,
> but if we make it easier to use incorrectly in order to optimize
> something that doesn't need to be optimized, does that make it a good
> choice for the uAPI?

IMHO the API is designed around a security proof. Present some groups
and a subset of devices in those groups will be reset. You can't know
the subset unless you do the _INFO thing.

If we wanted it to be clearly linked to scope it should have taken in
a list of device FDs, and reset those devices FDs optimally or
returned -EPERM. Then the reset scope is very clearly connected to the
API.

Jason
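
One possible reading of the "list of device FDs, or -EPERM" alternative is sketched below as a userspace model. It is not an existing or proposed uAPI: `model_scoped_reset_check()` and `model_dev_listed()` are hypothetical, integers stand in for device fds, and the rule that the requested list must match the affected set exactly is an assumption made for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if dev appears in list[0..n). */
bool model_dev_listed(int dev, const int *list, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (list[i] == dev)
			return true;
	return false;
}

/*
 * Succeed only when the reset would hit exactly the devices the user named:
 * nothing outside the list is reset, and nothing in the list is skipped.
 */
int model_scoped_reset_check(const int *affected, size_t n_affected,
			     const int *requested, size_t n_requested)
{
	for (size_t i = 0; i < n_affected; i++)
		if (!model_dev_listed(affected[i], requested, n_requested))
			return -EPERM;	/* reset reaches an unlisted device */

	for (size_t i = 0; i < n_requested; i++)
		if (!model_dev_listed(requested[i], affected, n_affected))
			return -EPERM;	/* listed device would not be reset */

	return 0;
}
```

Under this reading the argument itself states the scope, so a caller that has not looked up the affected set cannot succeed by accident.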

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-20 17:16               ` Jason Gunthorpe
@ 2023-03-21  1:30                 ` Tian, Kevin
  2023-03-21 12:00                   ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-21  1:30 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 21, 2023 1:17 AM
> 
> On Mon, Mar 20, 2023 at 10:31:53PM +0800, Yi Liu wrote:
> > On 2023/3/20 22:09, Jason Gunthorpe wrote:
> > > On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> > >
> > > > # if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > {
> > > >          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > >                 vdev->group->type == VFIO_NO_IOMMU;
> > > > }
> > > > #else
> > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > {
> > > >          struct iommu_group *iommu_group;
> > > >
> > > >          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > >                  return -EINVAL;
> > > >
> > > >          iommu_group = iommu_group_get(vdev->dev);
> > > >          if (iommu_group)
> > > >                  iommu_group_put(iommu_group);
> > > >
> > > >          return !iommu_group;
> > >
> > > If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> > > iommu_ctx pointer in the vdev, don't mess with groups
> >
> > yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
> > vfio_device_open(). But here, we want to use this helper to check if
> > user can use noiommu mode. This is before calling vfio_device_open().
> > e.g. if the device is protected by iommu, then user cannot use noiommu
> > mode on it.
> 
> Why not allow it?
> 
> If the admin has enabled this mode we may as well let it be used.
> 
> You explicitly ask for no-iommu mode by passing -1 for the iommufd
> parameter. If the module parameter says it is allowed then that is all
> you need.
> 

IMHO we should disallow noiommu on a device which already has
an iommu group. This is how noiommu works with vfio group. I don't
think it's a good idea to further relax it in cdev.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21  1:30                 ` Tian, Kevin
@ 2023-03-21 12:00                   ` Jason Gunthorpe
  2023-03-21 14:37                     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 12:00 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, Mar 21, 2023 at 01:30:34AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 21, 2023 1:17 AM
> > 
> > On Mon, Mar 20, 2023 at 10:31:53PM +0800, Yi Liu wrote:
> > > On 2023/3/20 22:09, Jason Gunthorpe wrote:
> > > > On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> > > >
> > > > > # if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > {
> > > > >          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > > >                 vdev->group->type == VFIO_NO_IOMMU;
> > > > > }
> > > > > #else
> > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > {
> > > > >          struct iommu_group *iommu_group;
> > > > >
> > > > >          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > > >                  return -EINVAL;
> > > > >
> > > > >          iommu_group = iommu_group_get(vdev->dev);
> > > > >          if (iommu_group)
> > > > >                  iommu_group_put(iommu_group);
> > > > >
> > > > >          return !iommu_group;
> > > >
> > > > If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> > > > iommu_ctx pointer in the vdev, don't mess with groups
> > >
> > > yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
> > > vfio_device_open(). But here, we want to use this helper to check if
> > > user can use noiommu mode. This is before calling vfio_device_open().
> > > e.g. if the device is protected by iommu, then user cannot use noiommu
> > > mode on it.
> > 
> > Why not allow it?
> > 
> > If the admin has enabled this mode we may as well let it be used.
> > 
> > You explicitly ask for no-iommu mode by passing -1 for the iommufd
> > parameter. If the module parameter says it is allowed then that is all
> > you need.
> > 
> 
> IMHO we should disallow noiommu on a device which already has
> a iommu group. This is how noiommu works with vfio group. I don't
> think it's a good idea to further relax it in cdev.

This isn't the same thing, this will trigger for mdevs and stuff that
should not be noiommu as well.

If you want to copy what the group code does then noiommu needs to be
statically determined at physical vfio device allocation time.

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21 12:00                   ` Jason Gunthorpe
@ 2023-03-21 14:37                     ` Liu, Yi L
  2023-03-21 14:41                       ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-21 14:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: linux-s390, yi.y.sun, mjrosato, kvm, intel-gvt-dev, joro, cohuck,
	Hao, Xudong, peterx, Zhao, Yan Y, eric.auger, Xu, Terrence,
	nicolinc, shameerali.kolothum.thodi, suravee.suthikulpanit,
	intel-gfx, chao.p.peng, lulu, robin.murphy, jasowang

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 21, 2023 8:01 PM
> 
> On Tue, Mar 21, 2023 at 01:30:34AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, March 21, 2023 1:17 AM
> > >
> > > On Mon, Mar 20, 2023 at 10:31:53PM +0800, Yi Liu wrote:
> > > > On 2023/3/20 22:09, Jason Gunthorpe wrote:
> > > > > On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> > > > >
> > > > > > # if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > > {
> > > > > >          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > > > >                 vdev->group->type == VFIO_NO_IOMMU;
> > > > > > }
> > > > > > #else
> > > > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > > {
> > > > > >          struct iommu_group *iommu_group;
> > > > > >
> > > > > > >          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > > > >                  return -EINVAL;
> > > > > >
> > > > > >          iommu_group = iommu_group_get(vdev->dev);
> > > > > >          if (iommu_group)
> > > > > >                  iommu_group_put(iommu_group);
> > > > > >
> > > > > >          return !iommu_group;
> > > > >
> > > > > If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> > > > > iommu_ctx pointer in the vdev, don't mess with groups
> > > >
> > > > > yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
> > > > > vfio_device_open(). But here, we want to use this helper to check if
> > > > > user can use noiommu mode. This is before calling vfio_device_open().
> > > > > e.g. if the device is protected by iommu, then user cannot use noiommu
> > > > > mode on it.
> > >
> > > Why not allow it?
> > >
> > > If the admin has enabled this mode we may as well let it be used.
> > >
> > > You explicitly ask for no-iommu mode by passing -1 for the iommufd
> > > parameter. If the module parameter says it is allowed then that is all
> > > you need.
> > >
> >
> > IMHO we should disallow noiommu on a device which already has
> > a iommu group. This is how noiommu works with vfio group. I don't
> > think it's a good idea to further relax it in cdev.
> 
> This isn't the same thing, this will trigger for mdevs and stuff that
> should not be noiommu as well.

But the group path does disallow noiommu usage if the device has
a real iommu_group (the one created by VFIO code is not real). Would
it be better to keep it consistent from this angle?

> If you want to copy what the group code does then noiommu needs to be
> statically determined at physical vfio device allocation time.

There is another reason which may not be that strong. For devices protected
by iommu, user needs to program IOVA mappings in order to do DMA. Such
device has a real iommu_group. So if we allow using noiommu mode for such
devices, DMA would be blocked by iommu. Perhaps users that use noiommu
mode should not do DMA in the first place.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21 14:37                     ` Liu, Yi L
@ 2023-03-21 14:41                       ` Jason Gunthorpe
  2023-03-21 14:51                         ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 14:41 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, Mar 21, 2023 at 02:37:58PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 21, 2023 8:01 PM
> > 
> > On Tue, Mar 21, 2023 at 01:30:34AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Tuesday, March 21, 2023 1:17 AM
> > > >
> > > > On Mon, Mar 20, 2023 at 10:31:53PM +0800, Yi Liu wrote:
> > > > > On 2023/3/20 22:09, Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> > > > > >
> > > > > > > # if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > > > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > > > {
> > > > > > >          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > > > > >                 vdev->group->type == VFIO_NO_IOMMU;
> > > > > > > }
> > > > > > > #else
> > > > > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > > > {
> > > > > > >          struct iommu_group *iommu_group;
> > > > > > >
> > > > > > > >          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > > > > >                  return -EINVAL;
> > > > > > >
> > > > > > >          iommu_group = iommu_group_get(vdev->dev);
> > > > > > >          if (iommu_group)
> > > > > > >                  iommu_group_put(iommu_group);
> > > > > > >
> > > > > > >          return !iommu_group;
> > > > > >
> > > > > > If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> > > > > > iommu_ctx pointer in the vdev, don't mess with groups
> > > > >
> > > > > > yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
> > > > > > vfio_device_open(). But here, we want to use this helper to check if
> > > > > > user can use noiommu mode. This is before calling vfio_device_open().
> > > > > > e.g. if the device is protected by iommu, then user cannot use noiommu
> > > > > > mode on it.
> > > >
> > > > Why not allow it?
> > > >
> > > > If the admin has enabled this mode we may as well let it be used.
> > > >
> > > > You explicitly ask for no-iommu mode by passing -1 for the iommufd
> > > > parameter. If the module parameter says it is allowed then that is all
> > > > you need.
> > > >
> > >
> > > IMHO we should disallow noiommu on a device which already has
> > > a iommu group. This is how noiommu works with vfio group. I don't
> > > think it's a good idea to further relax it in cdev.
> > 
> > This isn't the same thing, this will trigger for mdevs and stuff that
> > should not be noiommu as well.
> 
> But the group path does disallow noiommu usage if the device has
> a real iommu_group (the one created by VFIO code is not real). Would
> it be better to keep it consistent from this angle?
> 
> > If you want to copy what the group code does then noiommu needs to be
> > statically determined at physical vfio device allocation time.
> 
> There is another reason which may not that strong. For devices protected
> by iommu, user needs to program IOVA mappings in order to do DMA. Such
> device has a real iommu_group. 

Oh that is a good reason for sure

But still, this check should be done at device creation time just like
in group mode, not during each attach call.

Jason
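
The "decide at device creation time, not on every call" point can be sketched with a userspace model. This is not the vfio code: `struct model_vdev`, `model_register()` and `model_bind()` are hypothetical, the `has_real_iommu_group` flag stands in for whatever the core would query, and the handling of a non-negative iommufd value is deliberately left out of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a physical vfio device. */
struct model_vdev {
	bool has_real_iommu_group;	/* device sits behind an IOMMU */
	bool noiommu;			/* decided once, at registration */
};

/* Decide once whether this device may ever be used in no-iommu mode. */
void model_register(struct model_vdev *vdev, bool noiommu_param_enabled)
{
	vdev->noiommu = noiommu_param_enabled && !vdev->has_real_iommu_group;
}

/*
 * Bind-time path: a negative iommufd requests no-iommu mode. The check only
 * reads the cached flag instead of looking up iommu groups again.
 */
int model_bind(const struct model_vdev *vdev, int iommufd)
{
	if (iommufd < 0)
		return vdev->noiommu ? 0 : -EPERM;

	return 0;	/* normal iommufd path, not modelled here */
}
```

Caching the decision keeps the bind path free of group lookups, and it also means a device behind a real IOMMU can never be switched into no-iommu mode later.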

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21 14:41                       ` Jason Gunthorpe
@ 2023-03-21 14:51                         ` Liu, Yi L
  2023-03-21 14:58                           ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-21 14:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 21, 2023 10:41 PM
> 
> On Tue, Mar 21, 2023 at 02:37:58PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, March 21, 2023 8:01 PM
> > >
> > > On Tue, Mar 21, 2023 at 01:30:34AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > > Sent: Tuesday, March 21, 2023 1:17 AM
> > > > >
> > > > > On Mon, Mar 20, 2023 at 10:31:53PM +0800, Yi Liu wrote:
> > > > > > On 2023/3/20 22:09, Jason Gunthorpe wrote:
> > > > > > > On Wed, Mar 15, 2023 at 04:40:19AM +0000, Liu, Yi L wrote:
> > > > > > >
> > > > > > > > > #if IS_ENABLED(CONFIG_VFIO_GROUP)
> > > > > > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > > > > > {
> > > > > > > > >          return IS_ENABLED(CONFIG_VFIO_NOIOMMU) &&
> > > > > > > > >                 vdev->group->type == VFIO_NO_IOMMU;
> > > > > > > > > }
> > > > > > > > > #else
> > > > > > > > > static inline bool vfio_device_is_noiommu(struct vfio_device *vdev)
> > > > > > > > > {
> > > > > > > > >          struct iommu_group *iommu_group;
> > > > > > > > >
> > > > > > > > >          if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > > > > > > > >                  return false;
> > > > > > > > >
> > > > > > > > >          iommu_group = iommu_group_get(vdev->dev);
> > > > > > > > >          if (iommu_group)
> > > > > > > > >                  iommu_group_put(iommu_group);
> > > > > > > > >
> > > > > > > > >          return !iommu_group;
> > > > > > >
> > > > > > > > If we don't have VFIO_GROUP then no-iommu is signaled by a NULL
> > > > > > > > iommu_ctx pointer in the vdev, don't mess with groups
> > > > > > >
> > > > > > > yes, NULL iommufd_ctx pointer would be set in vdev and passed to the
> > > > > > > vfio_device_open(). But here, we want to use this helper to check if
> > > > > > > user can use noiommu mode. This is before calling vfio_device_open().
> > > > > > > e.g. if the device is protected by iommu, then user cannot use noiommu
> > > > > > > mode on it.
> > > > >
> > > > > Why not allow it?
> > > > >
> > > > > If the admin has enabled this mode we may as well let it be used.
> > > > >
> > > > > You explicitly ask for no-iommu mode by passing -1 for the iommufd
> > > > > parameter. If the module parameter says it is allowed then that is all
> > > > > you need.
> > > > >
> > > >
> > > > IMHO we should disallow noiommu on a device which already has
> > > > a iommu group. This is how noiommu works with vfio group. I don't
> > > > think it's a good idea to further relax it in cdev.
> > >
> > > This isn't the same thing, this will trigger for mdevs and stuff that
> > > should not be noiommu as well.
> >
> > But the group path does disallow noiommu usage if the device has
> > a real iommu_group (the one created by VFIO code is not real). Would
> > it be better to keep it consistent from this angle?
> >
> > > If you want to copy what the group code does then noiommu needs to be
> > > statically determined at physical vfio device allocation time.
> >
> > There is another reason which may not be that strong. For devices protected
> > by iommu, user needs to program IOVA mappings in order to do DMA. Such
> > device has a real iommu_group.
> 
> Oh that is a good reason for sure
> 
> But still, this check should be done at device creation time just like
> in group mode, not during each attach call.

That seems to require a noiommu_capable flag in vfio_device. We would
set this flag only when vfio_group->type == VFIO_NO_IOMMU, or when the
device has no real iommu_group (for the case in which the vfio group
code is compiled out). Perhaps the check below should be added as well.
Then, at bind time, we just check the noiommu_capable flag and
capable(CAP_SYS_RAWIO).

if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
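
To make the bind-time rule above concrete, here is a minimal stand-alone
sketch in plain C. It is not kernel code and not taken from the series:
every mock_* name is hypothetical, the two booleans only model
IS_ENABLED(CONFIG_VFIO_NOIOMMU) and the vfio_noiommu module parameter,
has_cap_sys_rawio models capable(CAP_SYS_RAWIO), and the choice of error
codes is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct vfio_device. */
struct mock_vfio_device {
	bool noiommu_capable;	/* assumed flag, decided once before any bind */
};

/* Hypothetical stand-in for build options, module parameter and caller. */
struct mock_bind_env {
	bool config_noiommu;	/* models IS_ENABLED(CONFIG_VFIO_NOIOMMU) */
	bool param_noiommu;	/* models the vfio_noiommu module parameter */
	bool has_cap_sys_rawio;	/* models capable(CAP_SYS_RAWIO) */
};

/*
 * Bind-time policy for a no-iommu request.
 * Returns 0 if the request is allowed, a negative errno otherwise.
 */
static int mock_check_noiommu_bind(const struct mock_vfio_device *vdev,
				   const struct mock_bind_env *env)
{
	assert(vdev != NULL && env != NULL);

	/* The admin must have enabled no-iommu mode at all. */
	if (!env->config_noiommu || !env->param_noiommu)
		return -EINVAL;
	/* The device must have been marked capable ahead of time. */
	if (!vdev->noiommu_capable)
		return -EINVAL;
	/* The caller must hold the raw I/O capability. */
	if (!env->has_cap_sys_rawio)
		return -EPERM;
	return 0;
}
```

The point of the sketch is that nothing here looks at groups: bind only
consults a flag that was computed earlier.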

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21 14:51                         ` Liu, Yi L
@ 2023-03-21 14:58                           ` Jason Gunthorpe
  2023-03-21 15:10                             ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 14:58 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, Mar 21, 2023 at 02:51:20PM +0000, Liu, Yi L wrote:
> > But still, this check should be done at device creation time just like
> > in group mode, not during each attach call.
> 
> Seems like requiring a noiommu_capable flag in vfio_device. We
> set this flag only when the vfio_group->type==VFIO_NO_IOMMU
> or no real iommu_group (for the case in which vfio group code is
> compiled out). Perhaps the below check should be added as well.
> Then in the time of bind, just check the noiommu_capable flag
> and capable(CAP_SYS_RAWIO).
> 
> if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)

Yes, and also only for physical devices

Jason

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21 14:58                           ` Jason Gunthorpe
@ 2023-03-21 15:10                             ` Liu, Yi L
  2023-03-21 16:54                               ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-21 15:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, March 21, 2023 10:58 PM
> 
> On Tue, Mar 21, 2023 at 02:51:20PM +0000, Liu, Yi L wrote:
> > > But still, this check should be done at device creation time just like
> > > in group mode, not during each attach call.
> >
> > Seems like requiring a noiommu_capable flag in vfio_device. We
> > set this flag only when the vfio_group->type==VFIO_NO_IOMMU
> > or no real iommu_group (for the case in which vfio group code is
> > compiled out). Perhaps the below check should be added as well.
> > Then in the time of bind, just check the noiommu_capable flag
> > and capable(CAP_SYS_RAWIO).
> >
> > if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> 
> Yes, and also only for physical devices

Sure. BTW, at creation time there is no vfio group yet, so we may
just check whether the device has a real iommu_group. An alternative
is to set this flag at vfio_device registration time. Physical device
drivers and emulated device drivers use different registration APIs,
so they can be distinguished easily.

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD
  2023-03-21 15:10                             ` Liu, Yi L
@ 2023-03-21 16:54                               ` Jason Gunthorpe
  0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 16:54 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Tue, Mar 21, 2023 at 03:10:50PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, March 21, 2023 10:58 PM
> > 
> > On Tue, Mar 21, 2023 at 02:51:20PM +0000, Liu, Yi L wrote:
> > > > But still, this check should be done at device creation time just like
> > > > in group mode, not during each attach call.
> > >
> > > Seems like requiring a noiommu_capable flag in vfio_device. We
> > > set this flag only when the vfio_group->type==VFIO_NO_IOMMU
> > > or no real iommu_group (for the case in which vfio group code is
> > > compiled out). Perhaps the below check should be added as well.
> > > Then in the time of bind, just check the noiommu_capable flag
> > > and capable(CAP_SYS_RAWIO).
> > >
> > > if (!IS_ENABLED(CONFIG_VFIO_NOIOMMU) || !vfio_noiommu)
> > 
> > Yes, and also only for physical devices
> 
> Sure. BTW. in the time of creation, there is no vfio group yet. So may
> just check if the device has a real iommu_group. Another alternative
> is to set this flag at the time of vfio_device registration. Physical
> device driver and emulated device driver uses different register APIs.
> Hence they can be distinguished easily.

Yes, at registration. In group mode it should copy the flag from the
vfio_group, in non-group mode it should query the iommu_group
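
A minimal sketch of that registration-time rule, as stand-alone C rather
than kernel code. All mock_* names are hypothetical, has_real_iommu_group
only models the result of querying the iommu_group, is_physical models
"registered through the physical-device API", and noiommu_enabled models
the config option plus module parameter being set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the group types discussed in the thread. */
enum mock_group_type {
	MOCK_VFIO_IOMMU,
	MOCK_VFIO_EMULATED_IOMMU,
	MOCK_VFIO_NO_IOMMU,
};

struct mock_vfio_group {
	enum mock_group_type type;
};

struct mock_vfio_device {
	struct mock_vfio_group *group;	/* NULL when group code is compiled out */
	bool has_real_iommu_group;	/* models the iommu_group query result */
	bool is_physical;		/* registered via the physical-device API */
	bool noiommu;			/* decided once, at registration */
};

/* Decide the no-iommu flag once, when the device is registered. */
static void mock_register_set_noiommu(struct mock_vfio_device *vdev,
				      bool noiommu_enabled)
{
	assert(vdev != NULL);

	/* Only physical devices qualify, and only if the admin enabled it. */
	if (!noiommu_enabled || !vdev->is_physical) {
		vdev->noiommu = false;
		return;
	}

	if (vdev->group)
		/* Group mode: copy the decision already made for the group. */
		vdev->noiommu = (vdev->group->type == MOCK_VFIO_NO_IOMMU);
	else
		/* Non-group mode: no real iommu_group means no-iommu. */
		vdev->noiommu = !vdev->has_real_iommu_group;
}
```

With the flag fixed at registration, later bind or attach calls only need
to read vdev->noiommu.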

Jason

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-20 23:39                     ` Jason Gunthorpe
@ 2023-03-21 20:31                       ` Alex Williamson
  2023-03-21 20:50                         ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-21 20:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao,  Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Mon, 20 Mar 2023 20:39:07 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Mar 20, 2023 at 04:52:17PM -0600, Alex Williamson wrote:
> 
> > > The APIs are well defined and userspace can always use them wrong. It
> > > doesn't need to call RESET_INFO even today, it can just trivially pass
> > > every group FD it owns to meet the security check.  
> > 
> > That's not actually true, in order to avoid arbitrarily large buffers
> > from the user, the ioctl won't accept an array greater than the number
> > of devices affected by the reset.  
> 
> Oh yuk!
> 
> > > It is much simpler if VFIO_DEVICE_PCI_HOT_RESET can pass the security
> > > check without code marshalling fds, which is why we went this
> > > direction.  
> > 
> > I agree that nullifying the arg makes the ioctl easier to use, but my
> > hesitation is whether it makes it more difficult to use correctly,
> > which includes resetting devices unexpectedly.  
> 
> I don't think it makes it harder to use correctly. It maybe makes it
> easier to misuse, but IMHO not too much.
> 
> If the desire was to have an API that explicitly acknowledged the
> reset scope then it should have taken in a list of device FDs and
> optimally reset all of them or fail EPERM.
> 
> What is going to make this hard to use is the _INFO IOCTL, it returns
> basically the BDF string, but I think we effectively get rid of this
> in the new model. libvirt will know the BDF and open the cdev, then fd
> pass the cdev to qemu. Qemu shouldn't also have to know the sysfs
> path..
> 
> So we really want a new _INFO ioctl to make this easier to use..

I think this makes it even worse.  If userspace cannot match BDFs from
the _INFO ioctl to devices files, for proof of ownership or scope, then
the answer is clearly not "get rid of the device files".

> > We can always blame the developer for using an interface incorrectly,
> > but if we make it easier to use incorrectly in order to optimize
> > something that doesn't need to be optimized, does that make it a good
> > choice for the uAPI?  
> 
> IMHO the API is designed around a security proof. Present some groups
> and a subset of devices in those groups will be reset. You can't know
> the subset unless you do the _INFO thing.
> 
> If we wanted it to be clearly linked to scope it should have taken in
> a list of device FDs, and reset those devices FDs optimally or
> returned -EPERM. Then the reset scope is very clearly connected to the
> API.

This just seems like nit-picking that the API could have accomplished
this more concisely.  Probably that's true, but I think you've
identified a gap above that amplifies the issue.  If the user cannot
map BDFs to cdevs because the cdevs are passed as open fds to the user
driver, the _INFO results become meaningless and by removing the fds
array, that becomes the obvious choice that a user presented with this
dilemma would take.  We're skipping past easier to misuse, difficult to
use correctly, and circling around no obvious way to use correctly.

Unfortunately the _INFO ioctl does presume that userspace knows the BDF
to device mappings today, so if we are attempting to pre-enable a case
with cdev support where that is not the case, then there must be
something done with the _INFO ioctl to provide scope.  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 20:31                       ` Alex Williamson
@ 2023-03-21 20:50                         ` Jason Gunthorpe
  2023-03-21 21:01                           ` Alex Williamson
  2023-03-22  8:17                           ` Liu, Yi L
  0 siblings, 2 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 20:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote:

> This just seems like nit-picking that the API could have accomplished
> this more concisely.  Probably that's true, but I think you've
> identified a gap above that amplifies the issue.  If the user cannot
> map BDFs to cdevs because the cdevs are passed as open fds to the user
> driver, the _INFO results become meaningless and by removing the fds
> array, that becomes the obvious choice that a user presented with this
> dilemma would take.  We're skipping past easier to misuse, difficult to
> use correctly, and circling around no obvious way to use correctly.

No - this just isn't finished yet is all it means :(

I only noticed it just now. Presumably Eric would have discovered this
when he tried to implement the FD pass, and we would have made a new
_INFO at that point (or, uglier, had libvirt pass the BDF along
with the FD).

> Unfortunately the _INFO ioctl does presume that userspace knows the BDF
> to device mappings today, so if we are attempting to pre-enable a case
> with cdev support where that is not the case, then there must be
> something done with the _INFO ioctl to provide scope.

Yes, something is required with _INFO before libvirt can use a FD
pass. I'm thinking of a new _INFO query that returns the iommufd
dev_ids for the reset group. Then qemu can match the dev_ids back to
cdev FDs and thus vPCI devices and do what it needs to do.
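
As a rough userspace-side illustration of "match the dev_ids back to cdev
FDs", here is a stand-alone C sketch. It assumes userspace recorded a
dev_id for each cdev fd when it bound that fd to iommufd; struct mock_cdev
and mock_fd_for_dev_id() are hypothetical names, not part of qemu or of
any proposed uAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device record kept by userspace. */
struct mock_cdev {
	int fd;			/* cdev fd, e.g. received via fd passing */
	uint32_t dev_id;	/* dev_id recorded when the fd was bound */
};

/*
 * Look up the fd that owns a dev_id reported by the (future) _INFO query.
 * Returns the fd, or -1 if the dev_id does not belong to this process.
 */
static int mock_fd_for_dev_id(const struct mock_cdev *devs, size_t ndevs,
			      uint32_t dev_id)
{
	assert(devs != NULL || ndevs == 0);

	for (size_t i = 0; i < ndevs; i++) {
		if (devs[i].dev_id == dev_id)
			return devs[i].fd;
	}
	return -1;
}
```

No BDF or sysfs path is involved in the lookup, which is the property the
fd-passing case needs.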

But for the current qemu setup it will open cdev directly and it will
know the BDF so it can still use the current _INFO.

Though it would be nice if qemu didn't need two implementations, so Yi,
I'd rather see a new _INFO in this series as well, so that qemu can
consistently use dev_id and never BDF in iommufd mode.

Anyhow, I don't see the two topics as really related, the intention is
not to discourage people from calling _INFO, it is just to make the
security proof simpler and more logical.

Jason

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 20:50                         ` Jason Gunthorpe
@ 2023-03-21 21:01                           ` Alex Williamson
  2023-03-21 22:20                             ` Jason Gunthorpe
  2023-03-24  9:09                             ` Tian, Kevin
  2023-03-22  8:17                           ` Liu, Yi L
  1 sibling, 2 replies; 103+ messages in thread
From: Alex Williamson @ 2023-03-21 21:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao,  Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, 21 Mar 2023 17:50:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote:
> 
> > This just seems like nit-picking that the API could have accomplished
> > this more concisely.  Probably that's true, but I think you've
> > identified a gap above that amplifies the issue.  If the user cannot
> > map BDFs to cdevs because the cdevs are passed as open fds to the user
> > driver, the _INFO results become meaningless and by removing the fds
> > array, that becomes the obvious choice that a user presented with this
> > dilemma would take.  We're skipping past easier to misuse, difficult to
> > use correctly, and circling around no obvious way to use correctly.  
> 
> No - this just isn't finished yet is all it means :(
> 
> I just noticed it just now, presumably Eric would have discovered this
> when he tried to implement the FD pass and we would have made a new
> _INFO at that point (or more ugly, have libvirt pass the BDF along
> with the FD).
> 
> > Unfortunately the _INFO ioctl does presume that userspace knows the BDF
> > to device mappings today, so if we are attempting to pre-enable a case
> > with cdev support where that is not the case, then there must be
> > something done with the _INFO ioctl to provide scope.  
> 
> Yes, something is required with _INFO before libvirt can use a FD
> pass. I'm thinking of a new _INFO query that returns the iommufd
> dev_ids for the reset group. Then qemu can match the dev_ids back to
> cdev FDs and thus vPCI devices and do what it needs to do.
> 
> But for the current qemu setup it will open cdev directly and it will
> know the BDF so it can still use the current _INFO.
> 
> Though it would be nice if qemu didn't need two implementations so Yi
> I'd rather see a new info in this series as well and qemu can just
> consistently use dev_id and never bdf in iommufd mode.

We also need to consider how libvirt determines if QEMU has the kernel
support it needs to pass file descriptors.  It'd be a lot cleaner if
this aligned with the introduction of vfio cdevs.
 
> Anyhow, I don't see the two topics as really related, the intention is
> not to discourage people from calling _INFO, it is just to make the
> security proof simpler and more logical.

At a minimum, we need a new _INFO ioctl to get back to the point where
it's only a discussion of whether we're checking the user on scope.  We
can't remove the array while doing so opens up an obviously incorrect
solution to an impossible to use API.  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 21:01                           ` Alex Williamson
@ 2023-03-21 22:20                             ` Jason Gunthorpe
  2023-03-21 22:47                               ` Alex Williamson
  2023-03-24  9:09                             ` Tian, Kevin
  1 sibling, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-21 22:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:

> > Though it would be nice if qemu didn't need two implementations so Yi
> > I'd rather see a new info in this series as well and qemu can just
> > consistently use dev_id and never bdf in iommufd mode.
> 
> We also need to consider how libvirt determines if QEMU has the kernel
> support it needs to pass file descriptors.  It'd be a lot cleaner if
> this aligned with the introduction of vfio cdevs.

Yes, that would be much better if it was one package.

But this is starting to grow and we have so many threads that need to
progress blocked on this cdev enablement :(

Could we go forward with the cdev main patches and kconfig it to
experimental or something while the rest of the parts are completed
and tested through qemu? ie move the vfio-pci reset enablement to after
the cdev patches?

Jason

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 22:20                             ` Jason Gunthorpe
@ 2023-03-21 22:47                               ` Alex Williamson
  2023-03-22  4:42                                 ` Liu, Yi L
  2023-03-22 12:27                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 103+ messages in thread
From: Alex Williamson @ 2023-03-21 22:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao,  Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, 21 Mar 2023 19:20:37 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> 
> > > Though it would be nice if qemu didn't need two implementations so Yi
> > > I'd rather see a new info in this series as well and qemu can just
> > > consistently use dev_id and never bdf in iommufd mode.  
> > 
> > We also need to consider how libvirt determines if QEMU has the kernel
> > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > this aligned with the introduction of vfio cdevs.  
> 
> Yes, that would be much better if it was one package.
> 
> But this is starting to grow and we have so many threads that need to
> progress blocked on this cdev enablement :(
> 
> Could we go forward with the cdev main patches and kconfig it to
> experimental or something while the rest of the parts are completed
> and tested through qemu? ie move the vfio-pci reset enablement to after
> the cdev patches?

We need to be able to guarantee that there cannot be any significant
builds of the kernel with vfio cdev support if our intention is to stage
it for libvirt.  We don't have a global EXPERIMENTAL config option any
more.  Adding new code under BROKEN seems wrong.  Fedora ships with
STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
the proposed mechanism to make this not look like a big uncompiled code
dump?  Thanks,

Alex


* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 22:47                               ` Alex Williamson
@ 2023-03-22  4:42                                 ` Liu, Yi L
  2023-03-22 12:23                                   ` Alex Williamson
  2023-03-22 12:27                                 ` Jason Gunthorpe
  1 sibling, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-22  4:42 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: linux-s390, Zhao, Yan Y, mjrosato, kvm, intel-gvt-dev, jasowang,
	cohuck, Hao,  Xudong, robin.murphy, peterx,
	suravee.suthikulpanit, eric.auger, Xu,  Terrence, nicolinc,
	shameerali.kolothum.thodi, yi.y.sun, chao.p.peng, lulu,
	intel-gfx, joro

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, March 22, 2023 6:48 AM
> 
> On Tue, 21 Mar 2023 19:20:37 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> >
> > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > I'd rather see a new info in this series as well and qemu can just
> > > > consistently use dev_id and never bdf in iommufd mode.
> > >
> > > We also need to consider how libvirt determines if QEMU has the kernel
> > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > this aligned with the introduction of vfio cdevs.
> >
> > Yes, that would be much better if it was one package.
> >
> > But this is starting to grow and we have so many threads that need to
> > progress blocked on this cdev enablement :(
> >
> > Could we go forward with the cdev main patches and kconfig it to
> > experimental or something while the rest of the parts are completed
> > and tested through qemu? ie move the vfio-pci reset enablement to after
> > the cdev patches?
> 
> We need to be able to guarantee that there cannot be any significant
> builds of the kernel with vfio cdev support if our intention is to stage
> it for libvirt.  We don't have a global EXPERIMENTAL config option any
> more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> the proposed mechanism to make this not look like a big uncompiled code
> dump?  Thanks,

Just out of curiosity, is the BDF mapping gap only for cdev, or does it
also exist in the traditional group path? IMHO, if it is only a gap for
cdev, maybe we can use CONFIG_VFIO_DEVICE_CDEV to stage it. This kconfig
is N by default. I think that won't change until the whole ecosystem is
updated.

Anyhow, I'll also look into the complexity of adding a new _INFO ioctl. It
should return a set of dev_ids to the user rather than the BDF info returned
by the existing _INFO ioctl. Is that right?

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 20:50                         ` Jason Gunthorpe
  2023-03-21 21:01                           ` Alex Williamson
@ 2023-03-22  8:17                           ` Liu, Yi L
  2023-03-22 12:17                             ` Jason Gunthorpe
  1 sibling, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-22  8:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: linux-s390, Zhao, Yan Y, mjrosato, kvm, intel-gvt-dev, jasowang,
	cohuck, Hao,  Xudong, robin.murphy, peterx,
	suravee.suthikulpanit, eric.auger, Xu,  Terrence, nicolinc,
	shameerali.kolothum.thodi, yi.y.sun, chao.p.peng, lulu,
	intel-gfx, joro

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 22, 2023 4:50 AM
> 
> On Tue, Mar 21, 2023 at 02:31:22PM -0600, Alex Williamson wrote:
> 
> > This just seems like nit-picking that the API could have accomplished
> > this more concisely.  Probably that's true, but I think you've
> > identified a gap above that amplifies the issue.  If the user cannot
> > map BDFs to cdevs because the cdevs are passed as open fds to the user
> > driver, the _INFO results become meaningless and by removing the fds
> > array, that becomes the obvious choice that a user presented with this
> > dilemma would take.  We're skipping past easier to misuse, difficult to
> > use correctly, and circling around no obvious way to use correctly.
> 
> No - this just isn't finished yet is all it means :(
> 
> I just noticed it just now, presumably Eric would have discovered this
> when he tried to implement the FD pass and we would have made a new
> _INFO at that point (or more ugly, have libvirt pass the BDF along
> with the FD).
> 
> > Unfortunately the _INFO ioctl does presume that userspace knows the BDF
> > to device mappings today, so if we are attempting to pre-enable a case
> > with cdev support where that is not the case, then there must be
> > something done with the _INFO ioctl to provide scope.
> 
> Yes, something is required with _INFO before libvirt can use a FD
> pass. I'm thinking of a new _INFO query that returns the iommufd
> dev_ids for the reset group. Then qemu can match the dev_ids back to
> cdev FDs and thus vPCI devices and do what it needs to do.

Could you elaborate on what is required with _INFO before libvirt can
use a FD pass?

> But for the current qemu setup it will open cdev directly and it will
> know the BDF so it can still use the current _INFO.
> 
> Though it would be nice if qemu didn't need two implementations so Yi
> I'd rather see a new info in this series as well and qemu can just
> consistently use dev_id and never bdf in iommufd mode.

I have one concern here. An iommufd dev_id is not static information the
way a BDF is. It is generated when the device is bound to iommufd. So if
some affected devices are not yet bound to iommufd when _INFO is invoked,
the _INFO ioctl only gets a subset of the affected devices. Is that enough?

Regards,
Yi Liu

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22  8:17                           ` Liu, Yi L
@ 2023-03-22 12:17                             ` Jason Gunthorpe
  2023-03-22 13:33                               ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-22 12:17 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, Mar 22, 2023 at 08:17:54AM +0000, Liu, Yi L wrote:

> Could you elaborate what is required with _INFO before libvirt can
> use a FD pass?

Make a new _INFO that returns an array of dev_ids within the cdev's
iommufd_ctx that are part of the reset group, eg the devset.

qemu will call this for each dev_id after it opens the cdev to
generate the groupings.

> > But for the current qemu setup it will open cdev directly and it will
> > know the BDF so it can still use the current _INFO.
> > 
> > Though it would be nice if qemu didn't need two implementations so Yi
> > I'd rather see a new info in this series as well and qemu can just
> > consistently use dev_id and never bdf in iommufd mode.
> 
> I have one concern here. iommufd dev_id is not a static info as much as
> bdf. It is generated when bound to iommufd. So if there are devices that
> are affected but not bound to iommufd yet at the time of invoking _INFO,
> then the _INFO ioctl just gets a subset of the affected devices. Is it enough?

I'd probably use similar logic to the reset path: if one of the reset
group devices is open and on a different iommufd_ctx, then fail the
IOCTL with EPERM.
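
One possible shape for that logic, as a stand-alone C sketch and not a
proposed implementation. The mock_* names are hypothetical, ctx_id is just
an integer standing in for an iommufd_ctx, and two choices are assumptions
of this sketch only: unopened devices are skipped because they have no
dev_id yet, and a too-small output array fails with ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one member of the reset group (the devset). */
struct mock_dev {
	bool opened;		/* currently opened by some user */
	int ctx_id;		/* stand-in for its iommufd_ctx, valid if opened */
	uint32_t dev_id;	/* dev_id within that ctx, valid if opened */
};

/*
 * Collect the dev_ids of the reset group as seen from caller_ctx.
 * Returns the number of ids written to out[], -EPERM if an opened member
 * is bound to a different ctx, or -ENOSPC if out[] is too small.
 */
static int mock_hot_reset_info(const struct mock_dev *devset, size_t ndevs,
			       int caller_ctx, uint32_t *out, size_t out_cap)
{
	size_t count = 0;

	assert(devset != NULL && out != NULL);

	for (size_t i = 0; i < ndevs; i++) {
		/* Assumption: unbound devices have no dev_id to report. */
		if (!devset[i].opened)
			continue;
		/* Same ownership rule as the reset path in the thread. */
		if (devset[i].ctx_id != caller_ctx)
			return -EPERM;
		if (count == out_cap)
			return -ENOSPC;
		out[count++] = devset[i].dev_id;
	}
	return (int)count;
}
```

Whether unopened members should be reported in some other way is exactly
the open question raised above; the sketch does not settle it.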

Jason

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22  4:42                                 ` Liu, Yi L
@ 2023-03-22 12:23                                   ` Alex Williamson
  0 siblings, 0 replies; 103+ messages in thread
From: Alex Williamson @ 2023-03-22 12:23 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc,
	Jason Gunthorpe, Zhao, Yan Y, intel-gfx, eric.auger,
	intel-gvt-dev, yi.y.sun, cohuck, shameerali.kolothum.thodi,
	suravee.suthikulpanit, robin.murphy

On Wed, 22 Mar 2023 04:42:16 +0000
"Liu, Yi L" <yi.l.liu@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, March 22, 2023 6:48 AM
> > 
> > On Tue, 21 Mar 2023 19:20:37 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > >  
> > > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > > I'd rather see a new info in this series as well and qemu can just
> > > > > consistently use dev_id and never bdf in iommufd mode.  
> > > >
> > > > We also need to consider how libvirt determines if QEMU has the kernel
> > > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > > this aligned with the introduction of vfio cdevs.  
> > >
> > > Yes, that would be much better if it was one package.
> > >
> > > But this is starting to grow and we have so many threads that need to
> > > progress blocked on this cdev enablement :(
> > >
> > > Could we go forward with the cdev main patches and kconfig it to
> > > experimental or something while the rest of the parts are completed
> > > and tested through qemu? ie move the vfio-pci reset enablment to after
> > > the cdev patches?  
> > 
> > We need to be able to guarantee that there cannot be any significant
> > builds of the kernel with vfio cdev support if our intention is to stage
> > it for libvirt.  We don't have a global EXPERIMENTAL config option any
> > more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> > STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> > the proposed mechanism to make this not look like a big uncompiled code
> > dump?  Thanks,  
> 
> Just out of curious, is the BDF mapping gap only for cdev or it also
> exists in the traditional group path?

The group path doesn't support passing file descriptors; getting access
to the device files requires a full container configuration, which
implies significant policy decisions in libvirt.  Even if groups were
passed, QEMU would need to know the device name, i.e. the BDF in string
format, to get the device from the group.

> IMHO, if it is only a gap for cdev, maybe
> we can use CONFIG_VFIO_DEVICE_CDEV to stage it. This kconfig is N by
> default. I think it won't change until one day the whole ecosystem is
> updated.

See the "toothless" comment above.  Disabling vfio cdev support by
default because we don't have feature parity in reset support does not
provide any guarantee to libvirt that it can effectively take
advantage of passing cdev fds to QEMU.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 22:47                               ` Alex Williamson
  2023-03-22  4:42                                 ` Liu, Yi L
@ 2023-03-22 12:27                                 ` Jason Gunthorpe
  2023-03-22 12:36                                   ` Alex Williamson
  1 sibling, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-22 12:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote:
> On Tue, 21 Mar 2023 19:20:37 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > 
> > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > I'd rather see a new info in this series as well and qemu can just
> > > > consistently use dev_id and never bdf in iommufd mode.  
> > > 
> > > We also need to consider how libvirt determines if QEMU has the kernel
> > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > this aligned with the introduction of vfio cdevs.  
> > 
> > Yes, that would be much better if it was one package.
> > 
> > But this is starting to grow and we have so many threads that need to
> > progress blocked on this cdev enablement :(
> > 
> > Could we go forward with the cdev main patches and kconfig it to
> > experimental or something while the rest of the parts are completed
> > and tested through qemu? ie move the vfio-pci reset enablment to after
> > the cdev patches?
> 
> We need to be able to guarantee that there cannot be any significant
> builds of the kernel with vfio cdev support if our intention is to stage
> it for libvirt.  We don't have a global EXPERIMENTAL config option any
> more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> the proposed mechanism to make this not look like a big uncompiled code
> dump?  Thanks,

I would suggest a sternly worded kconfig and STAGING.

This isn't such a big issue, we are trying to say that a future
released qemu is not required to work on older kernels with a STAGING
kconfig mark.

IOW we are saying that qemu release X.0 with production iommufd
requires kernel version > x.y and just lightly reflecting this into
the kconfig.

qemu should simply not support iommufd if it finds itself on an old
kernel.

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22 12:27                                 ` Jason Gunthorpe
@ 2023-03-22 12:36                                   ` Alex Williamson
  2023-03-22 12:47                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Alex Williamson @ 2023-03-22 12:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao,  Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Wed, 22 Mar 2023 09:27:16 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote:
> > On Tue, 21 Mar 2023 19:20:37 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > >   
> > > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > > I'd rather see a new info in this series as well and qemu can just
> > > > > consistently use dev_id and never bdf in iommufd mode.    
> > > > 
> > > > We also need to consider how libvirt determines if QEMU has the kernel
> > > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > > this aligned with the introduction of vfio cdevs.    
> > > 
> > > Yes, that would be much better if it was one package.
> > > 
> > > But this is starting to grow and we have so many threads that need to
> > > progress blocked on this cdev enablement :(
> > > 
> > > Could we go forward with the cdev main patches and kconfig it to
> > > experimental or something while the rest of the parts are completed
> > > and tested through qemu? ie move the vfio-pci reset enablment to after
> > > the cdev patches?  
> > 
> > We need to be able to guarantee that there cannot be any significant
> > builds of the kernel with vfio cdev support if our intention is to stage
> > it for libvirt.  We don't have a global EXPERIMENTAL config option any
> > more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> > STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> > the proposed mechanism to make this not look like a big uncompiled code
> > dump?  Thanks,  
> 
> I would suggest a sternly worded kconfig and STAGING.
> 
> This isn't such a big issue, we are trying to say that a future
> released qemu is not required to work on older kernels with a STAGING
> kconfig mark.
> 
> IOW we are saying that qemu release X.0 with production iommufd
> requires kernel version > x.y and just lightly reflecting this into
> the kconfig.
> 
> qemu should simply not support iommufd if it finds itself on a old
> kernel.

Inferring features based on kernel versions doesn't work in a world
where downstreams backport features to older kernels.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22 12:36                                   ` Alex Williamson
@ 2023-03-22 12:47                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-22 12:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Wed, Mar 22, 2023 at 06:36:14AM -0600, Alex Williamson wrote:
> On Wed, 22 Mar 2023 09:27:16 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Mar 21, 2023 at 04:47:37PM -0600, Alex Williamson wrote:
> > > On Tue, 21 Mar 2023 19:20:37 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >   
> > > > On Tue, Mar 21, 2023 at 03:01:12PM -0600, Alex Williamson wrote:
> > > >   
> > > > > > Though it would be nice if qemu didn't need two implementations so Yi
> > > > > > I'd rather see a new info in this series as well and qemu can just
> > > > > > consistently use dev_id and never bdf in iommufd mode.    
> > > > > 
> > > > > We also need to consider how libvirt determines if QEMU has the kernel
> > > > > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > > > > this aligned with the introduction of vfio cdevs.    
> > > > 
> > > > Yes, that would be much better if it was one package.
> > > > 
> > > > But this is starting to grow and we have so many threads that need to
> > > > progress blocked on this cdev enablement :(
> > > > 
> > > > Could we go forward with the cdev main patches and kconfig it to
> > > > experimental or something while the rest of the parts are completed
> > > > and tested through qemu? ie move the vfio-pci reset enablment to after
> > > > the cdev patches?  
> > > 
> > > We need to be able to guarantee that there cannot be any significant
> > > builds of the kernel with vfio cdev support if our intention is to stage
> > > it for libvirt.  We don't have a global EXPERIMENTAL config option any
> > > more.  Adding new code under BROKEN seems wrong.  Fedora ships with
> > > STAGING enabled.  A sternly worded Kconfig entry is toothless.  What is
> > > the proposed mechanism to make this not look like a big uncompiled code
> > > dump?  Thanks,  
> > 
> > I would suggest a sternly worded kconfig and STAGING.
> > 
> > This isn't such a big issue, we are trying to say that a future
> > released qemu is not required to work on older kernels with a STAGING
> > kconfig mark.
> > 
> > IOW we are saying that qemu release X.0 with production iommufd
> > requires kernel version > x.y and just lightly reflecting this into
> > the kconfig.
> > 
> > qemu should simply not support iommufd if it finds itself on a old
> > kernel.
> 
> Inferring features based on kernel versions doesn't work in a world
> where downstreams backport features to older kernels. 

I don't mean actual kernel versions as a compatibility test. I mean it
as documentation and an expected "support" window.

ie we are disclaiming support for STAGING kernel as a matter of
documentation, not code.

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22 12:17                             ` Jason Gunthorpe
@ 2023-03-22 13:33                               ` Liu, Yi L
  2023-03-22 13:43                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-22 13:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu,  Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 22, 2023 8:18 PM
>
> On Wed, Mar 22, 2023 at 08:17:54AM +0000, Liu, Yi L wrote:
> 
> > Could you elaborate what is required with _INFO before libvirt can
> > use a FD pass?
> 
> Make a new _INFO that returns an array of dev_ids within the cdev's
> iommufd_ctx that are part of the reset group, eg the devset.
>
> qemu will call this for each dev_id after it opens the cdev to
> generate the groupings.

Thanks. So this new _INFO only reports a limited scope instead of
the full list of affected devices. Also, the scope is not static, since
a device may be opened just after the _INFO returns.

> > > But for the current qemu setup it will open cdev directly and it will
> > > know the BDF so it can still use the current _INFO.
> > >
> > > Though it would be nice if qemu didn't need two implementations so Yi
> > > I'd rather see a new info in this series as well and qemu can just
> > > consistently use dev_id and never bdf in iommufd mode.
> >
> > I have one concern here. iommufd dev_id is not a static info as much as
> > bdf. It is generated when bound to iommufd. So if there are devices that
> > are affected but not bound to iommufd yet at the time of invoking _INFO,
> > then the _INFO ioctl just gets a subset of the affected devices. Is it enough?
> 
> I'd probably use similar logic as the reset path, if one of reset
> group devices is open and on a different iommufd_ctx then fail the
> IOCTL with EPERM.

Say there are three devices in the dev_set. When the first device is
opened by the current qemu, this new _INFO would return one dev_id
to the user. When the second device is opened, this new _INFO would
return two dev_ids. If the third device is opened by another qemu, then
the new _INFO would fail, since the first two devices were opened with
a different iommufd_ctx from the third device.
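
To make the three-device scenario concrete, here is a minimal user-space
model of the behaviour being discussed. It is only a sketch: struct
model_dev, model_reset_group_info() and the integer ictx stand-in are
hypothetical names invented for illustration, not kernel structures or
the proposed uAPI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not the kernel's structures. */
struct model_dev {
	int open;		/* non-zero once some user has opened the device */
	int ictx;		/* stand-in for the opener's iommufd_ctx */
	uint32_t dev_id;	/* iommufd dev_id, only meaningful while open */
};

/*
 * Collect the dev_ids of all opened devices in the set on behalf of a
 * caller that owns context caller_ictx.  Returns the number of ids
 * written to ids[], or -EPERM if any opened device belongs to a
 * different context.  ids[] must have room for n entries.
 */
static int model_reset_group_info(const struct model_dev *set, size_t n,
				  int caller_ictx, uint32_t *ids)
{
	size_t i;
	int count = 0;

	assert(set != NULL && ids != NULL);

	for (i = 0; i < n; i++) {
		if (!set[i].open)
			continue;	/* unopened devices are not reported */
		if (set[i].ictx != caller_ictx)
			return -EPERM;	/* opened under another context */
		ids[count++] = set[i].dev_id;
	}
	return count;
}
```

With one device opened under the caller's context the model reports one
dev_id, with two it reports two, and as soon as a third device is opened
under a different context every query fails with -EPERM.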

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22 13:33                               ` Liu, Yi L
@ 2023-03-22 13:43                                 ` Jason Gunthorpe
  2023-03-23  3:15                                   ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-22 13:43 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:

> Thanks. So this new _INFO only reports a limited scope instead of
> the full list of affected devices. Also, it is not static scope since device
> may be opened just after the _INFO returns.

Yes, it would be simplest for qemu to do the query after it gains a
new dev_id and then it can add the new dev_id with the correct reset
group.

> > I'd probably use similar logic as the reset path, if one of reset
> > group devices is open and on a different iommufd_ctx then fail the
> > IOCTL with EPERM.
> 
> Say there are three devices in the dev_set. When the first device is
> opened by the current qemu, this new _INFO would return one dev_id
> to user. When the second device is opened, this new _INFO will return
> two dev_ids to user.

Yes

> If the third device is opened by another qemu, then
> the new _INFO would fail since the former two devices were opened and
> have different iommufd_ctx with the third device.

Yes

qemu should refuse to use the device at this moment.

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace
  2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace Yi Liu
@ 2023-03-22 14:10   ` Xu Yilun
  2023-03-28  3:48     ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Xu Yilun @ 2023-03-22 14:10 UTC (permalink / raw)
  To: Yi Liu
  Cc: mjrosato, jasowang, xudong.hao, peterx, terrence.xu, chao.p.peng,
	linux-s390, kvm, lulu, joro, nicolinc, jgg, yan.y.zhao,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On 2023-03-08 at 05:28:44 -0800, Yi Liu wrote:
> This defines KVM_DEV_VFIO_FILE* and make alias with KVM_DEV_VFIO_GROUP*.
> Old userspace uses KVM_DEV_VFIO_GROUP* works as well.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Terrence Xu <terrence.xu@intel.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> ---
>  Documentation/virt/kvm/devices/vfio.rst | 52 +++++++++++++++++--------
>  include/uapi/linux/kvm.h                | 16 ++++++--
>  virt/kvm/vfio.c                         | 16 ++++----
>  3 files changed, 55 insertions(+), 29 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/devices/vfio.rst b/Documentation/virt/kvm/devices/vfio.rst
> index 79b6811bb4f3..5b05b48abaab 100644
> --- a/Documentation/virt/kvm/devices/vfio.rst
> +++ b/Documentation/virt/kvm/devices/vfio.rst
> @@ -9,24 +9,37 @@ Device types supported:
>    - KVM_DEV_TYPE_VFIO
>  
>  Only one VFIO instance may be created per VM.  The created device
> -tracks VFIO groups in use by the VM and features of those groups
> -important to the correctness and acceleration of the VM.  As groups
> -are enabled and disabled for use by the VM, KVM should be updated
> -about their presence.  When registered with KVM, a reference to the
> -VFIO-group is held by KVM.
> +tracks VFIO files (group or device) in use by the VM and features
> +of those groups/devices important to the correctness and acceleration
> +of the VM.  As groups/devices are enabled and disabled for use by the
> +VM, KVM should be updated about their presence.  When registered with
> +KVM, a reference to the VFIO file is held by KVM.
>  
>  Groups:
> -  KVM_DEV_VFIO_GROUP
> -
> -KVM_DEV_VFIO_GROUP attributes:
> -  KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> -	kvm_device_attr.addr points to an int32_t file descriptor
> -	for the VFIO group.
> -  KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> -	kvm_device_attr.addr points to an int32_t file descriptor
> -	for the VFIO group.
> -  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> +  KVM_DEV_VFIO_FILE
> +	alias: KVM_DEV_VFIO_GROUP
> +
> +KVM_DEV_VFIO_FILE attributes:
> +  KVM_DEV_VFIO_FILE_ADD: Add a VFIO file (group/device) to VFIO-KVM device
> +	tracking
> +
> +	alias: KVM_DEV_VFIO_GROUP_ADD
> +
> +	kvm_device_attr.addr points to an int32_t file descriptor for the
> +	VFIO file.

A blank line here to be consistent with other attributes.

> +  KVM_DEV_VFIO_FILE_DEL: Remove a VFIO file (group/device) from VFIO-KVM
> +	device tracking
> +
> +	alias: KVM_DEV_VFIO_GROUP_DEL
> +
> +	kvm_device_attr.addr points to an int32_t file descriptor for the
> +	VFIO file.
> +
> +  KVM_DEV_VFIO_FILE_SET_SPAPR_TCE: attaches a guest visible TCE table
>  	allocated by sPAPR KVM.
> +
> +	alias: KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
> +
>  	kvm_device_attr.addr points to a struct::
>  
>  		struct kvm_vfio_spapr_tce {
> @@ -40,9 +53,14 @@ KVM_DEV_VFIO_GROUP attributes:
>  	- @tablefd is a file descriptor for a TCE table allocated via
>  	  KVM_CREATE_SPAPR_TCE.
>  
> +	only accepts vfio group file as SPAPR has no iommufd support
> +
>  ::
>  
> -The GROUP_ADD operation above should be invoked prior to accessing the
> +The FILE/GROUP_ADD operation above should be invoked prior to accessing the
>  device file descriptor via VFIO_GROUP_GET_DEVICE_FD in order to support
>  drivers which require a kvm pointer to be set in their .open_device()
> -callback.
> +callback.  It is the same for device file descriptor via character device
> +open which gets device access via VFIO_DEVICE_BIND_IOMMUFD.  For such file
> +descriptors, FILE_ADD should be invoked before VFIO_DEVICE_BIND_IOMMUFD
> +to support the drivers mentioned in piror sentence as well.

s/piror/prior

Thanks,
Yilun

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-22 13:43                                 ` Jason Gunthorpe
@ 2023-03-23  3:15                                   ` Liu, Yi L
  2023-03-23 12:02                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-23  3:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu,  Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, March 22, 2023 9:43 PM
> 
> On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> 
> > Thanks. So this new _INFO only reports a limited scope instead of
> > the full list of affected devices. Also, it is not static scope since device
> > may be opened just after the _INFO returns.
> 
> Yes, it would be simplest for qemu to do the query after it gains a
> new dev_id and then it can add the new dev_id with the correct reset
> group.

I see. QEMU can decide. For now, it seems that QEMU doesn't store
the info returned by the existing _INFO ioctl. It is used in the hot
reset helper and freed before the helper returns. I'm not sure whether
QEMU will store the dev_id info returned by the new _INFO, though.
Perhaps Alex can give some guidance.

> > > I'd probably use similar logic as the reset path, if one of reset
> > > group devices is open and on a different iommufd_ctx then fail the
> > > IOCTL with EPERM.
> >
> > Say there are three devices in the dev_set. When the first device is
> > opened by the current qemu, this new _INFO would return one dev_id
> > to user. When the second device is opened, this new _INFO will return
> > two dev_ids to user.
> 
> Yes
> 
> > If the third device is opened by another qemu, then
> > the new _INFO would fail since the former two devices were opened and
> > have different iommufd_ctx with the third device.
> 
> Yes
> 
> qemu should refuse to use the device at this moment.

Yes, it is.

btw, another question about this new _INFO ioctl. If there are affected
devices that have not been bound to a vfio driver yet, should this new
_INFO ioctl fail all the same with EPERM? Or should it still just return
the dev_ids of the devices that are already bound and opened? The latter
seems to make sense for two reasons:
 - first, the new _INFO is not designed to give userspace an exact
   list of affected devices;
 - second, reset will fail anyway when checking
   vfio_pci_dev_set_resettable().
Please feel free to correct me if this is wrong.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-23  3:15                                   ` Liu, Yi L
@ 2023-03-23 12:02                                     ` Jason Gunthorpe
  2023-03-24  9:25                                       ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-23 12:02 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, March 22, 2023 9:43 PM
> > 
> > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> > 
> > > Thanks. So this new _INFO only reports a limited scope instead of
> > > the full list of affected devices. Also, it is not static scope since device
> > > may be opened just after the _INFO returns.
> > 
> > Yes, it would be simplest for qemu to do the query after it gains a
> > new dev_id and then it can add the new dev_id with the correct reset
> > group.
> 
> I see. QEMU can decide. For now, it seems like QEMU doesn't store
> such the info return by the existing _INFO ioctl. It is used in the hot
> reset helper and freed before it returns. Though, I'm not sure whether
> QEMU will store the dev_id info returned by the new _INFO. Perhaps
> Alex can give some guidance.

That seems a bit confusing. qemu should take the reset group
information and encode it so that the guest can know it as well.

If all it does is blindly invoke the hot_reset with the right
parameters then what was the point of all this discussion?
 
> btw.  Another question about this new _INFO ioctl. If there are affected
> devices that have not been bound to vfio driver yet, should this new _INFO
> ioctl fail all the same with EPERM? 

Yeah, it should EPERM the same as the normal hot reset if it can't
marshal the device list.

Jason

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-21 21:01                           ` Alex Williamson
  2023-03-21 22:20                             ` Jason Gunthorpe
@ 2023-03-24  9:09                             ` Tian, Kevin
  2023-03-24 13:14                               ` Jason Gunthorpe
  1 sibling, 1 reply; 103+ messages in thread
From: Tian, Kevin @ 2023-03-24  9:09 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: linux-s390, suravee.suthikulpanit, Liu, Yi L, Zhao, Yan Y,
	mjrosato, kvm, intel-gvt-dev, jasowang, cohuck, Hao, Xudong,
	robin.murphy, peterx, eric.auger, Xu, Terrence, nicolinc,
	shameerali.kolothum.thodi, yi.y.sun, chao.p.peng, lulu,
	intel-gfx, joro

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, March 22, 2023 5:01 AM
> 
> On Tue, 21 Mar 2023 17:50:08 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> >
> > Though it would be nice if qemu didn't need two implementations so Yi
> > I'd rather see a new info in this series as well and qemu can just
> > consistently use dev_id and never bdf in iommufd mode.
> 
> We also need to consider how libvirt determines if QEMU has the kernel
> support it needs to pass file descriptors.  It'd be a lot cleaner if
> this aligned with the introduction of vfio cdevs.
> 

Libvirt can check whether the kernel creates cdev for a given device
via sysfs.
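
As a rough sketch of such a probe (an assumption about the layout, not
something this thread specifies): the helper below treats a device as
having a cdev if its PCI sysfs directory contains vfio-dev/vfioN/dev.
The function name has_vfio_cdev() and the exact path layout are
illustrative assumptions and should be checked against the final
series.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Return 1 if <pci_sysfs_dir>/vfio-dev/vfioN/dev exists for some entry
 * vfioN, 0 otherwise.  The sysfs layout is an assumption made for this
 * sketch.
 */
static int has_vfio_cdev(const char *pci_sysfs_dir)
{
	char path[4096];
	struct dirent *de;
	DIR *dir;
	int n, found = 0;

	assert(pci_sysfs_dir != NULL);

	n = snprintf(path, sizeof(path), "%s/vfio-dev", pci_sysfs_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return 0;

	dir = opendir(path);
	if (!dir)
		return 0;	/* no vfio-dev directory at all */

	while (!found && (de = readdir(dir)) != NULL) {
		if (strncmp(de->d_name, "vfio", 4) != 0)
			continue;	/* skips "." and ".." as well */
		n = snprintf(path, sizeof(path), "%s/vfio-dev/%s/dev",
			     pci_sysfs_dir, de->d_name);
		if (n > 0 && (size_t)n < sizeof(path) &&
		    access(path, F_OK) == 0)
			found = 1;	/* a char device number is exposed */
	}
	closedir(dir);
	return found;
}
```

A caller might pass something like /sys/bus/pci/devices/<BDF>. Probing
for the node rather than comparing kernel versions keeps working when a
downstream backports the feature, but it only answers the kernel side of
the question.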

But I'm not sure how Libvirt determines whether QEMU supports a
feature that it wants to use. This sounds like a general handshake
problem, though: Libvirt needs to support many versions of QEMU, so
there must already be a way to do such negotiation?

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-23 12:02                                     ` Jason Gunthorpe
@ 2023-03-24  9:25                                       ` Liu, Yi L
  2023-03-27 11:57                                         ` Liu, Yi L
  0 siblings, 1 reply; 103+ messages in thread
From: Liu, Yi L @ 2023-03-24  9:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu,  Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 23, 2023 8:02 PM
> 
> On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, March 22, 2023 9:43 PM
> > >
> > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> > >
> > > > Thanks. So this new _INFO only reports a limited scope instead of
> > > > the full list of affected devices. Also, it is not static scope since device
> > > > may be opened just after the _INFO returns.
> > >
> > > Yes, it would be simplest for qemu to do the query after it gains a
> > > new dev_id and then it can add the new dev_id with the correct reset
> > > group.
> >
> > I see. QEMU can decide. For now, it seems like QEMU doesn't store
> > such the info return by the existing _INFO ioctl. It is used in the hot
> > reset helper and freed before it returns. Though, I'm not sure whether
> > QEMU will store the dev_id info returned by the new _INFO. Perhaps
> > Alex can give some guidance.
> 
> That seems a bit confusing, qemu should take the reset group
> information and encode it so that the guest can know it as well.
> 
> If all it does is blindly invoke the hot_reset with the right
> parameters then what was the point of all this discussion?
> 
> > btw.  Another question about this new _INFO ioctl. If there are affected
> > devices that have not been bound to vfio driver yet, should this new _INFO
> > ioctl fail all the same with EPERM?
> 
> Yeah, it should EPERM the same as the normal hot reset if it can't
> marshal the device list.

Hi Jason, Alex,

I have a draft patch to add the new _INFO. It checks whether all the affected
devices are in the dev_set, and then gets the dev_ids of all the opened devices
within the dev_set. There is still one thing that is not quite clear: the
noiommu mode. In this mode, there is no iommufd bind, so no dev_id. For now, I
just fail this new _INFO ioctl if there is no iommufd_device. Hence, this new
_INFO is not available to users operating in noiommu mode. Is this acceptable?

From e763474e255ff9805b1fb76d6b6b9ccedb61058f Mon Sep 17 00:00:00 2001
From: Yi Liu <yi.l.liu@intel.com>
Date: Fri, 24 Mar 2023 00:54:08 -0700
Subject: [PATCH 06/10] vfio/pci: Add VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO

to report the affected devices for a given device's hot reset. The result
is a list of the iommufd dev_ids of the devices opened by the user. If any
affected device is not bound to a vfio driver, or is opened by another
user, the ioctl fails with -EPERM. In noiommu mode on the vfio device cdev
path, the ioctl also fails, as no dev_id is generated.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 98 ++++++++++++++++++++++++++++++++
 include/uapi/linux/vfio.h        | 28 +++++++++
 2 files changed, 126 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b68fcba67a4b..5789933a64ae 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1181,6 +1181,102 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static struct pci_dev *
+vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set);
+
+static int vfio_pci_ioctl_get_pci_hot_reset_group_info(
+	struct vfio_pci_core_device *vdev,
+	struct vfio_pci_hot_reset_group_info __user *arg)
+{
+	unsigned long minsz =
+		offsetofend(struct vfio_pci_hot_reset_group_info, count);
+	struct vfio_pci_hot_reset_group_info hdr;
+	struct iommufd_ctx *iommufd, *cur_iommufd;
+	u32 count = 0, index = 0, *devices = NULL;
+	struct vfio_pci_core_device *cur;
+	bool slot = false;
+	int ret = 0;
+
+	if (copy_from_user(&hdr, arg, minsz))
+		return -EFAULT;
+
+	if (hdr.argsz < minsz)
+		return -EINVAL;
+
+	hdr.flags = 0;
+
+	/* Can we do a slot or bus reset or neither? */
+	if (!pci_probe_reset_slot(vdev->pdev->slot))
+		slot = true;
+	else if (pci_probe_reset_bus(vdev->pdev->bus))
+		return -ENODEV;
+
+	mutex_lock(&vdev->vdev.dev_set->lock);
+	if (!vfio_pci_dev_set_resettable(vdev->vdev.dev_set)) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	iommufd = vfio_iommufd_physical_ictx(&vdev->vdev);
+	if (!iommufd) {
+		ret = -EPERM;
+		goto out_unlock;
+	}
+
+	/* How many devices are affected? */
+	ret = vfio_pci_for_each_slot_or_bus(vdev->pdev, vfio_pci_count_devs,
+					    &count, slot);
+	if (ret)
+		goto out_unlock;
+
+	WARN_ON(!count); /* Should always be at least one */
+
+	/*
+	 * If there's enough space, fill it now, otherwise return -ENOSPC and
+	 * the number of devices affected.
+	 */
+	if (hdr.argsz < sizeof(hdr) + (count * sizeof(*devices))) {
+		ret = -ENOSPC;
+		hdr.count = count;
+		goto reset_info_exit;
+	}
+
+	devices = kcalloc(count, sizeof(*devices), GFP_KERNEL);
+	if (!devices) {
+		ret = -ENOMEM;
+		goto reset_info_exit;
+	}
+
+	list_for_each_entry(cur, &vdev->vdev.dev_set->device_list, vdev.dev_set_list) {
+		cur_iommufd = vfio_iommufd_physical_ictx(&cur->vdev);
+		if (cur->vdev.open_count) {
+			if (cur_iommufd != iommufd) {
+				ret = -EPERM;
+				break;
+			}
+			ret = vfio_iommufd_physical_devid(&cur->vdev, &devices[index]);
+			if (ret)
+				break;
+			index++;
+		}
+	}
+
+reset_info_exit:
+	if (copy_to_user(arg, &hdr, minsz))
+		ret = -EFAULT;
+
+	if (!ret) {
+		if (copy_to_user(&arg->devices, devices,
+				 index * sizeof(*devices)))
+			ret = -EFAULT;
+	}
+
+	kfree(devices);
+out_unlock:
+	mutex_unlock(&vdev->vdev.dev_set->lock);
+	return ret;
+}
+
 static int vfio_pci_ioctl_get_pci_hot_reset_info(
 	struct vfio_pci_core_device *vdev,
 	struct vfio_pci_hot_reset_info __user *arg)
@@ -1404,6 +1500,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		return vfio_pci_ioctl_get_irq_info(vdev, uarg);
 	case VFIO_DEVICE_GET_PCI_HOT_RESET_INFO:
 		return vfio_pci_ioctl_get_pci_hot_reset_info(vdev, uarg);
+	case VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO:
+		return vfio_pci_ioctl_get_pci_hot_reset_group_info(vdev, uarg);
 	case VFIO_DEVICE_GET_REGION_INFO:
 		return vfio_pci_ioctl_get_region_info(vdev, uarg);
 	case VFIO_DEVICE_IOEVENTFD:
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 17aa5d09db41..572497cda3ca 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -669,6 +669,43 @@ struct vfio_pci_hot_reset_info {
 
 #define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
 
+/**
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 18,
+ *						    struct vfio_pci_hot_reset_group_info)
+ *
+ * This is used in the vfio device cdev mode.  It returns the list of
+ * affected devices (represented by iommufd dev_id) when hot reset is
+ * issued on the device on which this ioctl is invoked.  It only
+ * includes the devices that are opened by the current user at the
+ * time this command is invoked, so the user should re-invoke it after
+ * opening new devices.  This information tells the user which of the
+ * affected devices it has opened, which helps the user decide whether
+ * VFIO_DEVICE_PCI_HOT_RESET can be issued.  However, even if the info
+ * returned by this ioctl suggests that hot reset is possible, hot
+ * reset can still fail, since an affected device that is not yet
+ * opened can be opened by another user in the window between the two
+ * ioctls.  See the description of VFIO_DEVICE_PCI_HOT_RESET for details.
+ *
+ * Return: 0 on success, -errno on failure:
+ *	-enospc = insufficient buffer;
+ *	-enodev = unsupported for device;
+ *	-eperm = no permission for device, this error occurs when:
+ *		 - there are affected devices that are opened but
+ *		   not bound to the same iommufd as the device on
+ *		   which this ioctl is invoked,
+ *		 - there are affected devices that are not bound to
+ *		   the vfio driver yet,
+ *		 - no valid iommufd is bound (e.g. noiommu mode).
+ */
+struct vfio_pci_hot_reset_group_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	__u32	devices[];
+};
+
+#define VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
 /**
  * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
  *				    struct vfio_pci_hot_reset)
-- 
2.34.1

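The -ENOSPC contract described in the patch above (too small a buffer
returns -ENOSPC together with the required count) suggests a two-pass
query from userspace. The following is only a minimal sketch of that
pattern, not code from the series: the struct is a local mirror of the
proposed layout, query_fn and fake_query() are hypothetical stand-ins for
the real ioctl() call so the example runs without a VFIO device, and it
assumes that count is also valid on success.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Local mirror of the struct proposed in the patch (not a released uAPI header). */
struct hot_reset_group_info {
	uint32_t argsz;
	uint32_t flags;
	uint32_t count;
	uint32_t devices[];
};

/* Hypothetical stand-in for ioctl(fd, VFIO_DEVICE_GET_PCI_HOT_RESET_GROUP_INFO, info). */
typedef int (*query_fn)(struct hot_reset_group_info *info, void *ctx);

/* Hypothetical fake "kernel side": models only the argsz/-ENOSPC buffer contract. */
struct fake_device_set {
	const uint32_t *dev_ids;
	uint32_t count;
};

int fake_query(struct hot_reset_group_info *info, void *ctx)
{
	const struct fake_device_set *set = ctx;
	size_t need = sizeof(*info) + (size_t)set->count * sizeof(uint32_t);

	info->flags = 0;
	info->count = set->count;	/* assumption: count is valid on success too */
	if (info->argsz < need)
		return -ENOSPC;
	memcpy(info->devices, set->dev_ids, set->count * sizeof(uint32_t));
	return 0;
}

/*
 * Two-pass query: probe with a header-only buffer to learn the count,
 * then retry with a buffer large enough for that many dev_ids.
 * Returns a malloc'd result (caller frees) or NULL with *err set.
 */
struct hot_reset_group_info *query_reset_group(query_fn query, void *ctx, int *err)
{
	struct hot_reset_group_info hdr = { .argsz = sizeof(hdr) };
	struct hot_reset_group_info *info;
	size_t size;
	int ret;

	assert(query != NULL);

	/* Pass 1: header only, so -ENOSPC is the expected outcome. */
	ret = query(&hdr, ctx);
	if (ret && ret != -ENOSPC) {
		*err = ret;
		return NULL;
	}

	size = sizeof(*info) + (size_t)hdr.count * sizeof(info->devices[0]);
	info = calloc(1, size);
	if (!info) {
		*err = -ENOMEM;
		return NULL;
	}
	info->argsz = (uint32_t)size;

	/* Pass 2: the buffer now has room for hdr.count entries. */
	ret = query(info, ctx);
	if (ret) {
		free(info);
		*err = ret;
		return NULL;
	}
	*err = 0;
	return info;
}
```

Since the set of opened devices can change between the two passes, a real
caller would probably loop on -ENOSPC rather than retry exactly once.
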

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support (rev2)
  2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
                   ` (24 preceding siblings ...)
  2023-03-14 11:02 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support Patchwork
@ 2023-03-24 10:39 ` Patchwork
  25 siblings, 0 replies; 103+ messages in thread
From: Patchwork @ 2023-03-24 10:39 UTC (permalink / raw)
  To: Liu, Yi L; +Cc: intel-gfx

== Series Details ==

Series: cover-letter: Add vfio_device cdev for iommufd support (rev2)
URL   : https://patchwork.freedesktop.org/series/114850/
State : failure

== Summary ==

Error: patch https://patchwork.freedesktop.org/api/1.0/series/114850/revisions/2/mbox/ not applied
Applying: vfio: Allocate per device file structure
Applying: vfio: Refine vfio file kAPIs for KVM
Applying: vfio: Accept vfio device file in the KVM facing kAPI
Applying: kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device fd
Applying: kvm/vfio: Accept vfio device file from userspace
error: sha1 information is lacking or useless (Documentation/virt/kvm/devices/vfio.rst).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0005 kvm/vfio: Accept vfio device file from userspace
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
Build failed, no error log produced



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-24  9:09                             ` Tian, Kevin
@ 2023-03-24 13:14                               ` Jason Gunthorpe
  0 siblings, 0 replies; 103+ messages in thread
From: Jason Gunthorpe @ 2023-03-24 13:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, Liu, Yi L, kvm, lulu, joro, nicolinc,
	Zhao, Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun,
	cohuck, shameerali.kolothum.thodi, suravee.suthikulpanit,
	robin.murphy

On Fri, Mar 24, 2023 at 09:09:59AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, March 22, 2023 5:01 AM
> > 
> > On Tue, 21 Mar 2023 17:50:08 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > 
> > >
> > > Though it would be nice if qemu didn't need two implementations so Yi
> > > I'd rather see a new info in this series as well and qemu can just
> > > consistently use dev_id and never bdf in iommufd mode.
> > 
> > We also need to consider how libvirt determines if QEMU has the kernel
> > support it needs to pass file descriptors.  It'd be a lot cleaner if
> > this aligned with the introduction of vfio cdevs.
> > 
> 
> Libvirt can check whether the kernel creates cdev for a given device
> via sysfs.
> 
> but I'm not sure how Libvirt determines whether QEMU supports a
> feature that it wants to use. But sounds this is a general handshake
> problem as Libvirt needs to support many versions of QEMU then
> there must be a way for such negotiation?

Ideally libvirt would be able to learn what ioctls are supported on
the cdev fd after it opens it, but before binding. I don't think we
have something here yet, but I could imagine having something.

Clearly it is easier if cdev alone is enough proof to know that it
should go ahead in cdev mode.

Jason

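The sysfs check mentioned in the quoted text could look roughly like the
sketch below. It is not from the thread. It assumes the layout
<device>/vfio-dev/vfioX/dev, where the "dev" attribute is present only when
a character device was created for the device; that layout and the helper
name vfio_cdev_present() are assumptions made for illustration. As noted
above, finding a cdev this way says nothing about which ioctls the kernel
supports on it.

```c
#include <assert.h>
#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: returns true if <sysfs_dev_path>/vfio-dev/vfioX/dev
 * exists for some X. The sysfs layout is an assumption, see the text above.
 */
bool vfio_cdev_present(const char *sysfs_dev_path)
{
	char path[4096];
	struct dirent *ent;
	struct stat st;
	bool found = false;
	DIR *dir;
	int n;

	assert(sysfs_dev_path != NULL);

	n = snprintf(path, sizeof(path), "%s/vfio-dev", sysfs_dev_path);
	if (n < 0 || (size_t)n >= sizeof(path))
		return false;

	dir = opendir(path);
	if (!dir)
		return false;	/* no vfio-dev directory: not bound to vfio */

	while (!found && (ent = readdir(dir)) != NULL) {
		/* Only entries named vfioX are of interest. */
		if (strncmp(ent->d_name, "vfio", 4) != 0)
			continue;
		n = snprintf(path, sizeof(path), "%s/vfio-dev/%s/dev",
			     sysfs_dev_path, ent->d_name);
		if (n < 0 || (size_t)n >= sizeof(path))
			continue;
		/* A regular "dev" file carries the cdev's major:minor. */
		found = stat(path, &st) == 0 && S_ISREG(st.st_mode);
	}
	closedir(dir);
	return found;
}
```

A caller would pass a device directory such as one under
/sys/bus/pci/devices/ and fall back to the group path when this returns
false.
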
^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET
  2023-03-24  9:25                                       ` Liu, Yi L
@ 2023-03-27 11:57                                         ` Liu, Yi L
  0 siblings, 0 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-27 11:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu,  Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, Zhao, Yan Y,
	intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Friday, March 24, 2023 5:25 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, March 23, 2023 8:02 PM
> >
> > On Thu, Mar 23, 2023 at 03:15:20AM +0000, Liu, Yi L wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Wednesday, March 22, 2023 9:43 PM
> > > >
> > > > On Wed, Mar 22, 2023 at 01:33:09PM +0000, Liu, Yi L wrote:
> > > >
> > > > > Thanks. So this new _INFO only reports a limited scope instead of
> > > > > the full list of affected devices. Also, it is not static scope since device
> > > > > may be opened just after the _INFO returns.
> > > >
> > > > Yes, it would be simplest for qemu to do the query after it gains a
> > > > new dev_id and then it can add the new dev_id with the correct reset
> > > > group.
> > >
> > > I see. QEMU can decide. For now, it seems like QEMU doesn't store
> > > such the info return by the existing _INFO ioctl. It is used in the hot
> > > reset helper and freed before it returns. Though, I'm not sure whether
> > > QEMU will store the dev_id info returned by the new _INFO. Perhaps
> > > Alex can give some guidance.
> >
> > That seems a bit confusing, qemu should take the reset group
> > information and encode it so that the guest can know it as well.
> >
> > If all it does is blindly invoke the hot_reset with the right
> > parameters then what was the point of all this discussion?
> >
> > > btw.  Another question about this new _INFO ioctl. If there are affected
> > > devices that have not been bound to vfio driver yet, should this new _INFO
> > > ioctl fail all the same with EPERM?
> >
> > Yeah, it should EPERM the same as the normal hot reset if it can't
> > marshal the device list.
> 
> Hi Jason, Alex,
> 
> I got a draft patch to add the new _INFO? It checks if all the affected devices
> are in the dev_set, and then gets the dev_id of all the opened devices within
> the dev_set. There is still one thing not quite clear. It is the noiommu mode.
> In this mode, there is no iommufd bind, so no dev_id. For now, I just fail this
> new _INFO ioctl if there is no iommufd_device. Hence, this new _INFO is not
> available for users that operating in noiommu mode. Is this acceptable?

The latest version of the _INFO ioctl is at the link below.

https://lore.kernel.org/kvm/20230327093458.44939-11-yi.l.liu@intel.com/


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace
  2023-03-22 14:10   ` Xu Yilun
@ 2023-03-28  3:48     ` Liu, Yi L
  0 siblings, 0 replies; 103+ messages in thread
From: Liu, Yi L @ 2023-03-28  3:48 UTC (permalink / raw)
  To: Xu, Yilun
  Cc: mjrosato, jasowang, Hao, Xudong, peterx, Xu, Terrence,
	chao.p.peng, linux-s390, kvm, lulu, joro, nicolinc, jgg, Zhao,
	Yan Y, intel-gfx, eric.auger, intel-gvt-dev, yi.y.sun, cohuck,
	shameerali.kolothum.thodi, suravee.suthikulpanit, robin.murphy

> From: Xu, Yilun <yilun.xu@intel.com>
> Sent: Wednesday, March 22, 2023 10:11 PM
> 
> On 2023-03-08 at 05:28:44 -0800, Yi Liu wrote:
> > This defines KVM_DEV_VFIO_FILE* and make alias with KVM_DEV_VFIO_GROUP*.
> > Old userspace uses KVM_DEV_VFIO_GROUP* works as well.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Tested-by: Terrence Xu <terrence.xu@intel.com>
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> > Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > ---
> >  Documentation/virt/kvm/devices/vfio.rst | 52 +++++++++++++++++--------
> >  include/uapi/linux/kvm.h                | 16 ++++++--
> >  virt/kvm/vfio.c                         | 16 ++++----
> >  3 files changed, 55 insertions(+), 29 deletions(-)
> >
> > diff --git a/Documentation/virt/kvm/devices/vfio.rst b/Documentation/virt/kvm/devices/vfio.rst
> > index 79b6811bb4f3..5b05b48abaab 100644
> > --- a/Documentation/virt/kvm/devices/vfio.rst
> > +++ b/Documentation/virt/kvm/devices/vfio.rst
> > @@ -9,24 +9,37 @@ Device types supported:
> >    - KVM_DEV_TYPE_VFIO
> >
> >  Only one VFIO instance may be created per VM.  The created device
> > -tracks VFIO groups in use by the VM and features of those groups
> > -important to the correctness and acceleration of the VM.  As groups
> > -are enabled and disabled for use by the VM, KVM should be updated
> > -about their presence.  When registered with KVM, a reference to the
> > -VFIO-group is held by KVM.
> > +tracks VFIO files (group or device) in use by the VM and features
> > +of those groups/devices important to the correctness and acceleration
> > +of the VM.  As groups/devices are enabled and disabled for use by the
> > +VM, KVM should be updated about their presence.  When registered with
> > +KVM, a reference to the VFIO file is held by KVM.
> >
> >  Groups:
> > -  KVM_DEV_VFIO_GROUP
> > -
> > -KVM_DEV_VFIO_GROUP attributes:
> > -  KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
> > -	kvm_device_attr.addr points to an int32_t file descriptor
> > -	for the VFIO group.
> > -  KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
> > -	kvm_device_attr.addr points to an int32_t file descriptor
> > -	for the VFIO group.
> > -  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
> > +  KVM_DEV_VFIO_FILE
> > +	alias: KVM_DEV_VFIO_GROUP
> > +
> > +KVM_DEV_VFIO_FILE attributes:
> > +  KVM_DEV_VFIO_FILE_ADD: Add a VFIO file (group/device) to VFIO-KVM device
> > +	tracking
> > +
> > +	alias: KVM_DEV_VFIO_GROUP_ADD
> > +
> > +	kvm_device_attr.addr points to an int32_t file descriptor for the
> > +	VFIO file.
> 
> A blank line here to be consistent with other attributes.
> 
> > +  KVM_DEV_VFIO_FILE_DEL: Remove a VFIO file (group/device) from VFIO-KVM
> > +	device tracking
> > +
> > +	alias: KVM_DEV_VFIO_GROUP_DEL
> > +
> > +	kvm_device_attr.addr points to an int32_t file descriptor for the
> > +	VFIO file.
> > +
> > +  KVM_DEV_VFIO_FILE_SET_SPAPR_TCE: attaches a guest visible TCE table
> >  	allocated by sPAPR KVM.
> > +
> > +	alias: KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
> > +
> >  	kvm_device_attr.addr points to a struct::
> >
> >  		struct kvm_vfio_spapr_tce {
> > @@ -40,9 +53,14 @@ KVM_DEV_VFIO_GROUP attributes:
> >  	- @tablefd is a file descriptor for a TCE table allocated via
> >  	  KVM_CREATE_SPAPR_TCE.
> >
> > +	only accepts vfio group file as SPAPR has no iommufd support
> > +
> >  ::
> >
> > -The GROUP_ADD operation above should be invoked prior to accessing the
> > +The FILE/GROUP_ADD operation above should be invoked prior to accessing the
> >  device file descriptor via VFIO_GROUP_GET_DEVICE_FD in order to support
> >  drivers which require a kvm pointer to be set in their .open_device()
> > -callback.
> > +callback.  It is the same for device file descriptor via character device
> > +open which gets device access via VFIO_DEVICE_BIND_IOMMUFD.  For such file
> > +descriptors, FILE_ADD should be invoked before VFIO_DEVICE_BIND_IOMMUFD
> > +to support the drivers mentioned in piror sentence as well.
> 
> s/piror/prior

Fixed in v8, thanks.

https://lore.kernel.org/kvm/ZCHW%2FQIKQVhBjg%2Fx@Asurada-Nvidia/T/#m7c076d3b9450c9de62b99a6fcefcc31c8ae3f8da



^ permalink raw reply	[flat|nested] 103+ messages in thread

end of thread, other threads:[~2023-03-28  3:48 UTC | newest]

Thread overview: 103+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-08 13:28 [Intel-gfx] [PATCH v6 00/24] cover-letter: Add vfio_device cdev for iommufd support Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 01/24] vfio: Allocate per device file structure Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 02/24] vfio: Refine vfio file kAPIs for KVM Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 03/24] vfio: Accept vfio device file in the KVM facing kAPI Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 04/24] kvm/vfio: Rename kvm_vfio_group to prepare for accepting vfio device fd Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 05/24] kvm/vfio: Accept vfio device file from userspace Yi Liu
2023-03-22 14:10   ` Xu Yilun
2023-03-28  3:48     ` Liu, Yi L
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 06/24] vfio: Pass struct vfio_device_file * to vfio_device_open/close() Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 07/24] vfio: Block device access via device fd until device is opened Yi Liu
2023-03-10  4:50   ` Tian, Kevin
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 08/24] vfio/pci: Update comment around group_fd get in vfio_pci_ioctl_pci_hot_reset() Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 09/24] vfio/pci: Only need to check opened devices in the dev_set for hot reset Yi Liu
2023-03-10  5:00   ` Tian, Kevin
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 10/24] vfio/pci: Rename the helpers and data in hot reset path to accept device fd Yi Liu
2023-03-10  5:01   ` Tian, Kevin
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 11/24] vfio/pci: Accept device fd in VFIO_DEVICE_PCI_HOT_RESET ioctl Yi Liu
2023-03-10  5:08   ` Tian, Kevin
2023-03-10  5:38     ` Liu, Yi L
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 12/24] vfio/pci: Allow passing zero-length fd array in VFIO_DEVICE_PCI_HOT_RESET Yi Liu
2023-03-10  5:31   ` Tian, Kevin
2023-03-10  6:04     ` Liu, Yi L
2023-03-10  9:08       ` Tian, Kevin
2023-03-10 17:42       ` Jason Gunthorpe
2023-03-15 22:53   ` Alex Williamson
2023-03-15 23:31     ` Tian, Kevin
2023-03-16  3:54       ` [Intel-gfx] [offlist] " Liu, Yi L
2023-03-16  6:09         ` [Intel-gfx] " Tian, Kevin
2023-03-16  6:28           ` Liu, Yi L
2023-03-16  6:49             ` Nicolin Chen
2023-03-16 13:22               ` Liu, Yi L
2023-03-16 21:27                 ` Nicolin Chen
2023-03-16 18:45       ` Alex Williamson
2023-03-16 23:29         ` Tian, Kevin
2023-03-17  0:22           ` Alex Williamson
2023-03-17  0:57             ` Tian, Kevin
2023-03-17 15:15               ` Alex Williamson
2023-03-20 17:14                 ` Jason Gunthorpe
2023-03-20 22:52                   ` Alex Williamson
2023-03-20 23:39                     ` Jason Gunthorpe
2023-03-21 20:31                       ` Alex Williamson
2023-03-21 20:50                         ` Jason Gunthorpe
2023-03-21 21:01                           ` Alex Williamson
2023-03-21 22:20                             ` Jason Gunthorpe
2023-03-21 22:47                               ` Alex Williamson
2023-03-22  4:42                                 ` Liu, Yi L
2023-03-22 12:23                                   ` Alex Williamson
2023-03-22 12:27                                 ` Jason Gunthorpe
2023-03-22 12:36                                   ` Alex Williamson
2023-03-22 12:47                                     ` Jason Gunthorpe
2023-03-24  9:09                             ` Tian, Kevin
2023-03-24 13:14                               ` Jason Gunthorpe
2023-03-22  8:17                           ` Liu, Yi L
2023-03-22 12:17                             ` Jason Gunthorpe
2023-03-22 13:33                               ` Liu, Yi L
2023-03-22 13:43                                 ` Jason Gunthorpe
2023-03-23  3:15                                   ` Liu, Yi L
2023-03-23 12:02                                     ` Jason Gunthorpe
2023-03-24  9:25                                       ` Liu, Yi L
2023-03-27 11:57                                         ` Liu, Yi L
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 13/24] vfio/iommufd: Split the compat_ioas attach out from vfio_iommufd_bind() Yi Liu
2023-03-10  8:08   ` Tian, Kevin
2023-03-10  8:22     ` Liu, Yi L
2023-03-10  9:10       ` Tian, Kevin
2023-03-11 10:24       ` Liu, Yi L
2023-03-13  2:06         ` Tian, Kevin
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 14/24] vfio: Add cdev_device_open_cnt to vfio_group Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 15/24] vfio: Make vfio_device_open() single open for device cdev path Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 16/24] vfio: Make vfio_device_first_open() to cover the noiommu mode in " Yi Liu
2023-03-10  8:30   ` Tian, Kevin
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 17/24] vfio-iommufd: Make vfio_iommufd_bind() selectively return devid Yi Liu
2023-03-10  8:31   ` Tian, Kevin
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 18/24] vfio-iommufd: Add detach_ioas support for physical VFIO devices Yi Liu
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 19/24] vfio-iommufd: Add detach_ioas support for emulated " Yi Liu
2023-03-10 23:42   ` Nicolin Chen
2023-03-15  6:15     ` Liu, Yi L
2023-03-15  6:25       ` Nicolin Chen
2023-03-08 13:28 ` [Intel-gfx] [PATCH v6 20/24] vfio: Add cdev for vfio_device Yi Liu
2023-03-10  8:48   ` Tian, Kevin
2023-03-10  9:59     ` Liu, Yi L
2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 21/24] vfio: Add VFIO_DEVICE_BIND_IOMMUFD Yi Liu
2023-03-10  9:01   ` Tian, Kevin
2023-03-10  9:58     ` Liu, Yi L
2023-03-10 10:06       ` Tian, Kevin
2023-03-15  4:40         ` Liu, Yi L
2023-03-15  6:57           ` Tian, Kevin
2023-03-20 14:09           ` Jason Gunthorpe
2023-03-20 14:31             ` Yi Liu
2023-03-20 17:16               ` Jason Gunthorpe
2023-03-21  1:30                 ` Tian, Kevin
2023-03-21 12:00                   ` Jason Gunthorpe
2023-03-21 14:37                     ` Liu, Yi L
2023-03-21 14:41                       ` Jason Gunthorpe
2023-03-21 14:51                         ` Liu, Yi L
2023-03-21 14:58                           ` Jason Gunthorpe
2023-03-21 15:10                             ` Liu, Yi L
2023-03-21 16:54                               ` Jason Gunthorpe
2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 22/24] vfio: Add VFIO_DEVICE_AT[DE]TACH_IOMMUFD_PT Yi Liu
2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 23/24] vfio: Compile group optionally Yi Liu
2023-03-10  9:03   ` Tian, Kevin
2023-03-08 13:29 ` [Intel-gfx] [PATCH v6 24/24] docs: vfio: Add vfio device cdev description Yi Liu
2023-03-14 11:02 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support Patchwork
2023-03-24 10:39 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for cover-letter: Add vfio_device cdev for iommufd support (rev2) Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).