All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/3] vfio/pci: Check the device set open count on reset
@ 2022-11-05 22:44 Anthony DeRossi
  2022-11-05 22:44 ` [PATCH v5 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Anthony DeRossi @ 2022-11-05 22:44 UTC (permalink / raw)
  To: kvm; +Cc: alex.williamson, cohuck, jgg, kevin.tian, abhsahu, yishaih

This series fixes an issue where devices bound to vfio-pci are not reset when
they are released. Skipping the reset has unpredictable results depending on
the device, and can cause errors when accessing the device later or binding to
a different driver.

This revision replaces the patch that introduced an open_count field in
vfio_device_set with one that exports a new function that can be used for the
same purpose, i.e., determining whether any device in the set is in use.

The first patch in this series fixes a life cycle issue that was discovered in
an earlier revision of the series.

Anthony

v4 -> v5:
- Replaced patch 2 with a patch that introduces a new function to get the
  open count of a device set
- Updated patch 3 to use the new function
v4: https://lore.kernel.org/kvm/20221104195727.4629-1-ajderossi@gmail.com/

v3 -> v4:
- Added a patch to fix device registration life cycle
- Added a patch to add a public open_count on vfio_device_set
- Changed the implementation to avoid private open_count usage
v3: https://lore.kernel.org/kvm/20221102055732.2110-1-ajderossi@gmail.com/

v2 -> v3:
- Added WARN_ON()
- Revised commit message
v2: https://lore.kernel.org/kvm/20221026194245.1769-1-ajderossi@gmail.com/

v1 -> v2:
- Changed reset behavior instead of open_count ordering
- Retitled from "vfio: Decrement open_count before close_device()"
v1: https://lore.kernel.org/kvm/20221025193820.4412-1-ajderossi@gmail.com/

Anthony DeRossi (3):
  vfio: Fix container device registration life cycle
  vfio: Export the device set open count
  vfio/pci: Check the device set open count on reset

 drivers/vfio/pci/vfio_pci_core.c | 10 +++++-----
 drivers/vfio/vfio_main.c         | 24 +++++++++++++++++++-----
 include/linux/vfio.h             |  1 +
 3 files changed, 25 insertions(+), 10 deletions(-)

-- 
2.37.4


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v5 1/3] vfio: Fix container device registration life cycle
  2022-11-05 22:44 [PATCH v5 0/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
@ 2022-11-05 22:44 ` Anthony DeRossi
  2022-11-09  0:43   ` Jason Gunthorpe
  2022-11-09  3:36   ` Tian, Kevin
  2022-11-05 22:44 ` [PATCH v5 2/3] vfio: Export the device set open count Anthony DeRossi
  2022-11-05 22:44 ` [PATCH v5 3/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
  2 siblings, 2 replies; 12+ messages in thread
From: Anthony DeRossi @ 2022-11-05 22:44 UTC (permalink / raw)
  To: kvm; +Cc: alex.williamson, cohuck, jgg, kevin.tian, abhsahu, yishaih

In vfio_device_open(), vfio_container_device_register() is always called
when open_count == 1. On error, vfio_device_container_unregister() is
only called when open_count == 1 and close_device is set. This leaks a
registration for devices without a close_device implementation.

In vfio_device_fops_release(), vfio_device_container_unregister() is
called unconditionally. This can cause a device to be unregistered
multiple times.

Treating container device registration/unregistration uniformly (always
when open_count == 1) fixes both issues.

Fixes: ce4b4657ff18 ("vfio: Replace the DMA unmapping notifier with a callback")
Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
---
 drivers/vfio/vfio_main.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 2d168793d4e1..9a4af880e941 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -801,8 +801,9 @@ static struct file *vfio_device_open(struct vfio_device *device)
 err_close_device:
 	mutex_lock(&device->dev_set->lock);
 	mutex_lock(&device->group->group_lock);
-	if (device->open_count == 1 && device->ops->close_device) {
-		device->ops->close_device(device);
+	if (device->open_count == 1) {
+		if (device->ops->close_device)
+			device->ops->close_device(device);
 
 		vfio_device_container_unregister(device);
 	}
@@ -1017,10 +1018,12 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep)
 	mutex_lock(&device->dev_set->lock);
 	vfio_assert_device_open(device);
 	mutex_lock(&device->group->group_lock);
-	if (device->open_count == 1 && device->ops->close_device)
-		device->ops->close_device(device);
+	if (device->open_count == 1) {
+		if (device->ops->close_device)
+			device->ops->close_device(device);
 
-	vfio_device_container_unregister(device);
+		vfio_device_container_unregister(device);
+	}
 	mutex_unlock(&device->group->group_lock);
 	device->open_count--;
 	if (device->open_count == 0)
-- 
2.37.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v5 2/3] vfio: Export the device set open count
  2022-11-05 22:44 [PATCH v5 0/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
  2022-11-05 22:44 ` [PATCH v5 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
@ 2022-11-05 22:44 ` Anthony DeRossi
  2022-11-08 23:52   ` Alex Williamson
                     ` (2 more replies)
  2022-11-05 22:44 ` [PATCH v5 3/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
  2 siblings, 3 replies; 12+ messages in thread
From: Anthony DeRossi @ 2022-11-05 22:44 UTC (permalink / raw)
  To: kvm; +Cc: alex.williamson, cohuck, jgg, kevin.tian, abhsahu, yishaih

The open count of a device set is the sum of the open counts of all
devices in the set. Drivers can use this value to determine whether
shared resources are in use without tracking them manually or accessing
the private open_count in vfio_device.

Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
---
 drivers/vfio/vfio_main.c | 11 +++++++++++
 include/linux/vfio.h     |  1 +
 2 files changed, 12 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 9a4af880e941..ab34faabcebb 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -125,6 +125,17 @@ static void vfio_release_device_set(struct vfio_device *device)
 	xa_unlock(&vfio_device_set_xa);
 }
 
+unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set)
+{
+	struct vfio_device *cur;
+	unsigned int open_count = 0;
+
+	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
+		open_count += cur->open_count;
+	return open_count;
+}
+EXPORT_SYMBOL_GPL(vfio_device_set_open_count);
+
 /*
  * Group objects - create, release, get, put, search
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index e7cebeb875dd..fdd393f70b19 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -189,6 +189,7 @@ int vfio_register_emulated_iommu_dev(struct vfio_device *device);
 void vfio_unregister_group_dev(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
+unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
 
 int vfio_mig_get_next_state(struct vfio_device *device,
 			    enum vfio_device_mig_state cur_fsm,
-- 
2.37.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v5 3/3] vfio/pci: Check the device set open count on reset
  2022-11-05 22:44 [PATCH v5 0/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
  2022-11-05 22:44 ` [PATCH v5 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
  2022-11-05 22:44 ` [PATCH v5 2/3] vfio: Export the device set open count Anthony DeRossi
@ 2022-11-05 22:44 ` Anthony DeRossi
  2022-11-09  0:53   ` Jason Gunthorpe
  2022-11-09  3:38   ` Tian, Kevin
  2 siblings, 2 replies; 12+ messages in thread
From: Anthony DeRossi @ 2022-11-05 22:44 UTC (permalink / raw)
  To: kvm; +Cc: alex.williamson, cohuck, jgg, kevin.tian, abhsahu, yishaih

vfio_pci_dev_set_needs_reset() inspects the open_count of every device
in the set to determine whether a reset is allowed. The current device
always has open_count == 1 within vfio_pci_core_disable(), effectively
disabling the reset logic. This field is also documented as private in
vfio_device, so it should not be used to determine whether other devices
in the set are open.

Checking for vfio_device_set_open_count() > 1 on the device set fixes
both issues.

After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
infrastructure"), failure to create a new file for a device would cause
the reset to be skipped due to open_count being decremented after
calling close_device() in the error path.

After commit eadd86f835c6 ("vfio: Remove calls to
vfio_group_add_container_user()"), releasing a device would always skip
the reset due to an ordering change in vfio_device_fops_release().

Failing to reset the device leaves it in an unknown state, potentially
causing errors when it is accessed later or bound to a different driver.

This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
assigned to a Windows guest. After shutting down the guest, unbinding
the device from vfio-pci, and binding the device to amdgpu:

[  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
[  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init

Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()")
Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index badc9d828cac..e030c2120183 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2488,12 +2488,12 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
 	struct vfio_pci_core_device *cur;
 	bool needs_reset = false;
 
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		/* No VFIO device in the set can have an open device FD */
-		if (cur->vdev.open_count)
-			return false;
+	/* No other VFIO device in the set can be open. */
+	if (vfio_device_set_open_count(dev_set) > 1)
+		return false;
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
 		needs_reset |= cur->needs_reset;
-	}
 	return needs_reset;
 }
 
-- 
2.37.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 2/3] vfio: Export the device set open count
  2022-11-05 22:44 ` [PATCH v5 2/3] vfio: Export the device set open count Anthony DeRossi
@ 2022-11-08 23:52   ` Alex Williamson
  2022-11-09  0:48   ` Jason Gunthorpe
  2022-11-09  3:37   ` Tian, Kevin
  2 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2022-11-08 23:52 UTC (permalink / raw)
  To: Anthony DeRossi; +Cc: kvm, cohuck, jgg, kevin.tian, abhsahu, yishaih

On Sat,  5 Nov 2022 15:44:57 -0700
Anthony DeRossi <ajderossi@gmail.com> wrote:

> The open count of a device set is the sum of the open counts of all
> devices in the set. Drivers can use this value to determine whether
> shared resources are in use without tracking them manually or accessing
> the private open_count in vfio_device.
> 
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> ---
>  drivers/vfio/vfio_main.c | 11 +++++++++++
>  include/linux/vfio.h     |  1 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9a4af880e941..ab34faabcebb 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -125,6 +125,17 @@ static void vfio_release_device_set(struct vfio_device *device)
>  	xa_unlock(&vfio_device_set_xa);
>  }
>  
> +unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set)
> +{
> +	struct vfio_device *cur;
> +	unsigned int open_count = 0;

This can only be called while holding the dev_set->lock, so we should
have an assert here:

	lockdep_assert_held(&dev_set->lock);

The series looks ok to me otherwise, hopefully we'll get some
additional reviews.  Thanks,

Alex

> +
> +	list_for_each_entry(cur, &dev_set->device_list, dev_set_list)
> +		open_count += cur->open_count;
> +	return open_count;
> +}
> +EXPORT_SYMBOL_GPL(vfio_device_set_open_count);
> +
>  /*
>   * Group objects - create, release, get, put, search
>   */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index e7cebeb875dd..fdd393f70b19 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -189,6 +189,7 @@ int vfio_register_emulated_iommu_dev(struct vfio_device *device);
>  void vfio_unregister_group_dev(struct vfio_device *device);
>  
>  int vfio_assign_device_set(struct vfio_device *device, void *set_id);
> +unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
>  
>  int vfio_mig_get_next_state(struct vfio_device *device,
>  			    enum vfio_device_mig_state cur_fsm,


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/3] vfio: Fix container device registration life cycle
  2022-11-05 22:44 ` [PATCH v5 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
@ 2022-11-09  0:43   ` Jason Gunthorpe
  2022-11-09  3:36   ` Tian, Kevin
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2022-11-09  0:43 UTC (permalink / raw)
  To: Anthony DeRossi
  Cc: kvm, alex.williamson, cohuck, kevin.tian, abhsahu, yishaih

On Sat, Nov 05, 2022 at 03:44:56PM -0700, Anthony DeRossi wrote:
> In vfio_device_open(), vfio_container_device_register() is always called
> when open_count == 1. On error, vfio_device_container_unregister() is
> only called when open_count == 1 and close_device is set. This leaks a
> registration for devices without a close_device implementation.
> 
> In vfio_device_fops_release(), vfio_device_container_unregister() is
> called unconditionally. This can cause a device to be unregistered
> multiple times.
> 
> Treating container device registration/unregistration uniformly (always
> when open_count == 1) fixes both issues.
> 
> Fixes: ce4b4657ff18 ("vfio: Replace the DMA unmapping notifier with a callback")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> ---
>  drivers/vfio/vfio_main.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)

This seems to only effect the mbochs sample, but it looks OK

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

This will generate conflicts with the iommufd treee, so please lets
think about how to avoid them..

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 2/3] vfio: Export the device set open count
  2022-11-05 22:44 ` [PATCH v5 2/3] vfio: Export the device set open count Anthony DeRossi
  2022-11-08 23:52   ` Alex Williamson
@ 2022-11-09  0:48   ` Jason Gunthorpe
  2022-11-09 16:04     ` Alex Williamson
  2022-11-09  3:37   ` Tian, Kevin
  2 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2022-11-09  0:48 UTC (permalink / raw)
  To: Anthony DeRossi
  Cc: kvm, alex.williamson, cohuck, kevin.tian, abhsahu, yishaih

On Sat, Nov 05, 2022 at 03:44:57PM -0700, Anthony DeRossi wrote:
> The open count of a device set is the sum of the open counts of all
> devices in the set. Drivers can use this value to determine whether
> shared resources are in use without tracking them manually or accessing
> the private open_count in vfio_device.
> 
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> ---
>  drivers/vfio/vfio_main.c | 11 +++++++++++
>  include/linux/vfio.h     |  1 +
>  2 files changed, 12 insertions(+)

>  
> +unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set)
> +{
> +	struct vfio_device *cur;
> +	unsigned int open_count = 0;

I'd probably just make this a bool

'vfio_device_set_last_close()'

And roll in the < 1 logic too

Nothing will ever need to know the number of fds open across the set.

But this is fine as written

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 3/3] vfio/pci: Check the device set open count on reset
  2022-11-05 22:44 ` [PATCH v5 3/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
@ 2022-11-09  0:53   ` Jason Gunthorpe
  2022-11-09  3:38   ` Tian, Kevin
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2022-11-09  0:53 UTC (permalink / raw)
  To: Anthony DeRossi
  Cc: kvm, alex.williamson, cohuck, kevin.tian, abhsahu, yishaih

On Sat, Nov 05, 2022 at 03:44:58PM -0700, Anthony DeRossi wrote:
> vfio_pci_dev_set_needs_reset() inspects the open_count of every device
> in the set to determine whether a reset is allowed. The current device
> always has open_count == 1 within vfio_pci_core_disable(), effectively
> disabling the reset logic. This field is also documented as private in
> vfio_device, so it should not be used to determine whether other devices
> in the set are open.
> 
> Checking for vfio_device_set_open_count() > 1 on the device set fixes
> both issues.
> 
> After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
> infrastructure"), failure to create a new file for a device would cause
> the reset to be skipped due to open_count being decremented after
> calling close_device() in the error path.
> 
> After commit eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()"), releasing a device would always skip
> the reset due to an ordering change in vfio_device_fops_release().
> 
> Failing to reset the device leaves it in an unknown state, potentially
> causing errors when it is accessed later or bound to a different driver.
> 
> This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
> assigned to a Windows guest. After shutting down the guest, unbinding
> the device from vfio-pci, and binding the device to amdgpu:
> 
> [  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
> [  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
> [  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
> [  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
> [  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
> 
> Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
> Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v5 1/3] vfio: Fix container device registration life cycle
  2022-11-05 22:44 ` [PATCH v5 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
  2022-11-09  0:43   ` Jason Gunthorpe
@ 2022-11-09  3:36   ` Tian, Kevin
  1 sibling, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2022-11-09  3:36 UTC (permalink / raw)
  To: Anthony DeRossi, kvm; +Cc: alex.williamson, cohuck, jgg, abhsahu, yishaih

> From: Anthony DeRossi <ajderossi@gmail.com>
> Sent: Sunday, November 6, 2022 6:45 AM
> 
> In vfio_device_open(), vfio_container_device_register() is always called

vfio_device_container_register()

> when open_count == 1. On error, vfio_device_container_unregister() is
> only called when open_count == 1 and close_device is set. This leaks a
> registration for devices without a close_device implementation.
> 
> In vfio_device_fops_release(), vfio_device_container_unregister() is
> called unconditionally. This can cause a device to be unregistered
> multiple times.
> 
> Treating container device registration/unregistration uniformly (always
> when open_count == 1) fixes both issues.
> 
> Fixes: ce4b4657ff18 ("vfio: Replace the DMA unmapping notifier with a
> callback")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v5 2/3] vfio: Export the device set open count
  2022-11-05 22:44 ` [PATCH v5 2/3] vfio: Export the device set open count Anthony DeRossi
  2022-11-08 23:52   ` Alex Williamson
  2022-11-09  0:48   ` Jason Gunthorpe
@ 2022-11-09  3:37   ` Tian, Kevin
  2 siblings, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2022-11-09  3:37 UTC (permalink / raw)
  To: Anthony DeRossi, kvm; +Cc: alex.williamson, cohuck, jgg, abhsahu, yishaih

> From: Anthony DeRossi <ajderossi@gmail.com>
> Sent: Sunday, November 6, 2022 6:45 AM
> 
> The open count of a device set is the sum of the open counts of all
> devices in the set. Drivers can use this value to determine whether
> shared resources are in use without tracking them manually or accessing
> the private open_count in vfio_device.
> 
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>

Apart from remarks from Alex/Jason, good to me:

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v5 3/3] vfio/pci: Check the device set open count on reset
  2022-11-05 22:44 ` [PATCH v5 3/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
  2022-11-09  0:53   ` Jason Gunthorpe
@ 2022-11-09  3:38   ` Tian, Kevin
  1 sibling, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2022-11-09  3:38 UTC (permalink / raw)
  To: Anthony DeRossi, kvm; +Cc: alex.williamson, cohuck, jgg, abhsahu, yishaih

> From: Anthony DeRossi <ajderossi@gmail.com>
> Sent: Sunday, November 6, 2022 6:45 AM
> 
> vfio_pci_dev_set_needs_reset() inspects the open_count of every device
> in the set to determine whether a reset is allowed. The current device
> always has open_count == 1 within vfio_pci_core_disable(), effectively
> disabling the reset logic. This field is also documented as private in
> vfio_device, so it should not be used to determine whether other devices
> in the set are open.
> 
> Checking for vfio_device_set_open_count() > 1 on the device set fixes
> both issues.
> 
> After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
> infrastructure"), failure to create a new file for a device would cause
> the reset to be skipped due to open_count being decremented after
> calling close_device() in the error path.
> 
> After commit eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()"), releasing a device would always skip
> the reset due to an ordering change in vfio_device_fops_release().
> 
> Failing to reset the device leaves it in an unknown state, potentially
> causing errors when it is accessed later or bound to a different driver.
> 
> This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
> assigned to a Windows guest. After shutting down the guest, unbinding
> the device from vfio-pci, and binding the device to amdgpu:
> 
> [  548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
> [  548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading
> failed
> [  548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init
> of IP block <psp> failed -22
> [  548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
> [  548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
> 
> Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
> Fixes: eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 2/3] vfio: Export the device set open count
  2022-11-09  0:48   ` Jason Gunthorpe
@ 2022-11-09 16:04     ` Alex Williamson
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2022-11-09 16:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Anthony DeRossi, kvm, cohuck, kevin.tian, abhsahu, yishaih

On Tue, 8 Nov 2022 20:48:08 -0400
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Sat, Nov 05, 2022 at 03:44:57PM -0700, Anthony DeRossi wrote:
> > The open count of a device set is the sum of the open counts of all
> > devices in the set. Drivers can use this value to determine whether
> > shared resources are in use without tracking them manually or accessing
> > the private open_count in vfio_device.
> > 
> > Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> > ---
> >  drivers/vfio/vfio_main.c | 11 +++++++++++
> >  include/linux/vfio.h     |  1 +
> >  2 files changed, 12 insertions(+)  
> 
> >  
> > +unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set)
> > +{
> > +	struct vfio_device *cur;
> > +	unsigned int open_count = 0;  
> 
> I'd probably just make this a bool
> 
> 'vfio_device_set_last_close()'
> 
> And roll in the < 1 logic too
> 
> Nothing will ever need to know the number of fds open across the set.

'last_close' presumes the caller though, which seems bad form.  It's
possible there are use cases for this in a 'first_open' scenario too.
Something along the lines of 'singleton_open', but that's a horrible
name, so we might as well just provide the count since we already have
it.  Thanks,

Alex

> But this is fine as written
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Jason
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-11-09 16:05 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-05 22:44 [PATCH v5 0/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
2022-11-05 22:44 ` [PATCH v5 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
2022-11-09  0:43   ` Jason Gunthorpe
2022-11-09  3:36   ` Tian, Kevin
2022-11-05 22:44 ` [PATCH v5 2/3] vfio: Export the device set open count Anthony DeRossi
2022-11-08 23:52   ` Alex Williamson
2022-11-09  0:48   ` Jason Gunthorpe
2022-11-09 16:04     ` Alex Williamson
2022-11-09  3:37   ` Tian, Kevin
2022-11-05 22:44 ` [PATCH v5 3/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
2022-11-09  0:53   ` Jason Gunthorpe
2022-11-09  3:38   ` Tian, Kevin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.