kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
@ 2022-09-23  0:06 Jason Gunthorpe
  2022-09-26 17:03 ` Matthew Rosato
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-09-23  0:06 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, kvm
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato

The iommu_group comes from the struct device that a driver has been bound
to and then created a struct vfio_device against. To keep the iommu layer
sane we want to have a simple rule that only an attached driver should be
using the iommu API. Particularly only an attached driver should hold
ownership.

In VFIO's case since it uses the group APIs and it shares between
different drivers it is a bit more complicated, but the principle still
holds.

Solve this by waiting for all users of the vfio_group to stop before
allowing vfio_unregister_group_dev() to complete. This is done with a new
completion to know when the users go away and an additional refcount to
keep track of how many device drivers are sharing the vfio group. The last
driver to be unregistered will clean up the group.

This solves crashes in the S390 iommu driver that come because VFIO ends
up racing releasing ownership (which attaches the default iommu_domain to
the device) with the removal of that same device from the iommu
driver. This is a side case that iommu drivers should not have to cope
with.

   iommu driver failed to attach the default/blocking domain
   WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
   Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
   CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
   Hardware name: IBM 3931 A01 782 (LPAR)
   Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
              R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
   Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
              00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
              00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
              00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
   Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
                          000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
                         #000000095bb10d24: af000000            mc      0,0
                         >000000095bb10d28: b904002a            lgr     %r2,%r10
                          000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
                          000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
                          000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
                          000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
   Call Trace:
    [<000000095bb10d28>] iommu_detach_group+0x70/0x80
   ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
    [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
    [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
    [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
   pci 0004:00:00.0: Removing from iommu group 4
    [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
    [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
    [<000000095be6c072>] system_call+0x82/0xb0
   Last Breaking-Event-Address:
    [<000000095be4bf80>] __warn_printk+0x60/0x68

It indicates that domain->ops->attach_dev() failed because the driver has
already passed the point of destructing the device.

Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/vfio.h      |  8 +++++
 drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
 2 files changed, 53 insertions(+), 23 deletions(-)

v2
 - Rebase on the vfio struct device series and the container.c series
 - Drop patches 1 & 2, we need to have working error unwind, so another
   test is not a problem
 - Fold iommu_group_remove_device() into vfio_device_remove_group() so
   that it forms a strict pairing with the two allocation functions.
 - Drop the iommu patch from the series, it needs more work and discussion
v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com

This could probably use another quick sanity test due to all the rebasing,
Alex if you are happy let's wait for Matthew.

diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 56fab31f8e0ff8..039e3208d286fa 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -41,7 +41,15 @@ enum vfio_group_type {
 struct vfio_group {
 	struct device 			dev;
 	struct cdev			cdev;
+	/*
+	 * When drivers is non-zero a driver is attached to the struct device
+	 * that provided the iommu_group and thus the iommu_group is a valid
+	 * pointer. When drivers is 0 the driver is being detached. Once users
+	 * reaches 0 then the iommu_group is invalid.
+	 */
+	refcount_t			drivers;
 	refcount_t			users;
+	struct completion		users_comp;
 	unsigned int			container_users;
 	struct iommu_group		*iommu_group;
 	struct vfio_container		*container;
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index af5945c71c4175..f19171cad9a25f 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -125,8 +125,6 @@ static void vfio_release_device_set(struct vfio_device *device)
 	xa_unlock(&vfio_device_set_xa);
 }
 
-static void vfio_group_get(struct vfio_group *group);
-
 /*
  * Group objects - create, release, get, put, search
  */
@@ -137,7 +135,7 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
-			vfio_group_get(group);
+			refcount_inc(&group->drivers);
 			return group;
 		}
 	}
@@ -189,6 +187,8 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
 	group->cdev.owner = THIS_MODULE;
 
 	refcount_set(&group->users, 1);
+	refcount_set(&group->drivers, 1);
+	init_completion(&group->users_comp);
 	init_rwsem(&group->group_rwsem);
 	INIT_LIST_HEAD(&group->device_list);
 	mutex_init(&group->device_lock);
@@ -247,8 +247,41 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 
 static void vfio_group_put(struct vfio_group *group)
 {
-	if (!refcount_dec_and_mutex_lock(&group->users, &vfio.group_lock))
+	if (refcount_dec_and_test(&group->users))
+		complete(&group->users_comp);
+}
+
+static void vfio_device_remove_group(struct vfio_device *device)
+{
+	struct vfio_group *group = device->group;
+
+	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
+		iommu_group_remove_device(device->dev);
+
+	/* Pairs with vfio_create_group() / vfio_group_get_from_iommu() */
+	if (!refcount_dec_and_mutex_lock(&group->drivers, &vfio.group_lock))
 		return;
+	list_del(&group->vfio_next);
+
+	/*
+	 * We could concurrently probe another driver in the group that might
+	 * race vfio_device_remove_group() with vfio_get_group(), so we have to
+	 * ensure that the sysfs is all cleaned up under lock otherwise the
+	 * cdev_device_add() will fail due to the name aready existing.
+	 */
+	cdev_device_del(&group->cdev, &group->dev);
+	mutex_unlock(&vfio.group_lock);
+
+	/* Matches the get from vfio_group_alloc() */
+	vfio_group_put(group);
+
+	/*
+	 * Before we allow the last driver in the group to be unplugged the
+	 * group must be sanitized so nothing else is or can reference it. This
+	 * is because the group->iommu_group pointer should only be used so long
+	 * as a device driver is attached to a device in the group.
+	 */
+	wait_for_completion(&group->users_comp);
 
 	/*
 	 * These data structures all have paired operations that can only be
@@ -259,19 +292,11 @@ static void vfio_group_put(struct vfio_group *group)
 	WARN_ON(!list_empty(&group->device_list));
 	WARN_ON(group->container || group->container_users);
 	WARN_ON(group->notifier.head);
-
-	list_del(&group->vfio_next);
-	cdev_device_del(&group->cdev, &group->dev);
-	mutex_unlock(&vfio.group_lock);
+	group->iommu_group = NULL;
 
 	put_device(&group->dev);
 }
 
-static void vfio_group_get(struct vfio_group *group)
-{
-	refcount_inc(&group->users);
-}
-
 /*
  * Device objects - create, release, get, put, search
  */
@@ -494,6 +519,10 @@ static int __vfio_register_dev(struct vfio_device *device,
 	struct vfio_device *existing_device;
 	int ret;
 
+	/*
+	 * In all cases group is the output of one of the group allocation
+	 * functions and we have group->drivers incremented for us.
+	 */
 	if (IS_ERR(group))
 		return PTR_ERR(group);
 
@@ -533,10 +562,7 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	return 0;
 err_out:
-	if (group->type == VFIO_NO_IOMMU ||
-	    group->type == VFIO_EMULATED_IOMMU)
-		iommu_group_remove_device(device->dev);
-	vfio_group_put(group);
+	vfio_device_remove_group(device);
 	return ret;
 }
 
@@ -627,11 +653,7 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	/* Balances device_add in register path */
 	device_del(&device->device);
 
-	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
-		iommu_group_remove_device(device->dev);
-
-	/* Matches the get in vfio_register_group_dev() */
-	vfio_group_put(group);
+	vfio_device_remove_group(device);
 }
 EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
 
@@ -884,7 +906,7 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 
 	down_write(&group->group_rwsem);
 
-	/* users can be zero if this races with vfio_group_put() */
+	/* users can be zero if this races with vfio_device_remove_group() */
 	if (!refcount_inc_not_zero(&group->users)) {
 		ret = -ENODEV;
 		goto err_unlock;

base-commit: 48a93f393ac698fedde0e63b8bb0b280d81d9021
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-23  0:06 [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group Jason Gunthorpe
@ 2022-09-26 17:03 ` Matthew Rosato
  2022-09-27 20:05   ` Alex Williamson
  2022-10-04 19:59   ` Matthew Rosato
  2022-09-27  6:34 ` Yi Liu
  2022-09-28  3:51 ` Tian, Kevin
  2 siblings, 2 replies; 30+ messages in thread
From: Matthew Rosato @ 2022-09-26 17:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, kvm
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski

On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
> The iommu_group comes from the struct device that a driver has been bound
> to and then created a struct vfio_device against. To keep the iommu layer
> sane we want to have a simple rule that only an attached driver should be
> using the iommu API. Particularly only an attached driver should hold
> ownership.
> 
> In VFIO's case since it uses the group APIs and it shares between
> different drivers it is a bit more complicated, but the principle still
> holds.
> 
> Solve this by waiting for all users of the vfio_group to stop before
> allowing vfio_unregister_group_dev() to complete. This is done with a new
> completion to know when the users go away and an additional refcount to
> keep track of how many device drivers are sharing the vfio group. The last
> driver to be unregistered will clean up the group.
> 
> This solves crashes in the S390 iommu driver that come because VFIO ends
> up racing releasing ownership (which attaches the default iommu_domain to
> the device) with the removal of that same device from the iommu
> driver. This is a side case that iommu drivers should not have to cope
> with.
> 
>    iommu driver failed to attach the default/blocking domain
>    WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>    Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>    CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>    Hardware name: IBM 3931 A01 782 (LPAR)
>    Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>    Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>               00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>               00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>               00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>    Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>                           000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>                          #000000095bb10d24: af000000            mc      0,0
>                          >000000095bb10d28: b904002a            lgr     %r2,%r10
>                           000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>                           000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>                           000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>                           000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>    Call Trace:
>     [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>    ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>     [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>     [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>     [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>    pci 0004:00:00.0: Removing from iommu group 4
>     [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>     [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>     [<000000095be6c072>] system_call+0x82/0xb0
>    Last Breaking-Event-Address:
>     [<000000095be4bf80>] __warn_printk+0x60/0x68
> 
> It indicates that domain->ops->attach_dev() failed because the driver has
> already passed the point of destructing the device.
> 
> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/vfio/vfio.h      |  8 +++++
>  drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>  2 files changed, 53 insertions(+), 23 deletions(-)
> 
> v2
>  - Rebase on the vfio struct device series and the container.c series
>  - Drop patches 1 & 2, we need to have working error unwind, so another
>    test is not a problem
>  - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>    that it forms a strict pairing with the two allocation functions.
>  - Drop the iommu patch from the series, it needs more work and discussion
> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> 
> This could probably use another quick sanity test due to all the rebasing,
> Alex if you are happy let's wait for Matthew.
> 

I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!

Matt


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-23  0:06 [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group Jason Gunthorpe
  2022-09-26 17:03 ` Matthew Rosato
@ 2022-09-27  6:34 ` Yi Liu
  2022-09-27 13:12   ` Jason Gunthorpe
  2022-09-28  3:51 ` Tian, Kevin
  2 siblings, 1 reply; 30+ messages in thread
From: Yi Liu @ 2022-09-27  6:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, kvm
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato

On 2022/9/23 08:06, Jason Gunthorpe wrote:
> The iommu_group comes from the struct device that a driver has been bound
> to and then created a struct vfio_device against. To keep the iommu layer
> sane we want to have a simple rule that only an attached driver should be
> using the iommu API. Particularly only an attached driver should hold
> ownership.
> 
> In VFIO's case since it uses the group APIs and it shares between
> different drivers it is a bit more complicated, but the principle still
> holds.
> 
> Solve this by waiting for all users of the vfio_group to stop before
> allowing vfio_unregister_group_dev() to complete. This is done with a new
> completion to know when the users go away and an additional refcount to
> keep track of how many device drivers are sharing the vfio group. The last
> driver to be unregistered will clean up the group.
> 
> This solves crashes in the S390 iommu driver that come because VFIO ends
> up racing releasing ownership (which attaches the default iommu_domain to
> the device) with the removal of that same device from the iommu
> driver. This is a side case that iommu drivers should not have to cope
> with.
> 
>     iommu driver failed to attach the default/blocking domain
>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>     Hardware name: IBM 3931 A01 782 (LPAR)
>     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>                R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>                00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>                00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>                00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>                            000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>                           #000000095bb10d24: af000000            mc      0,0
>                           >000000095bb10d28: b904002a            lgr     %r2,%r10
>                            000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>                            000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>                            000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>                            000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>     Call Trace:
>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>     pci 0004:00:00.0: Removing from iommu group 4
>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>      [<000000095be6c072>] system_call+0x82/0xb0
>     Last Breaking-Event-Address:
>      [<000000095be4bf80>] __warn_printk+0x60/0x68
> 
> It indicates that domain->ops->attach_dev() failed because the driver has
> already passed the point of destructing the device.
> 
> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/vfio/vfio.h      |  8 +++++
>   drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>   2 files changed, 53 insertions(+), 23 deletions(-)
> 
> v2
>   - Rebase on the vfio struct device series and the container.c series
>   - Drop patches 1 & 2, we need to have working error unwind, so another
>     test is not a problem
>   - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>     that it forms a strict pairing with the two allocation functions.
>   - Drop the iommu patch from the series, it needs more work and discussion
> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> 
> This could probably use another quick sanity test due to all the rebasing,
> Alex if you are happy let's wait for Matthew.
> 
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 56fab31f8e0ff8..039e3208d286fa 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -41,7 +41,15 @@ enum vfio_group_type {
>   struct vfio_group {
>   	struct device 			dev;
>   	struct cdev			cdev;
> +	/*
> +	 * When drivers is non-zero a driver is attached to the struct device
> +	 * that provided the iommu_group and thus the iommu_group is a valid
> +	 * pointer. When drivers is 0 the driver is being detached. Once users
> +	 * reaches 0 then the iommu_group is invalid.
> +	 */
> +	refcount_t			drivers;
>   	refcount_t			users;
> +	struct completion		users_comp;
>   	unsigned int			container_users;
>   	struct iommu_group		*iommu_group;
>   	struct vfio_container		*container;
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index af5945c71c4175..f19171cad9a25f 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -125,8 +125,6 @@ static void vfio_release_device_set(struct vfio_device *device)
>   	xa_unlock(&vfio_device_set_xa);
>   }
>   
> -static void vfio_group_get(struct vfio_group *group);
> -
>   /*
>    * Group objects - create, release, get, put, search
>    */
> @@ -137,7 +135,7 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>   
>   	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>   		if (group->iommu_group == iommu_group) {
> -			vfio_group_get(group);
> +			refcount_inc(&group->drivers);

so the __vfio_group_get_from_iommu() can only be used in the vfio device 
registration path. right? If used by other path, then group->drivers cnt
is not correct. may valuable to have a comment although no such usage in
existing code.

otherwise, this patch looks good to me.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

>   			return group;
>   		}
>   	}
> @@ -189,6 +187,8 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
>   	group->cdev.owner = THIS_MODULE;
>   
>   	refcount_set(&group->users, 1);
> +	refcount_set(&group->drivers, 1);
> +	init_completion(&group->users_comp);
>   	init_rwsem(&group->group_rwsem);
>   	INIT_LIST_HEAD(&group->device_list);
>   	mutex_init(&group->device_lock);
> @@ -247,8 +247,41 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>   
>   static void vfio_group_put(struct vfio_group *group)
>   {
> -	if (!refcount_dec_and_mutex_lock(&group->users, &vfio.group_lock))
> +	if (refcount_dec_and_test(&group->users))
> +		complete(&group->users_comp);
> +}
> +
> +static void vfio_device_remove_group(struct vfio_device *device)
> +{
> +	struct vfio_group *group = device->group;
> +
> +	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
> +		iommu_group_remove_device(device->dev);
> +
> +	/* Pairs with vfio_create_group() / vfio_group_get_from_iommu() */
> +	if (!refcount_dec_and_mutex_lock(&group->drivers, &vfio.group_lock))
>   		return;
> +	list_del(&group->vfio_next);
> +
> +	/*
> +	 * We could concurrently probe another driver in the group that might
> +	 * race vfio_device_remove_group() with vfio_get_group(), so we have to
> +	 * ensure that the sysfs is all cleaned up under lock otherwise the
> +	 * cdev_device_add() will fail due to the name aready existing.
> +	 */
> +	cdev_device_del(&group->cdev, &group->dev);
> +	mutex_unlock(&vfio.group_lock);
> +
> +	/* Matches the get from vfio_group_alloc() */
> +	vfio_group_put(group);
> +
> +	/*
> +	 * Before we allow the last driver in the group to be unplugged the
> +	 * group must be sanitized so nothing else is or can reference it. This
> +	 * is because the group->iommu_group pointer should only be used so long
> +	 * as a device driver is attached to a device in the group.
> +	 */
> +	wait_for_completion(&group->users_comp);
>   
>   	/*
>   	 * These data structures all have paired operations that can only be
> @@ -259,19 +292,11 @@ static void vfio_group_put(struct vfio_group *group)
>   	WARN_ON(!list_empty(&group->device_list));
>   	WARN_ON(group->container || group->container_users);
>   	WARN_ON(group->notifier.head);
> -
> -	list_del(&group->vfio_next);
> -	cdev_device_del(&group->cdev, &group->dev);
> -	mutex_unlock(&vfio.group_lock);
> +	group->iommu_group = NULL;
>   
>   	put_device(&group->dev);
>   }
>   
> -static void vfio_group_get(struct vfio_group *group)
> -{
> -	refcount_inc(&group->users);
> -}
> -
>   /*
>    * Device objects - create, release, get, put, search
>    */
> @@ -494,6 +519,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>   	struct vfio_device *existing_device;
>   	int ret;
>   
> +	/*
> +	 * In all cases group is the output of one of the group allocation
> +	 * functions and we have group->drivers incremented for us.
> +	 */
>   	if (IS_ERR(group))
>   		return PTR_ERR(group);
>   
> @@ -533,10 +562,7 @@ static int __vfio_register_dev(struct vfio_device *device,
>   
>   	return 0;
>   err_out:
> -	if (group->type == VFIO_NO_IOMMU ||
> -	    group->type == VFIO_EMULATED_IOMMU)
> -		iommu_group_remove_device(device->dev);
> -	vfio_group_put(group);
> +	vfio_device_remove_group(device);
>   	return ret;
>   }
>   
> @@ -627,11 +653,7 @@ void vfio_unregister_group_dev(struct vfio_device *device)
>   	/* Balances device_add in register path */
>   	device_del(&device->device);
>   
> -	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
> -		iommu_group_remove_device(device->dev);
> -
> -	/* Matches the get in vfio_register_group_dev() */
> -	vfio_group_put(group);
> +	vfio_device_remove_group(device);
>   }
>   EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
>   
> @@ -884,7 +906,7 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
>   
>   	down_write(&group->group_rwsem);
>   
> -	/* users can be zero if this races with vfio_group_put() */
> +	/* users can be zero if this races with vfio_device_remove_group() */
>   	if (!refcount_inc_not_zero(&group->users)) {
>   		ret = -ENODEV;
>   		goto err_unlock;
> 
> base-commit: 48a93f393ac698fedde0e63b8bb0b280d81d9021

-- 
Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-27  6:34 ` Yi Liu
@ 2022-09-27 13:12   ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-09-27 13:12 UTC (permalink / raw)
  To: Yi Liu
  Cc: Alex Williamson, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, Matthew Rosato

On Tue, Sep 27, 2022 at 02:34:43PM +0800, Yi Liu wrote:
> > @@ -137,7 +135,7 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
> >   	list_for_each_entry(group, &vfio.group_list, vfio_next) {
> >   		if (group->iommu_group == iommu_group) {
> > -			vfio_group_get(group);
> > +			refcount_inc(&group->drivers);
> 
> so the __vfio_group_get_from_iommu() can only be used in the vfio device
> registration path. right? If used by other path, then group->drivers cnt
> is not correct.

Yes, thats right - more specifically vfio_device_remove_group() has to
pair with the two allocation functions.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-26 17:03 ` Matthew Rosato
@ 2022-09-27 20:05   ` Alex Williamson
  2022-10-04 15:19     ` Christian Borntraeger
  2022-10-04 19:59   ` Matthew Rosato
  1 sibling, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2022-09-27 20:05 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Jason Gunthorpe, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski

On Mon, 26 Sep 2022 13:03:56 -0400
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
> > The iommu_group comes from the struct device that a driver has been bound
> > to and then created a struct vfio_device against. To keep the iommu layer
> > sane we want to have a simple rule that only an attached driver should be
> > using the iommu API. Particularly only an attached driver should hold
> > ownership.
> > 
> > In VFIO's case since it uses the group APIs and it shares between
> > different drivers it is a bit more complicated, but the principle still
> > holds.
> > 
> > Solve this by waiting for all users of the vfio_group to stop before
> > allowing vfio_unregister_group_dev() to complete. This is done with a new
> > completion to know when the users go away and an additional refcount to
> > keep track of how many device drivers are sharing the vfio group. The last
> > driver to be unregistered will clean up the group.
> > 
> > This solves crashes in the S390 iommu driver that come because VFIO ends
> > up racing releasing ownership (which attaches the default iommu_domain to
> > the device) with the removal of that same device from the iommu
> > driver. This is a side case that iommu drivers should not have to cope
> > with.
> > 
> >    iommu driver failed to attach the default/blocking domain
> >    WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
> >    Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
> >    CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
> >    Hardware name: IBM 3931 A01 782 (LPAR)
> >    Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
> >               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> >    Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
> >               00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
> >               00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
> >               00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
> >    Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
> >                           000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
> >                          #000000095bb10d24: af000000            mc      0,0  
> >                          >000000095bb10d28: b904002a            lgr     %r2,%r10  
> >                           000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
> >                           000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
> >                           000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
> >                           000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
> >    Call Trace:
> >     [<000000095bb10d28>] iommu_detach_group+0x70/0x80
> >    ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
> >     [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
> >     [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
> >     [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
> >    pci 0004:00:00.0: Removing from iommu group 4
> >     [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
> >     [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
> >     [<000000095be6c072>] system_call+0x82/0xb0
> >    Last Breaking-Event-Address:
> >     [<000000095be4bf80>] __warn_printk+0x60/0x68
> > 
> > It indicates that domain->ops->attach_dev() failed because the driver has
> > already passed the point of destructing the device.
> > 
> > Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> > Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> >  drivers/vfio/vfio.h      |  8 +++++
> >  drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
> >  2 files changed, 53 insertions(+), 23 deletions(-)
> > 
> > v2
> >  - Rebase on the vfio struct device series and the container.c series
> >  - Drop patches 1 & 2, we need to have working error unwind, so another
> >    test is not a problem
> >  - Fold iommu_group_remove_device() into vfio_device_remove_group() so
> >    that it forms a strict pairing with the two allocation functions.
> >  - Drop the iommu patch from the series, it needs more work and discussion
> > v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> > 
> > This could probably use another quick sanity test due to all the rebasing,
> > Alex if you are happy let's wait for Matthew.
> >   
> 
> I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!

Thanks all.  Applied to vfio next branch for v6.1.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-23  0:06 [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group Jason Gunthorpe
  2022-09-26 17:03 ` Matthew Rosato
  2022-09-27  6:34 ` Yi Liu
@ 2022-09-28  3:51 ` Tian, Kevin
  2022-09-28 15:26   ` Jason Gunthorpe
  2 siblings, 1 reply; 30+ messages in thread
From: Tian, Kevin @ 2022-09-28  3:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, kvm
  Cc: Qian Cai, Rodel, Jorg, Marek Szyprowski, Matthew Rosato

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, September 23, 2022 8:06 AM
>
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 56fab31f8e0ff8..039e3208d286fa 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -41,7 +41,15 @@ enum vfio_group_type {
>  struct vfio_group {
>  	struct device 			dev;
>  	struct cdev			cdev;
> +	/*
> +	 * When drivers is non-zero a driver is attached to the struct device
> +	 * that provided the iommu_group and thus the iommu_group is a
> valid
> +	 * pointer. When drivers is 0 the driver is being detached. Once users
> +	 * reaches 0 then the iommu_group is invalid.
> +	 */
> +	refcount_t			drivers;

While I agree all this patch is doing, the notation of 'drivers' here sounds
a bit confusing IMHO. Given all the paths where 'drivers' are populated
are related to device registration/unregistration, isn't it clearer to rename
it as 'devices' and make it clear that iommu_group is invalid once the last
device is unregistered from the group? This kind of makes sense to me
even without talking about the aspect of driver attach/detach...

>  	refcount_t			users;
> +	struct completion		users_comp;

Now the only place poking 'users' is when a group is opened/closed,
while group->opened_file already plays the guard role. From this
angle 'users' sounds redundant now?


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-28  3:51 ` Tian, Kevin
@ 2022-09-28 15:26   ` Jason Gunthorpe
  2022-09-28 23:54     ` Tian, Kevin
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2022-09-28 15:26 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Cornelia Huck, kvm, Qian Cai, Rodel, Jorg,
	Marek Szyprowski, Matthew Rosato

On Wed, Sep 28, 2022 at 03:51:01AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, September 23, 2022 8:06 AM
> >
> > diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> > index 56fab31f8e0ff8..039e3208d286fa 100644
> > --- a/drivers/vfio/vfio.h
> > +++ b/drivers/vfio/vfio.h
> > @@ -41,7 +41,15 @@ enum vfio_group_type {
> >  struct vfio_group {
> >  	struct device 			dev;
> >  	struct cdev			cdev;
> > +	/*
> > +	 * When drivers is non-zero a driver is attached to the struct device
> > +	 * that provided the iommu_group and thus the iommu_group is a
> > valid
> > +	 * pointer. When drivers is 0 the driver is being detached. Once users
> > +	 * reaches 0 then the iommu_group is invalid.
> > +	 */
> > +	refcount_t			drivers;
> 
> While I agree all this patch is doing, the notation of 'drivers' here sounds
> a bit confusing IMHO.

Maybe, I picked it because we recently had a num_devices here that was
a different thing. "drivers" comes from the idea that it is the number
of drivers that have called 'register' on the group. This also happens
to be the number of vfio_devices of course.

> >  	refcount_t			users;
> > +	struct completion		users_comp;
> 
> Now the only place poking 'users' is when a group is opened/closed,
> while group->opened_file already plays the guard role. From this
> angle 'users' sounds redundant now?

Oh interesting. I did try to get rid of that thing, but I was thinking
to make it "disassociate" so we didn't have to sleep at all, but SPAPR
messed that up.. It is a good followup patch

So like this:

diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 039e3208d286fa..78b362a9250113 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -48,8 +48,6 @@ struct vfio_group {
 	 * reaches 0 then the iommu_group is invalid.
 	 */
 	refcount_t			drivers;
-	refcount_t			users;
-	struct completion		users_comp;
 	unsigned int			container_users;
 	struct iommu_group		*iommu_group;
 	struct vfio_container		*container;
@@ -61,6 +59,7 @@ struct vfio_group {
 	struct rw_semaphore		group_rwsem;
 	struct kvm			*kvm;
 	struct file			*opened_file;
+	struct swait_queue_head		opened_file_wait;
 	struct blocking_notifier_head	notifier;
 };
 
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index f19171cad9a25f..57a7576a96a61b 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -186,10 +186,9 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
 	cdev_init(&group->cdev, &vfio_group_fops);
 	group->cdev.owner = THIS_MODULE;
 
-	refcount_set(&group->users, 1);
 	refcount_set(&group->drivers, 1);
-	init_completion(&group->users_comp);
 	init_rwsem(&group->group_rwsem);
+	init_swait_queue_head(&group->opened_file_wait);
 	INIT_LIST_HEAD(&group->device_list);
 	mutex_init(&group->device_lock);
 	group->iommu_group = iommu_group;
@@ -245,12 +244,6 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 	return ret;
 }
 
-static void vfio_group_put(struct vfio_group *group)
-{
-	if (refcount_dec_and_test(&group->users))
-		complete(&group->users_comp);
-}
-
 static void vfio_device_remove_group(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
@@ -270,10 +263,6 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	 * cdev_device_add() will fail due to the name aready existing.
 	 */
 	cdev_device_del(&group->cdev, &group->dev);
-	mutex_unlock(&vfio.group_lock);
-
-	/* Matches the get from vfio_group_alloc() */
-	vfio_group_put(group);
 
 	/*
 	 * Before we allow the last driver in the group to be unplugged the
@@ -281,7 +270,13 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	 * is because the group->iommu_group pointer should only be used so long
 	 * as a device driver is attached to a device in the group.
 	 */
-	wait_for_completion(&group->users_comp);
+	while (group->opened_file) {
+		mutex_unlock(&vfio.group_lock);
+		swait_event_idle_exclusive(group->opened_file_wait,
+					   !group->opened_file);
+		mutex_lock(&vfio.group_lock);
+	}
+	mutex_unlock(&vfio.group_lock);
 
 	/*
 	 * These data structures all have paired operations that can only be
@@ -906,15 +901,18 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 
 	down_write(&group->group_rwsem);
 
-	/* users can be zero if this races with vfio_device_remove_group() */
-	if (!refcount_inc_not_zero(&group->users)) {
+	/*
+	 * drivers can be zero if this races with vfio_device_remove_group(), it
+	 * will be stable at 0 under the group rwsem
+	 */
+	if (refcount_read(&group->drivers) == 0) {
 		ret = -ENODEV;
-		goto err_unlock;
+		goto out_unlock;
 	}
 
 	if (group->type == VFIO_NO_IOMMU && !capable(CAP_SYS_RAWIO)) {
 		ret = -EPERM;
-		goto err_put;
+		goto out_unlock;
 	}
 
 	/*
@@ -922,16 +920,12 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 	 */
 	if (group->opened_file) {
 		ret = -EBUSY;
-		goto err_put;
+		goto out_unlock;
 	}
 	group->opened_file = filep;
 	filep->private_data = group;
-
-	up_write(&group->group_rwsem);
-	return 0;
-err_put:
-	vfio_group_put(group);
-err_unlock:
+	ret = 0;
+out_unlock:
 	up_write(&group->group_rwsem);
 	return ret;
 }
@@ -952,8 +946,7 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	up_write(&group->group_rwsem);
-
-	vfio_group_put(group);
+	swake_up_one(&group->opened_file_wait);
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* RE: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-28 15:26   ` Jason Gunthorpe
@ 2022-09-28 23:54     ` Tian, Kevin
  0 siblings, 0 replies; 30+ messages in thread
From: Tian, Kevin @ 2022-09-28 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, kvm, Qian Cai, Rodel, Jorg,
	Marek Szyprowski, Matthew Rosato

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 28, 2022 11:26 PM
> 
> On Wed, Sep 28, 2022 at 03:51:01AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, September 23, 2022 8:06 AM
> > >
> > > diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> > > index 56fab31f8e0ff8..039e3208d286fa 100644
> > > --- a/drivers/vfio/vfio.h
> > > +++ b/drivers/vfio/vfio.h
> > > @@ -41,7 +41,15 @@ enum vfio_group_type {
> > >  struct vfio_group {
> > >  	struct device 			dev;
> > >  	struct cdev			cdev;
> > > +	/*
> > > +	 * When drivers is non-zero a driver is attached to the struct device
> > > +	 * that provided the iommu_group and thus the iommu_group is a
> > > valid
> > > +	 * pointer. When drivers is 0 the driver is being detached. Once users
> > > +	 * reaches 0 then the iommu_group is invalid.
> > > +	 */
> > > +	refcount_t			drivers;
> >
> > While I agree all this patch is doing, the notation of 'drivers' here sounds
> > a bit confusing IMHO.
> 
> Maybe, I picked it because we recently had a num_devices here that was
> a different thing. "drivers" comes from the idea that it is the number

num_devices has been removed by one of your patches.

> of drivers that have called 'register' on the group. This also happens
> to be the number of vfio_devices of course.

there could be one driver calling 'register' on multiple groups. That is
the part which gets confusing. If talking about driver it is more accurately
to be 'driver_registrations' then the implication is still about device. 😊

> 
> > >  	refcount_t			users;
> > > +	struct completion		users_comp;
> >
> > Now the only place poking 'users' is when a group is opened/closed,
> > while group->opened_file already plays the guard role. From this
> > angle 'users' sounds redundant now?
> 
> Oh interesting. I did try to get rid of that thing, but I was thinking
> to make it "disassociate" so we didn't have to sleep at all, but SPAPR
> messed that up.. It is a good followup patch

a followup patch like below sounds good

> 
> So like this:
> 
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 039e3208d286fa..78b362a9250113 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -48,8 +48,6 @@ struct vfio_group {
>  	 * reaches 0 then the iommu_group is invalid.
>  	 */
>  	refcount_t			drivers;
> -	refcount_t			users;
> -	struct completion		users_comp;
>  	unsigned int			container_users;
>  	struct iommu_group		*iommu_group;
>  	struct vfio_container		*container;
> @@ -61,6 +59,7 @@ struct vfio_group {
>  	struct rw_semaphore		group_rwsem;
>  	struct kvm			*kvm;
>  	struct file			*opened_file;
> +	struct swait_queue_head		opened_file_wait;
>  	struct blocking_notifier_head	notifier;
>  };
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index f19171cad9a25f..57a7576a96a61b 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -186,10 +186,9 @@ static struct vfio_group *vfio_group_alloc(struct
> iommu_group *iommu_group,
>  	cdev_init(&group->cdev, &vfio_group_fops);
>  	group->cdev.owner = THIS_MODULE;
> 
> -	refcount_set(&group->users, 1);
>  	refcount_set(&group->drivers, 1);
> -	init_completion(&group->users_comp);
>  	init_rwsem(&group->group_rwsem);
> +	init_swait_queue_head(&group->opened_file_wait);
>  	INIT_LIST_HEAD(&group->device_list);
>  	mutex_init(&group->device_lock);
>  	group->iommu_group = iommu_group;
> @@ -245,12 +244,6 @@ static struct vfio_group *vfio_create_group(struct
> iommu_group *iommu_group,
>  	return ret;
>  }
> 
> -static void vfio_group_put(struct vfio_group *group)
> -{
> -	if (refcount_dec_and_test(&group->users))
> -		complete(&group->users_comp);
> -}
> -
>  static void vfio_device_remove_group(struct vfio_device *device)
>  {
>  	struct vfio_group *group = device->group;
> @@ -270,10 +263,6 @@ static void vfio_device_remove_group(struct
> vfio_device *device)
>  	 * cdev_device_add() will fail due to the name aready existing.
>  	 */
>  	cdev_device_del(&group->cdev, &group->dev);
> -	mutex_unlock(&vfio.group_lock);
> -
> -	/* Matches the get from vfio_group_alloc() */
> -	vfio_group_put(group);
> 
>  	/*
>  	 * Before we allow the last driver in the group to be unplugged the
> @@ -281,7 +270,13 @@ static void vfio_device_remove_group(struct
> vfio_device *device)
>  	 * is because the group->iommu_group pointer should only be used
> so long
>  	 * as a device driver is attached to a device in the group.
>  	 */
> -	wait_for_completion(&group->users_comp);
> +	while (group->opened_file) {
> +		mutex_unlock(&vfio.group_lock);
> +		swait_event_idle_exclusive(group->opened_file_wait,
> +					   !group->opened_file);
> +		mutex_lock(&vfio.group_lock);
> +	}
> +	mutex_unlock(&vfio.group_lock);
> 
>  	/*
>  	 * These data structures all have paired operations that can only be
> @@ -906,15 +901,18 @@ static int vfio_group_fops_open(struct inode
> *inode, struct file *filep)
> 
>  	down_write(&group->group_rwsem);
> 
> -	/* users can be zero if this races with vfio_device_remove_group() */
> -	if (!refcount_inc_not_zero(&group->users)) {
> +	/*
> +	 * drivers can be zero if this races with vfio_device_remove_group(),
> it
> +	 * will be stable at 0 under the group rwsem
> +	 */
> +	if (refcount_read(&group->drivers) == 0) {
>  		ret = -ENODEV;
> -		goto err_unlock;
> +		goto out_unlock;
>  	}
> 
>  	if (group->type == VFIO_NO_IOMMU && !capable(CAP_SYS_RAWIO))
> {
>  		ret = -EPERM;
> -		goto err_put;
> +		goto out_unlock;
>  	}
> 
>  	/*
> @@ -922,16 +920,12 @@ static int vfio_group_fops_open(struct inode
> *inode, struct file *filep)
>  	 */
>  	if (group->opened_file) {
>  		ret = -EBUSY;
> -		goto err_put;
> +		goto out_unlock;
>  	}
>  	group->opened_file = filep;
>  	filep->private_data = group;
> -
> -	up_write(&group->group_rwsem);
> -	return 0;
> -err_put:
> -	vfio_group_put(group);
> -err_unlock:
> +	ret = 0;
> +out_unlock:
>  	up_write(&group->group_rwsem);
>  	return ret;
>  }
> @@ -952,8 +946,7 @@ static int vfio_group_fops_release(struct inode
> *inode, struct file *filep)
>  		vfio_group_detach_container(group);
>  	group->opened_file = NULL;
>  	up_write(&group->group_rwsem);
> -
> -	vfio_group_put(group);
> +	swake_up_one(&group->opened_file_wait);
> 
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-27 20:05   ` Alex Williamson
@ 2022-10-04 15:19     ` Christian Borntraeger
  2022-10-04 15:40       ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-04 15:19 UTC (permalink / raw)
  To: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman
  Cc: Jason Gunthorpe, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390

Am 27.09.22 um 22:05 schrieb Alex Williamson:
> On Mon, 26 Sep 2022 13:03:56 -0400
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
>> On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
>>> The iommu_group comes from the struct device that a driver has been bound
>>> to and then created a struct vfio_device against. To keep the iommu layer
>>> sane we want to have a simple rule that only an attached driver should be
>>> using the iommu API. Particularly only an attached driver should hold
>>> ownership.
>>>
>>> In VFIO's case since it uses the group APIs and it shares between
>>> different drivers it is a bit more complicated, but the principle still
>>> holds.
>>>
>>> Solve this by waiting for all users of the vfio_group to stop before
>>> allowing vfio_unregister_group_dev() to complete. This is done with a new
>>> completion to know when the users go away and an additional refcount to
>>> keep track of how many device drivers are sharing the vfio group. The last
>>> driver to be unregistered will clean up the group.
>>>
>>> This solves crashes in the S390 iommu driver that come because VFIO ends
>>> up racing releasing ownership (which attaches the default iommu_domain to
>>> the device) with the removal of that same device from the iommu
>>> driver. This is a side case that iommu drivers should not have to cope
>>> with.
>>>
>>>     iommu driver failed to attach the default/blocking domain
>>>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>>>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>>>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>>>     Hardware name: IBM 3931 A01 782 (LPAR)
>>>     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>>>                R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>>>                00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>>>                00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>>>                00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>>>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>>>                            000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>>>                           #000000095bb10d24: af000000            mc      0,0
>>>                           >000000095bb10d28: b904002a            lgr     %r2,%r10
>>>                            000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>>>                            000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>>>                            000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>>>                            000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>>>     Call Trace:
>>>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>>>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>>     pci 0004:00:00.0: Removing from iommu group 4
>>>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>>      [<000000095be6c072>] system_call+0x82/0xb0
>>>     Last Breaking-Event-Address:
>>>      [<000000095be4bf80>] __warn_printk+0x60/0x68
>>>
>>> It indicates that domain->ops->attach_dev() failed because the driver has
>>> already passed the point of destructing the device.
>>>
>>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>>   drivers/vfio/vfio.h      |  8 +++++
>>>   drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>>>   2 files changed, 53 insertions(+), 23 deletions(-)
>>>
>>> v2
>>>   - Rebase on the vfio struct device series and the container.c series
>>>   - Drop patches 1 & 2, we need to have working error unwind, so another
>>>     test is not a problem
>>>   - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>>>     that it forms a strict pairing with the two allocation functions.
>>>   - Drop the iommu patch from the series, it needs more work and discussion
>>> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
>>>
>>> This could probably use another quick sanity test due to all the rebasing,
>>> Alex if you are happy let's wait for Matthew.
>>>    
>>
>> I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
> 
> Thanks all.  Applied to vfio next branch for v6.1.  Thanks,

So now I have bisected this to a regression in our KVM CI for vfio-ap. Our testcase MultipleMdevAPMatrixTestCase hangs forever.
I see  virtnodedevd spinning 100% and "mdevctl stop --uuid=d70d7685-a1b5-47a1-bdea-336925e0a95d" seems to wait for something:

[  186.815543] task:mdevctl         state:D stack:    0 pid: 1639 ppid:  1604 flags:0x00000001
[  186.815546] Call Trace:
[  186.815547]  [<0000002baf277386>] __schedule+0x296/0x650
[  186.815549]  [<0000002baf2777a2>] schedule+0x62/0x108
[  186.815551]  [<0000002baf27db20>] schedule_timeout+0xc0/0x108
[  186.815553]  [<0000002baf278166>] __wait_for_common+0xc6/0x250
[  186.815556]  [<000003ff800c263a>] vfio_device_remove_group.isra.0+0xb2/0x118 [vfio]
[  186.815561]  [<000003ff805caadc>] vfio_ap_mdev_remove+0x2c/0x198 [vfio_ap]
[  186.815565]  [<0000002baef1d4de>] device_release_driver_internal+0x1c6/0x288
[  186.815570]  [<0000002baef1b27c>] bus_remove_device+0x10c/0x198
[  186.815572]  [<0000002baef14b54>] device_del+0x19c/0x3e0
[  186.815575]  [<000003ff800d9e3a>] mdev_device_remove+0xb2/0x108 [mdev]
[  186.815579]  [<000003ff800da096>] remove_store+0x7e/0x90 [mdev]
[  186.815581]  [<0000002baea53c30>] kernfs_fop_write_iter+0x138/0x210
[  186.815586]  [<0000002bae98e310>] vfs_write+0x1a0/0x2f0
[  186.815588]  [<0000002bae98e6d8>] ksys_write+0x70/0x100
[  186.815590]  [<0000002baf26fe2c>] __do_syscall+0x1d4/0x200
[  186.815593]  [<0000002baf27eb42>] system_call+0x82/0xb0


Any quick idea?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 15:19     ` Christian Borntraeger
@ 2022-10-04 15:40       ` Jason Gunthorpe
  2022-10-04 15:44         ` Christian Borntraeger
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 15:40 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390

On Tue, Oct 04, 2022 at 05:19:07PM +0200, Christian Borntraeger wrote:
> Am 27.09.22 um 22:05 schrieb Alex Williamson:
> > On Mon, 26 Sep 2022 13:03:56 -0400
> > Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> > 
> > > On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
> > > > The iommu_group comes from the struct device that a driver has been bound
> > > > to and then created a struct vfio_device against. To keep the iommu layer
> > > > sane we want to have a simple rule that only an attached driver should be
> > > > using the iommu API. Particularly only an attached driver should hold
> > > > ownership.
> > > > 
> > > > In VFIO's case since it uses the group APIs and it shares between
> > > > different drivers it is a bit more complicated, but the principle still
> > > > holds.
> > > > 
> > > > Solve this by waiting for all users of the vfio_group to stop before
> > > > allowing vfio_unregister_group_dev() to complete. This is done with a new
> > > > completion to know when the users go away and an additional refcount to
> > > > keep track of how many device drivers are sharing the vfio group. The last
> > > > driver to be unregistered will clean up the group.
> > > > 
> > > > This solves crashes in the S390 iommu driver that come because VFIO ends
> > > > up racing releasing ownership (which attaches the default iommu_domain to
> > > > the device) with the removal of that same device from the iommu
> > > > driver. This is a side case that iommu drivers should not have to cope
> > > > with.
> > > > 
> > > >     iommu driver failed to attach the default/blocking domain
> > > >     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
> > > >     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
> > > >     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
> > > >     Hardware name: IBM 3931 A01 782 (LPAR)
> > > >     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
> > > >                R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> > > >     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
> > > >                00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
> > > >                00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
> > > >                00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
> > > >     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
> > > >                            000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
> > > >                           #000000095bb10d24: af000000            mc      0,0
> > > >                           >000000095bb10d28: b904002a            lgr     %r2,%r10
> > > >                            000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
> > > >                            000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
> > > >                            000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
> > > >                            000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
> > > >     Call Trace:
> > > >      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
> > > >     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
> > > >      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
> > > >      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
> > > >      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
> > > >     pci 0004:00:00.0: Removing from iommu group 4
> > > >      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
> > > >      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
> > > >      [<000000095be6c072>] system_call+0x82/0xb0
> > > >     Last Breaking-Event-Address:
> > > >      [<000000095be4bf80>] __warn_printk+0x60/0x68
> > > > 
> > > > It indicates that domain->ops->attach_dev() failed because the driver has
> > > > already passed the point of destructing the device.
> > > > 
> > > > Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> > > > Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > > > Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > ---
> > > >   drivers/vfio/vfio.h      |  8 +++++
> > > >   drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
> > > >   2 files changed, 53 insertions(+), 23 deletions(-)
> > > > 
> > > > v2
> > > >   - Rebase on the vfio struct device series and the container.c series
> > > >   - Drop patches 1 & 2, we need to have working error unwind, so another
> > > >     test is not a problem
> > > >   - Fold iommu_group_remove_device() into vfio_device_remove_group() so
> > > >     that it forms a strict pairing with the two allocation functions.
> > > >   - Drop the iommu patch from the series, it needs more work and discussion
> > > > v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> > > > 
> > > > This could probably use another quick sanity test due to all the rebasing,
> > > > Alex if you are happy let's wait for Matthew.
> > > 
> > > I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
> > 
> > Thanks all.  Applied to vfio next branch for v6.1.  Thanks,
> 
> So now I have bisected this to a regression in our KVM CI for vfio-ap. Our testcase MultipleMdevAPMatrixTestCase hangs forever.
> I see  virtnodedevd spinning 100% and "mdevctl stop --uuid=d70d7685-a1b5-47a1-bdea-336925e0a95d" seems to wait for something:
> 
> [  186.815543] task:mdevctl         state:D stack:    0 pid: 1639 ppid:  1604 flags:0x00000001
> [  186.815546] Call Trace:
> [  186.815547]  [<0000002baf277386>] __schedule+0x296/0x650
> [  186.815549]  [<0000002baf2777a2>] schedule+0x62/0x108
> [  186.815551]  [<0000002baf27db20>] schedule_timeout+0xc0/0x108
> [  186.815553]  [<0000002baf278166>] __wait_for_common+0xc6/0x250
> [  186.815556]  [<000003ff800c263a>] vfio_device_remove_group.isra.0+0xb2/0x118 [vfio]
> [  186.815561]  [<000003ff805caadc>] vfio_ap_mdev_remove+0x2c/0x198 [vfio_ap]
> [  186.815565]  [<0000002baef1d4de>] device_release_driver_internal+0x1c6/0x288
> [  186.815570]  [<0000002baef1b27c>] bus_remove_device+0x10c/0x198
> [  186.815572]  [<0000002baef14b54>] device_del+0x19c/0x3e0
> [  186.815575]  [<000003ff800d9e3a>] mdev_device_remove+0xb2/0x108 [mdev]
> [  186.815579]  [<000003ff800da096>] remove_store+0x7e/0x90 [mdev]
> [  186.815581]  [<0000002baea53c30>] kernfs_fop_write_iter+0x138/0x210
> [  186.815586]  [<0000002bae98e310>] vfs_write+0x1a0/0x2f0
> [  186.815588]  [<0000002bae98e6d8>] ksys_write+0x70/0x100
> [  186.815590]  [<0000002baf26fe2c>] __do_syscall+0x1d4/0x200
> [  186.815593]  [<0000002baf27eb42>] system_call+0x82/0xb0

Does some userspace have the group FD open when it stucks like this,
eg what does fuser say?

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 15:40       ` Jason Gunthorpe
@ 2022-10-04 15:44         ` Christian Borntraeger
  2022-10-04 16:28           ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-04 15:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390



Am 04.10.22 um 17:40 schrieb Jason Gunthorpe:
> On Tue, Oct 04, 2022 at 05:19:07PM +0200, Christian Borntraeger wrote:
>> Am 27.09.22 um 22:05 schrieb Alex Williamson:
>>> On Mon, 26 Sep 2022 13:03:56 -0400
>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>
>>>> On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
>>>>> The iommu_group comes from the struct device that a driver has been bound
>>>>> to and then created a struct vfio_device against. To keep the iommu layer
>>>>> sane we want to have a simple rule that only an attached driver should be
>>>>> using the iommu API. Particularly only an attached driver should hold
>>>>> ownership.
>>>>>
>>>>> In VFIO's case since it uses the group APIs and it shares between
>>>>> different drivers it is a bit more complicated, but the principle still
>>>>> holds.
>>>>>
>>>>> Solve this by waiting for all users of the vfio_group to stop before
>>>>> allowing vfio_unregister_group_dev() to complete. This is done with a new
>>>>> completion to know when the users go away and an additional refcount to
>>>>> keep track of how many device drivers are sharing the vfio group. The last
>>>>> driver to be unregistered will clean up the group.
>>>>>
>>>>> This solves crashes in the S390 iommu driver that come because VFIO ends
>>>>> up racing releasing ownership (which attaches the default iommu_domain to
>>>>> the device) with the removal of that same device from the iommu
>>>>> driver. This is a side case that iommu drivers should not have to cope
>>>>> with.
>>>>>
>>>>>      iommu driver failed to attach the default/blocking domain
>>>>>      WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>>>>>      Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>>>>>      CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>>>>>      Hardware name: IBM 3931 A01 782 (LPAR)
>>>>>      Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>>>>>                 R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>>>>      Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>>>>>                 00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>>>>>                 00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>>>>>                 00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>>>>>      Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>>>>>                             000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>>>>>                            #000000095bb10d24: af000000            mc      0,0
>>>>>                            >000000095bb10d28: b904002a            lgr     %r2,%r10
>>>>>                             000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>>>>>                             000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>>>>>                             000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>>>>>                             000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>>>>>      Call Trace:
>>>>>       [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>>>>      ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>>>>       [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>>>>>       [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>>>>       [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>>>>      pci 0004:00:00.0: Removing from iommu group 4
>>>>>       [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>>>>       [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>>>>       [<000000095be6c072>] system_call+0x82/0xb0
>>>>>      Last Breaking-Event-Address:
>>>>>       [<000000095be4bf80>] __warn_printk+0x60/0x68
>>>>>
>>>>> It indicates that domain->ops->attach_dev() failed because the driver has
>>>>> already passed the point of destructing the device.
>>>>>
>>>>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>>>>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>>> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
>>>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>>>> ---
>>>>>    drivers/vfio/vfio.h      |  8 +++++
>>>>>    drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>>>>>    2 files changed, 53 insertions(+), 23 deletions(-)
>>>>>
>>>>> v2
>>>>>    - Rebase on the vfio struct device series and the container.c series
>>>>>    - Drop patches 1 & 2, we need to have working error unwind, so another
>>>>>      test is not a problem
>>>>>    - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>>>>>      that it forms a strict pairing with the two allocation functions.
>>>>>    - Drop the iommu patch from the series, it needs more work and discussion
>>>>> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
>>>>>
>>>>> This could probably use another quick sanity test due to all the rebasing,
>>>>> Alex if you are happy let's wait for Matthew.
>>>>
>>>> I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
>>>
>>> Thanks all.  Applied to vfio next branch for v6.1.  Thanks,
>>
>> So now I have bisected this to a regression in our KVM CI for vfio-ap. Our testcase MultipleMdevAPMatrixTestCase hangs forever.
>> I see  virtnodedevd spinning 100% and "mdevctl stop --uuid=d70d7685-a1b5-47a1-bdea-336925e0a95d" seems to wait for something:
>>
>> [  186.815543] task:mdevctl         state:D stack:    0 pid: 1639 ppid:  1604 flags:0x00000001
>> [  186.815546] Call Trace:
>> [  186.815547]  [<0000002baf277386>] __schedule+0x296/0x650
>> [  186.815549]  [<0000002baf2777a2>] schedule+0x62/0x108
>> [  186.815551]  [<0000002baf27db20>] schedule_timeout+0xc0/0x108
>> [  186.815553]  [<0000002baf278166>] __wait_for_common+0xc6/0x250
>> [  186.815556]  [<000003ff800c263a>] vfio_device_remove_group.isra.0+0xb2/0x118 [vfio]
>> [  186.815561]  [<000003ff805caadc>] vfio_ap_mdev_remove+0x2c/0x198 [vfio_ap]
>> [  186.815565]  [<0000002baef1d4de>] device_release_driver_internal+0x1c6/0x288
>> [  186.815570]  [<0000002baef1b27c>] bus_remove_device+0x10c/0x198
>> [  186.815572]  [<0000002baef14b54>] device_del+0x19c/0x3e0
>> [  186.815575]  [<000003ff800d9e3a>] mdev_device_remove+0xb2/0x108 [mdev]
>> [  186.815579]  [<000003ff800da096>] remove_store+0x7e/0x90 [mdev]
>> [  186.815581]  [<0000002baea53c30>] kernfs_fop_write_iter+0x138/0x210
>> [  186.815586]  [<0000002bae98e310>] vfs_write+0x1a0/0x2f0
>> [  186.815588]  [<0000002bae98e6d8>] ksys_write+0x70/0x100
>> [  186.815590]  [<0000002baf26fe2c>] __do_syscall+0x1d4/0x200
>> [  186.815593]  [<0000002baf27eb42>] system_call+0x82/0xb0
> 
> Does some userspace have the group FD open when it stucks like this,
> eg what does fuser say?

/proc/<virtnodedevd>/fd
51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
65272 0 lr-x------. 1 root root 64  4. Okt 17:42 21 -> anon_inode:inotify
65273 0 lr-x------. 1 root root 64  4. Okt 17:42 23 -> 'pipe:[30158]'
65274 0 lr-x------. 1 root root 64  4. Okt 17:42 25 -> 'pipe:[30159]'
51481 0 lrwx------. 1 root root 64  4. Okt 17:16 3 -> 'socket:[43590]'
65255 0 lrwx------. 1 root root 64  4. Okt 17:42 4 -> 'socket:[43591]'
65256 0 lrwx------. 1 root root 64  4. Okt 17:42 5 -> 'socket:[30947]'
65257 0 lrwx------. 1 root root 64  4. Okt 17:42 6 -> 'socket:[51483]'
65258 0 l-wx------. 1 root root 64  4. Okt 17:42 7 -> /run/virtnodedevd.pid
65259 0 lr-x------. 1 root root 64  4. Okt 17:42 8 -> 'pipe:[51484]'
65260 0 l-wx------. 1 root root 64  4. Okt 17:42 9 -> 'pipe:[51484]'

/proc/mdevctl/fd
59494 0 dr-x------. 2 root root  0  4. Okt 17:16 .
59493 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
59495 0 lrwx------. 1 root root 64  4. Okt 17:16 0 -> /dev/null
59496 0 l-wx------. 1 root root 64  4. Okt 17:16 1 -> 'pipe:[30158]'
59497 0 l-wx------. 1 root root 64  4. Okt 17:16 2 -> 'pipe:[30159]'
59498 0 l-wx------. 1 root root 64  4. Okt 17:16 3 -> /sys/devices/vfio_ap/matrix/d70d7685-a1b5-47a1-bdea-336925e0a95d/remove



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 15:44         ` Christian Borntraeger
@ 2022-10-04 16:28           ` Jason Gunthorpe
  2022-10-04 17:15             ` Christian Borntraeger
  2022-10-04 17:36             ` Christian Borntraeger
  0 siblings, 2 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 16:28 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390

On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:

> > Does some userspace have the group FD open when it stucks like this,
> > eg what does fuser say?
> 
> /proc/<virtnodedevd>/fd
> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'

Seems like a userspace bug to keep the group FD open after the /dev/
file has been deleted :|

What do you think about this?

commit a54a852b1484b1605917a8f4d80691db333b25ed
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Oct 4 13:14:37 2022 -0300

    vfio: Make the group FD disassociate from the iommu_group
    
    Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
    the pointer is NULL the vfio_group users promise not to touch the
    iommu_group. This allows a driver to be hot unplugged while userspace is
    keeping the group FD open.
    
    SPAPR mode is excluded from this behavior because of how it wrongly hacks
    part of its iommu interface through KVM. Due to this we loose control over
    what it is doing and cannot revoke the iommu_group usage in the IOMMU
    layer via vfio_group_detach_container().
    
    Thus, for SPAPR the group FDs must still be closed before a device can be
    hot unplugged.
    
    This fixes a userspace regression where we learned that virtnodedevd
    leaves a group FD open even though the /dev/ node for it has been deleted
    and all the drivers for it unplugged.
    
    Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
    Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 59a28251bb0b97..badc9d828cac20 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		}
 
 		/* Ensure the FD is a vfio group FD.*/
-		if (!vfio_file_iommu_group(file)) {
+		if (!vfio_file_is_group(file)) {
 			fput(file);
 			ret = -EINVAL;
 			break;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 4d2de02f2ced6e..4e10a281420e66 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -59,6 +59,7 @@ struct vfio_group {
 	struct mutex			group_lock;
 	struct kvm			*kvm;
 	struct file			*opened_file;
+	bool				preserve_iommu_group;
 	struct swait_queue_head		opened_file_wait;
 	struct blocking_notifier_head	notifier;
 };
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 9b1e5fd5f7b73c..e725cf38886c09 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
+	/*
+	 * group->iommu_group from the vfio.group_list cannot be NULL
+	 * under the vfio.group_lock.
+	 */
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
 			refcount_inc(&group->drivers);
@@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
 
 	mutex_destroy(&group->device_lock);
 	mutex_destroy(&group->group_lock);
-	iommu_group_put(group->iommu_group);
+	WARN_ON(group->iommu_group);
 	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
 	kfree(group);
 }
@@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 static void vfio_device_remove_group(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group;
 
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
 		iommu_group_remove_device(device->dev);
@@ -266,12 +271,25 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	cdev_device_del(&group->cdev, &group->dev);
 
 	/*
-	 * Before we allow the last driver in the group to be unplugged the
-	 * group must be sanitized so nothing else is or can reference it. This
-	 * is because the group->iommu_group pointer should only be used so long
-	 * as a device driver is attached to a device in the group.
+	 * Revoke all users of group->iommu_group. At this point we know there
+	 * are no devices active because we are unplugging the last one. Setting
+	 * iommu_group to NULL blocks all new users.
 	 */
-	while (group->opened_file) {
+	WARN_ON(group->notifier.head);
+	if (group->container)
+		vfio_group_detach_container(group);
+	iommu_group = group->iommu_group;
+	group->iommu_group = NULL;
+
+	/*
+	 * Normally we can set the iommu_group to NULL above and that will
+	 * prevent any users from touching it. However, the SPAPR kvm path takes
+	 * a reference to the iommu_group and keeps using it in arch code out
+	 * side our control. So if this path is triggred we have no choice but
+	 * to wait for the group FD to be closed to be sure everyone has stopped
+	 * touching the group.
+	 */
+	while (group->preserve_iommu_group && group->opened_file) {
 		mutex_unlock(&vfio.group_lock);
 		swait_event_idle_exclusive(group->opened_file_wait,
 					   !group->opened_file);
@@ -288,8 +306,8 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	WARN_ON(!list_empty(&group->device_list));
 	WARN_ON(group->container || group->container_users);
 	WARN_ON(group->notifier.head);
-	group->iommu_group = NULL;
 
+	iommu_group_put(iommu_group);
 	put_device(&group->dev);
 }
 
@@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	existing_device = vfio_group_get_device(group, device->dev);
 	if (existing_device) {
+		/*
+		 * group->iommu_group is non-NULL because we hold the drivers
+		 * refcount.
+		 */
 		dev_WARN(device->dev, "Device already exists on group %d\n",
 			 iommu_group_id(group->iommu_group));
 		vfio_device_put_registration(existing_device);
@@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
 		ret = -EINVAL;
 		goto out_unlock;
 	}
+	if (!group->iommu_group) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
 	container = vfio_container_from_file(f.file);
 	ret = -EINVAL;
 	if (container) {
@@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
 	status.flags = 0;
 
 	mutex_lock(&group->group_lock);
+	if (!group->iommu_group) {
+		mutex_unlock(&group->group_lock);
+		return -ENODEV;
+	}
+
 	if (group->container)
 		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
 				VFIO_GROUP_FLAGS_VIABLE;
@@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 	filep->private_data = NULL;
 
 	mutex_lock(&group->group_lock);
-	/*
-	 * Device FDs hold a group file reference, therefore the group release
-	 * is only called when there are no open devices.
-	 */
-	WARN_ON(group->notifier.head);
-	if (group->container)
-		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	mutex_unlock(&group->group_lock);
 	swake_up_one(&group->opened_file_wait);
@@ -1553,17 +1578,34 @@ static const struct file_operations vfio_device_fops = {
  * @file: VFIO group file
  *
  * The returned iommu_group is valid as long as a ref is held on the file.
+ * This function is deprecated, only the SPAPR path in kvm should call it.
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file)
 {
 	struct vfio_group *group = file->private_data;
+	struct iommu_group *iommu_group = NULL;
+
+	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
+		return NULL;
 
 	if (file->f_op != &vfio_group_fops)
 		return NULL;
-	return group->iommu_group;
+
+	mutex_lock(&group->group_lock);
+	if (group->iommu_group) {
+		iommu_group = group->iommu_group;
+		group->preserve_iommu_group = true;
+	}
+	mutex_unlock(&group->group_lock);
+	return iommu_group;
 }
 EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
 
+bool vfio_file_is_group(struct file *file)
+{
+	return (file->f_op == &vfio_group_fops;
+}
+
 /**
  * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
  *        is always CPU cache coherent
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 73bcb92179a224..bd9faaab85de18 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
  * External user API
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file);
+bool vfio_file_is_group(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index ce1b01d02c5197..6ecd3aca047375 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -77,6 +77,23 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 	return ret;
 }
 
+static bool kvm_vfio_file_is_group(struct file *file)
+{
+	bool (*fn)(struct file *file);
+	bool ret;
+
+	fn = symbol_get(vfio_file_is_group);
+	if (!fn)
+		return false;
+
+	ret = fn(file);
+
+	symbol_put(vfio_file_is_group);
+
+	return ret;
+}
+
+
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
 					     struct kvm_vfio_group *kvg)
@@ -136,7 +153,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 		return -EBADF;
 
 	/* Ensure the FD is a vfio group FD.*/
-	if (!kvm_vfio_file_iommu_group(filp)) {
+	if (!kvm_vfio_file_is_group(filp)) {
 		ret = -EINVAL;
 		goto err_fput;
 	}

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 16:28           ` Jason Gunthorpe
@ 2022-10-04 17:15             ` Christian Borntraeger
  2022-10-04 17:22               ` Jason Gunthorpe
  2022-10-04 17:36             ` Christian Borntraeger
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-04 17:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390


Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
> 
>>> Does some userspace have the group FD open when it stucks like this,
>>> eg what does fuser say?
>>
>> /proc/<virtnodedevd>/fd
>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
> 
> Seems like a userspace bug to keep the group FD open after the /dev/
> file has been deleted :|
> 
> What do you think about this?

On top of which tree is this?

> 
> commit a54a852b1484b1605917a8f4d80691db333b25ed
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>      vfio: Make the group FD disassociate from the iommu_group
>      
>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>      the pointer is NULL the vfio_group users promise not to touch the
>      iommu_group. This allows a driver to be hot unplugged while userspace is
>      keeping the group FD open.
>      
>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>      part of its iommu interface through KVM. Due to this we loose control over
>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>      layer via vfio_group_detach_container().
>      
>      Thus, for SPAPR the group FDs must still be closed before a device can be
>      hot unplugged.
>      
>      This fixes a userspace regression where we learned that virtnodedevd
>      leaves a group FD open even though the /dev/ node for it has been deleted
>      and all the drivers for it unplugged.
>      
>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 59a28251bb0b97..badc9d828cac20 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>   		}
>   
>   		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_iommu_group(file)) {
> +		if (!vfio_file_is_group(file)) {
>   			fput(file);
>   			ret = -EINVAL;
>   			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 4d2de02f2ced6e..4e10a281420e66 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -59,6 +59,7 @@ struct vfio_group {
>   	struct mutex			group_lock;
>   	struct kvm			*kvm;
>   	struct file			*opened_file;
> +	bool				preserve_iommu_group;
>   	struct swait_queue_head		opened_file_wait;
>   	struct blocking_notifier_head	notifier;
>   };
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9b1e5fd5f7b73c..e725cf38886c09 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>   {
>   	struct vfio_group *group;
>   
> +	/*
> +	 * group->iommu_group from the vfio.group_list cannot be NULL
> +	 * under the vfio.group_lock.
> +	 */
>   	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>   		if (group->iommu_group == iommu_group) {
>   			refcount_inc(&group->drivers);
> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>   
>   	mutex_destroy(&group->device_lock);
>   	mutex_destroy(&group->group_lock);
> -	iommu_group_put(group->iommu_group);
> +	WARN_ON(group->iommu_group);
>   	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>   	kfree(group);
>   }
> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>   static void vfio_device_remove_group(struct vfio_device *device)
>   {
>   	struct vfio_group *group = device->group;
> +	struct iommu_group *iommu_group;
>   
>   	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>   		iommu_group_remove_device(device->dev);
> @@ -266,12 +271,25 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	cdev_device_del(&group->cdev, &group->dev);
>   
>   	/*
> -	 * Before we allow the last driver in the group to be unplugged the
> -	 * group must be sanitized so nothing else is or can reference it. This
> -	 * is because the group->iommu_group pointer should only be used so long
> -	 * as a device driver is attached to a device in the group.
> +	 * Revoke all users of group->iommu_group. At this point we know there
> +	 * are no devices active because we are unplugging the last one. Setting
> +	 * iommu_group to NULL blocks all new users.
>   	 */
> -	while (group->opened_file) {
> +	WARN_ON(group->notifier.head);
> +	if (group->container)
> +		vfio_group_detach_container(group);
> +	iommu_group = group->iommu_group;
> +	group->iommu_group = NULL;
> +
> +	/*
> +	 * Normally we can set the iommu_group to NULL above and that will
> +	 * prevent any users from touching it. However, the SPAPR kvm path takes
> +	 * a reference to the iommu_group and keeps using it in arch code out
> +	 * side our control. So if this path is triggred we have no choice but
> +	 * to wait for the group FD to be closed to be sure everyone has stopped
> +	 * touching the group.
> +	 */
> +	while (group->preserve_iommu_group && group->opened_file) {
>   		mutex_unlock(&vfio.group_lock);
>   		swait_event_idle_exclusive(group->opened_file_wait,
>   					   !group->opened_file);
> @@ -288,8 +306,8 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	WARN_ON(!list_empty(&group->device_list));
>   	WARN_ON(group->container || group->container_users);
>   	WARN_ON(group->notifier.head);
> -	group->iommu_group = NULL;
>   
> +	iommu_group_put(iommu_group);
>   	put_device(&group->dev);
>   }
>   
> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>   
>   	existing_device = vfio_group_get_device(group, device->dev);
>   	if (existing_device) {
> +		/*
> +		 * group->iommu_group is non-NULL because we hold the drivers
> +		 * refcount.
> +		 */
>   		dev_WARN(device->dev, "Device already exists on group %d\n",
>   			 iommu_group_id(group->iommu_group));
>   		vfio_device_put_registration(existing_device);
> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>   		ret = -EINVAL;
>   		goto out_unlock;
>   	}
> +	if (!group->iommu_group) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
>   	container = vfio_container_from_file(f.file);
>   	ret = -EINVAL;
>   	if (container) {
> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>   	status.flags = 0;
>   
>   	mutex_lock(&group->group_lock);
> +	if (!group->iommu_group) {
> +		mutex_unlock(&group->group_lock);
> +		return -ENODEV;
> +	}
> +
>   	if (group->container)
>   		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>   				VFIO_GROUP_FLAGS_VIABLE;
> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>   	filep->private_data = NULL;
>   
>   	mutex_lock(&group->group_lock);
> -	/*
> -	 * Device FDs hold a group file reference, therefore the group release
> -	 * is only called when there are no open devices.
> -	 */
> -	WARN_ON(group->notifier.head);
> -	if (group->container)
> -		vfio_group_detach_container(group);
>   	group->opened_file = NULL;
>   	mutex_unlock(&group->group_lock);
>   	swake_up_one(&group->opened_file_wait);
> @@ -1553,17 +1578,34 @@ static const struct file_operations vfio_device_fops = {
>    * @file: VFIO group file
>    *
>    * The returned iommu_group is valid as long as a ref is held on the file.
> + * This function is deprecated, only the SPAPR path in kvm should call it.
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file)
>   {
>   	struct vfio_group *group = file->private_data;
> +	struct iommu_group *iommu_group = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> +		return NULL;
>   
>   	if (file->f_op != &vfio_group_fops)
>   		return NULL;
> -	return group->iommu_group;
> +
> +	mutex_lock(&group->group_lock);
> +	if (group->iommu_group) {
> +		iommu_group = group->iommu_group;
> +		group->preserve_iommu_group = true;
> +	}
> +	mutex_unlock(&group->group_lock);
> +	return iommu_group;
>   }
>   EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>   
> +bool vfio_file_is_group(struct file *file)
> +{
> +	return (file->f_op == &vfio_group_fops;
> +}
> +
>   /**
>    * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>    *        is always CPU cache coherent
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 73bcb92179a224..bd9faaab85de18 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>    * External user API
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file);
> +bool vfio_file_is_group(struct file *file);
>   bool vfio_file_enforced_coherent(struct file *file);
>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>   bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index ce1b01d02c5197..6ecd3aca047375 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -77,6 +77,23 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>   	return ret;
>   }
>   
> +static bool kvm_vfio_file_is_group(struct file *file)
> +{
> +	bool (*fn)(struct file *file);
> +	bool ret;
> +
> +	fn = symbol_get(vfio_file_is_group);
> +	if (!fn)
> +		return false;
> +
> +	ret = fn(file);
> +
> +	symbol_put(vfio_file_is_group);
> +
> +	return ret;
> +}
> +
> +
>   #ifdef CONFIG_SPAPR_TCE_IOMMU
>   static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>   					     struct kvm_vfio_group *kvg)
> @@ -136,7 +153,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>   		return -EBADF;
>   
>   	/* Ensure the FD is a vfio group FD.*/
> -	if (!kvm_vfio_file_iommu_group(filp)) {
> +	if (!kvm_vfio_file_is_group(filp)) {
>   		ret = -EINVAL;
>   		goto err_fput;
>   	}

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 17:15             ` Christian Borntraeger
@ 2022-10-04 17:22               ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 17:22 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390

On Tue, Oct 04, 2022 at 07:15:51PM +0200, Christian Borntraeger wrote:
> 
> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> > On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
> > 
> > > > Does some userspace have the group FD open when it stucks like this,
> > > > eg what does fuser say?
> > > 
> > > /proc/<virtnodedevd>/fd
> > > 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
> > > 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
> > > 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
> > > 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
> > > 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
> > > 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
> > > 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
> > > 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
> > > 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
> > > 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
> > > 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
> > > 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
> > > 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
> > > 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
> > > 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
> > > 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
> > 
> > Seems like a userspace bug to keep the group FD open after the /dev/
> > file has been deleted :|
> > 
> > What do you think about this?
> 
> On top of which tree is this?

It should apply on vfio-next

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 16:28           ` Jason Gunthorpe
  2022-10-04 17:15             ` Christian Borntraeger
@ 2022-10-04 17:36             ` Christian Borntraeger
  2022-10-04 17:48               ` Christian Borntraeger
  2022-10-04 18:22               ` Matthew Rosato
  1 sibling, 2 replies; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-04 17:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Matthew Rosato, Tony Krowiak, Jason J . Herne,
	Marc Hartmayer, Eric Farman, Cornelia Huck, kvm, Qian Cai,
	Joerg Roedel, Marek Szyprowski, linux-s390



Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
> 
>>> Does some userspace have the group FD open when it stucks like this,
>>> eg what does fuser say?
>>
>> /proc/<virtnodedevd>/fd
>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
> 
> Seems like a userspace bug to keep the group FD open after the /dev/
> file has been deleted :|
> 
> What do you think about this?
> 
> commit a54a852b1484b1605917a8f4d80691db333b25ed
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>      vfio: Make the group FD disassociate from the iommu_group
>      
>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>      the pointer is NULL the vfio_group users promise not to touch the
>      iommu_group. This allows a driver to be hot unplugged while userspace is
>      keeping the group FD open.
>      
>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>      part of its iommu interface through KVM. Due to this we loose control over
>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>      layer via vfio_group_detach_container().
>      
>      Thus, for SPAPR the group FDs must still be closed before a device can be
>      hot unplugged.
>      
>      This fixes a userspace regression where we learned that virtnodedevd
>      leaves a group FD open even though the /dev/ node for it has been deleted
>      and all the drivers for it unplugged.
>      
>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Almost :-)

drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
  1606 |         return (file->f_op == &vfio_group_fops;
       |                ~                              ^
       |                                               )
drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
  1606 |         return (file->f_op == &vfio_group_fops;
       |                                                ^
       |                                                ;
  1607 | }
       | ~


With that fixed I get:

ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!

With that worked around (m -> y)


Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>

At least the vfio-ap part

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 17:36             ` Christian Borntraeger
@ 2022-10-04 17:48               ` Christian Borntraeger
  2022-10-04 18:22               ` Matthew Rosato
  1 sibling, 0 replies; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-04 17:48 UTC (permalink / raw)
  To: Matthew Rosato, Eric Farman
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Cornelia Huck, kvm, Qian Cai, Joerg Roedel, Marek Szyprowski,
	linux-s390, Jason Gunthorpe



Am 04.10.22 um 19:36 schrieb Christian Borntraeger:
> 
> 
> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
>> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
>>
>>>> Does some userspace have the group FD open when it stucks like this,
>>>> eg what does fuser say?
>>>
>>> /proc/<virtnodedevd>/fd
>>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
>>
>> Seems like a userspace bug to keep the group FD open after the /dev/
>> file has been deleted :|
>>
>> What do you think about this?
>>
>> commit a54a852b1484b1605917a8f4d80691db333b25ed
>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>
>>      vfio: Make the group FD disassociate from the iommu_group
>>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>      the pointer is NULL the vfio_group users promise not to touch the
>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>      keeping the group FD open.
>>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>      part of its iommu interface through KVM. Due to this we loose control over
>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>      layer via vfio_group_detach_container().
>>      Thus, for SPAPR the group FDs must still be closed before a device can be
>>      hot unplugged.
>>      This fixes a userspace regression where we learned that virtnodedevd
>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>      and all the drivers for it unplugged.
>>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Almost :-)
> 
> drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
> drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
>   1606 |         return (file->f_op == &vfio_group_fops;
>        |                ~                              ^
>        |                                               )
> drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
>   1606 |         return (file->f_op == &vfio_group_fops;
>        |                                                ^
>        |                                                ;
>   1607 | }
>        | ~
> 
> 
> With that fixed I get:
> 
> ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
> 
> With that worked around (m -> y)
> 
> 
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> 
> At least the vfio-ap part

Matthew, Eric, can you verify this on top of vfio-next with vfio-ccw and vfio pci?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 17:36             ` Christian Borntraeger
  2022-10-04 17:48               ` Christian Borntraeger
@ 2022-10-04 18:22               ` Matthew Rosato
  2022-10-04 18:56                 ` Eric Farman
  2022-10-05 13:46                 ` Matthew Rosato
  1 sibling, 2 replies; 30+ messages in thread
From: Matthew Rosato @ 2022-10-04 18:22 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390

On 10/4/22 1:36 PM, Christian Borntraeger wrote:
> 
> 
> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
>> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
>>
>>>> Does some userspace have the group FD open when it stucks like this,
>>>> eg what does fuser say?
>>>
>>> /proc/<virtnodedevd>/fd
>>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
>>
>> Seems like a userspace bug to keep the group FD open after the /dev/
>> file has been deleted :|
>>
>> What do you think about this?
>>
>> commit a54a852b1484b1605917a8f4d80691db333b25ed
>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>
>>      vfio: Make the group FD disassociate from the iommu_group
>>           Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>      the pointer is NULL the vfio_group users promise not to touch the
>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>      keeping the group FD open.
>>           SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>      part of its iommu interface through KVM. Due to this we loose control over
>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>      layer via vfio_group_detach_container().
>>           Thus, for SPAPR the group FDs must still be closed before a device can be
>>      hot unplugged.
>>           This fixes a userspace regression where we learned that virtnodedevd
>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>      and all the drivers for it unplugged.
>>           Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Almost :-)
> 
> drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
> drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
>  1606 |         return (file->f_op == &vfio_group_fops;
>       |                ~                              ^
>       |                                               )
> drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
>  1606 |         return (file->f_op == &vfio_group_fops;
>       |                                                ^
>       |                                                ;
>  1607 | }
>       | ~
> 
> 
> With that fixed I get:
> 
> ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
> 
> With that worked around (m -> y)


Looks like this can be solved with EXPORT_SYMBOL_GPL(vfio_file_is_group);

Also:

arch/s390/kvm/../../../virt/kvm/vfio.c:64:28: warning: ‘kvm_vfio_file_iommu_group’ defined but not used [-Wunused-function]
   64 | static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
      |                            ^~~~~~~~~~~~~~~~~~~~~~~~~

kvm_vfio_file_iommu_group looks like it is now SPAPR-only

> 
> 
> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> 
> At least the vfio-ap part

Nope, with this s390 vfio-pci at least breaks:

[  132.943389] kernel BUG at lib/list_debug.c:53!
[  132.943406] monitor event: 0040 ilc:2 [#1] SMP 
[  132.943410] Modules linked in: vfio_pci kvm vfio_pci_core irqbypass vfio_virqfd vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defr
ag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ism smc ib_uverbs ib_core uvdevice s390_trng tape_3590 tape tape_class eadm_sch vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha mlx5_core aes_s390 des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sh
a256_s390 nvme_core sha1_s390 sha_common zfcp scsi_transport_fc pkey zcrypt rng_core autofs4 [last unloaded: vfio_pci]
[  132.943457] CPU: 12 PID: 4991 Comm: nose2 Tainted: G        W          6.0.0-rc4 #40
[  132.943460] Hardware name: IBM 3931 A01 782 (LPAR)
[  132.943462] Krnl PSW : 0704c00180000000 00000000cbc90568 (__list_del_entry_valid+0xd8/0xf0)
[  132.943469]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  132.943474] Krnl GPRS: 8000000000000001 0000000900000027 000000000000004e 00000000ccc1ffe0
[  132.943477]            00000000fffeffff 00000009fc290000 0000000000000000 0000000000000080
[  132.943480]            00000000acc86438 0000000000000000 00000000acc86420 00000000a1492800
[  132.943483]            00000000922a0000 000003ffb9dce260 00000000cbc90564 0000038004a6b9f8
[  132.943489] Krnl Code: 00000000cbc90558: c0200045eff3        larl    %r2,00000000cc54e53e
[  132.943489]            00000000cbc9055e: c0e50022c7d9        brasl   %r14,00000000cc0e9510
[  132.943489]           #00000000cbc90564: af000000            mc      0,0
[  132.943489]           >00000000cbc90568: b9040032            lgr     %r3,%r2
[  132.943489]            00000000cbc9056c: c0200045efd4        larl    %r2,00000000cc54e514
[  132.943489]            00000000cbc90572: c0e50022c7cf        brasl   %r14,00000000cc0e9510
[  132.943489]            00000000cbc90578: af000000            mc      0,0
[  132.943489]            00000000cbc9057c: 0707                bcr     0,%r7
[  132.943510] Call Trace:
[  132.943512]  [<00000000cbc90568>] __list_del_entry_valid+0xd8/0xf0 
[  132.943515] ([<00000000cbc90564>] __list_del_entry_valid+0xd4/0xf0)
[  132.943518]  [<000003ff8011a1b8>] vfio_group_detach_container+0x88/0x170 [vfio] 
[  132.943524]  [<000003ff801176c0>] vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
[  132.943529]  [<000003ff804f9e54>] vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
[  132.943535]  [<000003ff804ae1c4>] vfio_pci_remove+0x2c/0x40 [vfio_pci] 
[  132.943539]  [<00000000cbd58c3c>] pci_device_remove+0x3c/0x98 
[  132.943542]  [<00000000cbdbdbce>] device_release_driver_internal+0x1c6/0x288 
[  132.943545]  [<00000000cbd4e284>] pci_stop_bus_device+0x94/0xc0 
[  132.943549]  [<00000000cbd4e570>] pci_stop_and_remove_bus_device_locked+0x30/0x48 
[  132.943552]  [<00000000cb55d980>] zpci_bus_remove_device+0x68/0xa8 
[  132.943555]  [<00000000cb556e82>] zpci_deconfigure_device+0x3a/0xe0 
[  132.943558]  [<00000000cbd65d04>] power_write_file+0x7c/0x130 
[  132.943561]  [<00000000cb8fbc90>] kernfs_fop_write_iter+0x138/0x210 
[  132.943565]  [<00000000cb837344>] vfs_write+0x194/0x2e0 "
[  132.943568]  [<00000000cb8376fa>] ksys_write+0x6a/0xf8 
[  132.943571]  [<00000000cc0f918c>] __do_syscall+0x1d4/0x200 
[  132.943575]  [<00000000cc107e42>] system_call+0x82/0xb0 
[  132.943577] Last Breaking-Event-Address:
[  132.943579]  [<00000000cc0e955c>] _printk+0x4c/0x58
[  132.943585] Kernel panic - not syncing: Fatal exception: panic_on_oops

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 18:22               ` Matthew Rosato
@ 2022-10-04 18:56                 ` Eric Farman
  2022-10-05 13:46                 ` Matthew Rosato
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Farman @ 2022-10-04 18:56 UTC (permalink / raw)
  To: Matthew Rosato, Christian Borntraeger, Jason Gunthorpe
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Cornelia Huck, kvm, Qian Cai, Joerg Roedel, Marek Szyprowski,
	linux-s390

On Tue, 2022-10-04 at 14:22 -0400, Matthew Rosato wrote:
> On 10/4/22 1:36 PM, Christian Borntraeger wrote:
> > 
> > 
> > Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
> > > On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger
> > > wrote:
> > > 
> > > > > Does some userspace have the group FD open when it stucks
> > > > > like this,
> > > > > eg what does fuser say?
> > > > 
> > > > /proc/<virtnodedevd>/fd
> > > > 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
> > > > 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
> > > > 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
> > > > 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 ->
> > > > 'socket:[51479]'
> > > > 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 ->
> > > > 'anon_inode:[eventfd]'
> > > > 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 ->
> > > > 'socket:[51485]'
> > > > 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 ->
> > > > 'socket:[51487]'
> > > > 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 ->
> > > > 'socket:[51486]'
> > > > 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 ->
> > > > 'anon_inode:[eventfd]'
> > > > 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 ->
> > > > 'socket:[60421]'
> > > > 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 ->
> > > > 'anon_inode:[eventfd]'
> > > > 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 ->
> > > > 'socket:[28008]'
> > > > 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 ->
> > > > /run/libvirt/nodedev/driver.pid
> > > > 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 ->
> > > > 'socket:[28818]'
> > > > 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 ->
> > > > 'socket:[51479]'
> > > > 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 ->
> > > > '/dev/vfio/3 (deleted)'
> > > 
> > > Seems like a userspace bug to keep the group FD open after the
> > > /dev/
> > > file has been deleted :|
> > > 
> > > What do you think about this?
> > > 
> > > commit a54a852b1484b1605917a8f4d80691db333b25ed
> > > Author: Jason Gunthorpe <jgg@ziepe.ca>
> > > Date:   Tue Oct 4 13:14:37 2022 -0300
> > > 
> > >      vfio: Make the group FD disassociate from the iommu_group
> > >           Allow the vfio_group struct to exist with a NULL
> > > iommu_group pointer. When
> > >      the pointer is NULL the vfio_group users promise not to
> > > touch the
> > >      iommu_group. This allows a driver to be hot unplugged while
> > > userspace is
> > >      keeping the group FD open.
> > >           SPAPR mode is excluded from this behavior because of
> > > how it wrongly hacks
> > >      part of its iommu interface through KVM. Due to this we
> > > loose control over
> > >      what it is doing and cannot revoke the iommu_group usage in
> > > the IOMMU
> > >      layer via vfio_group_detach_container().
> > >           Thus, for SPAPR the group FDs must still be closed
> > > before a device can be
> > >      hot unplugged.
> > >           This fixes a userspace regression where we learned that
> > > virtnodedevd
> > >      leaves a group FD open even though the /dev/ node for it has
> > > been deleted
> > >      and all the drivers for it unplugged.
> > >           Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime
> > > for struct iommu_group")
> > >      Reported-by: Christian Borntraeger
> > > <borntraeger@linux.ibm.com>
> > >      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > Almost :-)
> > 
> > drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
> > drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';'
> > token
> >  1606 |         return (file->f_op == &vfio_group_fops;
> >       |                ~                              ^
> >       |                                               )
> > drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}'
> > token
> >  1606 |         return (file->f_op == &vfio_group_fops;
> >       |                                                ^
> >       |                                                ;
> >  1607 | }
> >       | ~
> > 
> > 
> > With that fixed I get:
> > 
> > ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-
> > core.ko] undefined!
> > 
> > With that worked around (m -> y)
> 
> 
> Looks like this can be solved with
> EXPORT_SYMBOL_GPL(vfio_file_is_group);
> 
> Also:
> 
> arch/s390/kvm/../../../virt/kvm/vfio.c:64:28: warning:
> ‘kvm_vfio_file_iommu_group’ defined but not used [-Wunused-function]
>    64 | static struct iommu_group *kvm_vfio_file_iommu_group(struct
> file *file)
>       |                            ^~~~~~~~~~~~~~~~~~~~~~~~~
> 
> kvm_vfio_file_iommu_group looks like it is now SPAPR-only
> 
> > 
> > 
> > Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
> > 
> > At least the vfio-ap part
> 

I can reproduce the problem with vfio-ccw, also blamed to this patch.
With the changes described above, things work as they did before.

Tested-by: Eric Farman <farman@linux.ibm.com> # vfio-ccw, vfio-ap

I can try a v2 when the below gets addressed.

> Nope, with this s390 vfio-pci at least breaks:
> 
> [  132.943389] kernel BUG at lib/list_debug.c:53!
> [  132.943406] monitor event: 0040 ilc:2 [#1] SMP 
> [  132.943410] Modules linked in: vfio_pci kvm vfio_pci_core
> irqbypass vfio_virqfd vhost_vsock vmw_vsock_virtio_transport_common
> vsock vhost vhost_iotlb nft_fib_inet nft_fib_ipv4 nft_fib_ipv6
> nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject
> nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defr
> ag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ism smc ib_uverbs
> ib_core uvdevice s390_trng tape_3590 tape tape_class eadm_sch
> vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs
> ghash_s390 prng chacha_s390 libchacha mlx5_core aes_s390 des_s390
> libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sh
> a256_s390 nvme_core sha1_s390 sha_common zfcp scsi_transport_fc pkey
> zcrypt rng_core autofs4 [last unloaded: vfio_pci]
> [  132.943457] CPU: 12 PID: 4991 Comm: nose2 Tainted: G       
> W          6.0.0-rc4 #40
> [  132.943460] Hardware name: IBM 3931 A01 782 (LPAR)
> [  132.943462] Krnl PSW : 0704c00180000000 00000000cbc90568
> (__list_del_entry_valid+0xd8/0xf0)
> [  132.943469]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3
> CC:0 PM:0 RI:0 EA:3
> [  132.943474] Krnl GPRS: 8000000000000001 0000000900000027
> 000000000000004e 00000000ccc1ffe0
> [  132.943477]            00000000fffeffff 00000009fc290000
> 0000000000000000 0000000000000080
> [  132.943480]            00000000acc86438 0000000000000000
> 00000000acc86420 00000000a1492800
> [  132.943483]            00000000922a0000 000003ffb9dce260
> 00000000cbc90564 0000038004a6b9f8
> [  132.943489] Krnl Code: 00000000cbc90558: c0200045eff3       
> larl    %r2,00000000cc54e53e
> [  132.943489]            00000000cbc9055e: c0e50022c7d9       
> brasl   %r14,00000000cc0e9510
> [  132.943489]           #00000000cbc90564: af000000           
> mc      0,0
> [  132.943489]           >00000000cbc90568: b9040032           
> lgr     %r3,%r2
> [  132.943489]            00000000cbc9056c: c0200045efd4       
> larl    %r2,00000000cc54e514
> [  132.943489]            00000000cbc90572: c0e50022c7cf       
> brasl   %r14,00000000cc0e9510
> [  132.943489]            00000000cbc90578: af000000           
> mc      0,0
> [  132.943489]            00000000cbc9057c: 0707               
> bcr     0,%r7
> [  132.943510] Call Trace:
> [  132.943512]  [<00000000cbc90568>] __list_del_entry_valid+0xd8/0xf0
> [  132.943515] ([<00000000cbc90564>]
> __list_del_entry_valid+0xd4/0xf0)
> [  132.943518]  [<000003ff8011a1b8>]
> vfio_group_detach_container+0x88/0x170 [vfio] 
> [  132.943524]  [<000003ff801176c0>]
> vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
> [  132.943529]  [<000003ff804f9e54>]
> vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
> [  132.943535]  [<000003ff804ae1c4>] vfio_pci_remove+0x2c/0x40
> [vfio_pci] 
> [  132.943539]  [<00000000cbd58c3c>] pci_device_remove+0x3c/0x98 
> [  132.943542]  [<00000000cbdbdbce>]
> device_release_driver_internal+0x1c6/0x288 
> [  132.943545]  [<00000000cbd4e284>] pci_stop_bus_device+0x94/0xc0 
> [  132.943549]  [<00000000cbd4e570>]
> pci_stop_and_remove_bus_device_locked+0x30/0x48 
> [  132.943552]  [<00000000cb55d980>] zpci_bus_remove_device+0x68/0xa8
> [  132.943555]  [<00000000cb556e82>]
> zpci_deconfigure_device+0x3a/0xe0 
> [  132.943558]  [<00000000cbd65d04>] power_write_file+0x7c/0x130 
> [  132.943561]  [<00000000cb8fbc90>]
> kernfs_fop_write_iter+0x138/0x210 
> [  132.943565]  [<00000000cb837344>] vfs_write+0x194/0x2e0 "
> [  132.943568]  [<00000000cb8376fa>] ksys_write+0x6a/0xf8 
> [  132.943571]  [<00000000cc0f918c>] __do_syscall+0x1d4/0x200 
> [  132.943575]  [<00000000cc107e42>] system_call+0x82/0xb0 
> [  132.943577] Last Breaking-Event-Address:
> [  132.943579]  [<00000000cc0e955c>] _printk+0x4c/0x58
> [  132.943585] Kernel panic - not syncing: Fatal exception:
> panic_on_oops


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-09-26 17:03 ` Matthew Rosato
  2022-09-27 20:05   ` Alex Williamson
@ 2022-10-04 19:59   ` Matthew Rosato
  2022-10-04 20:19     ` Alex Williamson
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Rosato @ 2022-10-04 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, kvm
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski

On 9/26/22 1:03 PM, Matthew Rosato wrote:
> On 9/22/22 8:06 PM, Jason Gunthorpe wrote:
>> The iommu_group comes from the struct device that a driver has been bound
>> to and then created a struct vfio_device against. To keep the iommu layer
>> sane we want to have a simple rule that only an attached driver should be
>> using the iommu API. Particularly only an attached driver should hold
>> ownership.
>>
>> In VFIO's case since it uses the group APIs and it shares between
>> different drivers it is a bit more complicated, but the principle still
>> holds.
>>
>> Solve this by waiting for all users of the vfio_group to stop before
>> allowing vfio_unregister_group_dev() to complete. This is done with a new
>> completion to know when the users go away and an additional refcount to
>> keep track of how many device drivers are sharing the vfio group. The last
>> driver to be unregistered will clean up the group.
>>
>> This solves crashes in the S390 iommu driver that come because VFIO ends
>> up racing releasing ownership (which attaches the default iommu_domain to
>> the device) with the removal of that same device from the iommu
>> driver. This is a side case that iommu drivers should not have to cope
>> with.
>>
>>    iommu driver failed to attach the default/blocking domain
>>    WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>>    Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>>    CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>>    Hardware name: IBM 3931 A01 782 (LPAR)
>>    Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>>               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>    Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>>               00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>>               00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>>               00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>>    Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>>                           000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>>                          #000000095bb10d24: af000000            mc      0,0
>>                          >000000095bb10d28: b904002a            lgr     %r2,%r10
>>                           000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>>                           000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>>                           000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>>                           000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>>    Call Trace:
>>     [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>    ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>     [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>>     [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>     [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>    pci 0004:00:00.0: Removing from iommu group 4
>>     [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>     [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>     [<000000095be6c072>] system_call+0x82/0xb0
>>    Last Breaking-Event-Address:
>>     [<000000095be4bf80>] __warn_printk+0x60/0x68
>>
>> It indicates that domain->ops->attach_dev() failed because the driver has
>> already passed the point of destructing the device.
>>
>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>>  drivers/vfio/vfio.h      |  8 +++++
>>  drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
>>  2 files changed, 53 insertions(+), 23 deletions(-)
>>
>> v2
>>  - Rebase on the vfio struct device series and the container.c series
>>  - Drop patches 1 & 2, we need to have working error unwind, so another
>>    test is not a problem
>>  - Fold iommu_group_remove_device() into vfio_device_remove_group() so
>>    that it forms a strict pairing with the two allocation functions.
>>  - Drop the iommu patch from the series, it needs more work and discussion
>> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
>>
>> This could probably use another quick sanity test due to all the rebasing,
>> Alex if you are happy let's wait for Matthew.
>>
> 
> I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
> 

Hmm, there's more going on with this patch besides the issues with -ap and -ccw.  While it does indeed resolve the crashes I had been seeing, I just now noticed that I see monotonically increasing iommu group IDs (implying we are not calling iommu_group_release as much as we should be) when running the same testscase that would previously trigger the occasional crash (host device is powered off):

e.g. before this patch I would see:
[  156.735855] pci 0003:00:00.0: Removing from iommu group 3
[  160.299238] pci 0003:00:00.0: Adding to iommu group 3
[  182.185975] pci 0003:00:00.0: Removing from iommu group 3
[  185.770472] pci 0003:00:00.0: Adding to iommu group 3
[  188.065652] pci 0003:00:00.0: Removing from iommu group 3
[  191.590985] pci 0003:00:00.0: Adding to iommu group 3


And now after this patch I see:
[  115.091093] pci 0003:00:00.0: Removing from iommu group 3
[  118.653818] pci 0003:00:00.0: Adding to iommu group 5
[  139.721061] pci 0003:00:00.0: Removing from iommu group 5
[  143.281589] pci 0003:00:00.0: Adding to iommu group 6
[  162.651073] pci 0003:00:00.0: Removing from iommu group 6
[  166.212440] pci 0003:00:00.0: Adding to iommu group 7


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 19:59   ` Matthew Rosato
@ 2022-10-04 20:19     ` Alex Williamson
  2022-10-04 22:30       ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Williamson @ 2022-10-04 20:19 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Jason Gunthorpe, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski

On Tue, 4 Oct 2022 15:59:12 -0400
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 9/26/22 1:03 PM, Matthew Rosato wrote:
> > On 9/22/22 8:06 PM, Jason Gunthorpe wrote:  
> >> The iommu_group comes from the struct device that a driver has been bound
> >> to and then created a struct vfio_device against. To keep the iommu layer
> >> sane we want to have a simple rule that only an attached driver should be
> >> using the iommu API. Particularly only an attached driver should hold
> >> ownership.
> >>
> >> In VFIO's case since it uses the group APIs and it shares between
> >> different drivers it is a bit more complicated, but the principle still
> >> holds.
> >>
> >> Solve this by waiting for all users of the vfio_group to stop before
> >> allowing vfio_unregister_group_dev() to complete. This is done with a new
> >> completion to know when the users go away and an additional refcount to
> >> keep track of how many device drivers are sharing the vfio group. The last
> >> driver to be unregistered will clean up the group.
> >>
> >> This solves crashes in the S390 iommu driver that come because VFIO ends
> >> up racing releasing ownership (which attaches the default iommu_domain to
> >> the device) with the removal of that same device from the iommu
> >> driver. This is a side case that iommu drivers should not have to cope
> >> with.
> >>
> >>    iommu driver failed to attach the default/blocking domain
> >>    WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
> >>    Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
> >>    CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
> >>    Hardware name: IBM 3931 A01 782 (LPAR)
> >>    Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
> >>               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> >>    Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
> >>               00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
> >>               00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
> >>               00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
> >>    Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
> >>                           000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
> >>                          #000000095bb10d24: af000000            mc      0,0  
> >>                          >000000095bb10d28: b904002a            lgr     %r2,%r10  
> >>                           000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
> >>                           000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
> >>                           000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
> >>                           000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
> >>    Call Trace:
> >>     [<000000095bb10d28>] iommu_detach_group+0x70/0x80
> >>    ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
> >>     [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
> >>     [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
> >>     [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
> >>    pci 0004:00:00.0: Removing from iommu group 4
> >>     [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
> >>     [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
> >>     [<000000095be6c072>] system_call+0x82/0xb0
> >>    Last Breaking-Event-Address:
> >>     [<000000095be4bf80>] __warn_printk+0x60/0x68
> >>
> >> It indicates that domain->ops->attach_dev() failed because the driver has
> >> already passed the point of destructing the device.
> >>
> >> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> >> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> >> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>
> >> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >> ---
> >>  drivers/vfio/vfio.h      |  8 +++++
> >>  drivers/vfio/vfio_main.c | 68 ++++++++++++++++++++++++++--------------
> >>  2 files changed, 53 insertions(+), 23 deletions(-)
> >>
> >> v2
> >>  - Rebase on the vfio struct device series and the container.c series
> >>  - Drop patches 1 & 2, we need to have working error unwind, so another
> >>    test is not a problem
> >>  - Fold iommu_group_remove_device() into vfio_device_remove_group() so
> >>    that it forms a strict pairing with the two allocation functions.
> >>  - Drop the iommu patch from the series, it needs more work and discussion
> >> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> >>
> >> This could probably use another quick sanity test due to all the rebasing,
> >> Alex if you are happy let's wait for Matthew.
> >>  
> > 
> > I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!
> >   
> 
> Hmm, there's more going on with this patch besides the issues with -ap and -ccw.  While it does indeed resolve the crashes I had been seeing, I just now noticed that I see monotonically increasing iommu group IDs (implying we are not calling iommu_group_release as much as we should be) when running the same testscase that would previously trigger the occasional crash (host device is powered off):
> 
> e.g. before this patch I would see:
> [  156.735855] pci 0003:00:00.0: Removing from iommu group 3
> [  160.299238] pci 0003:00:00.0: Adding to iommu group 3
> [  182.185975] pci 0003:00:00.0: Removing from iommu group 3
> [  185.770472] pci 0003:00:00.0: Adding to iommu group 3
> [  188.065652] pci 0003:00:00.0: Removing from iommu group 3
> [  191.590985] pci 0003:00:00.0: Adding to iommu group 3
> 
> 
> And now after this patch I see:
> [  115.091093] pci 0003:00:00.0: Removing from iommu group 3
> [  118.653818] pci 0003:00:00.0: Adding to iommu group 5
> [  139.721061] pci 0003:00:00.0: Removing from iommu group 5
> [  143.281589] pci 0003:00:00.0: Adding to iommu group 6
> [  162.651073] pci 0003:00:00.0: Removing from iommu group 6
> [  166.212440] pci 0003:00:00.0: Adding to iommu group 7

I need to break my next branch anyway to correct a Fixes: sha1, so let
me know if we should just drop this for now instead.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 20:19     ` Alex Williamson
@ 2022-10-04 22:30       ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-10-04 22:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Matthew Rosato, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski

On Tue, Oct 04, 2022 at 02:19:57PM -0600, Alex Williamson wrote:

> > >> v2
> > >>  - Rebase on the vfio struct device series and the container.c series
> > >>  - Drop patches 1 & 2, we need to have working error unwind, so another
> > >>    test is not a problem
> > >>  - Fold iommu_group_remove_device() into vfio_device_remove_group() so
> > >>    that it forms a strict pairing with the two allocation functions.
> > >>  - Drop the iommu patch from the series, it needs more work and discussion
> > >> v1 https://lore.kernel.org/r/0-v1-ef00ffecea52+2cb-iommu_group_lifetime_jgg@nvidia.com
> > >>
> > >> This could probably use another quick sanity test due to all the rebasing,
> > >> Alex if you are happy let's wait for Matthew.
> > >>  
> > > 

> > > I have been re-running the same series of tests on this version (on top of vfio-next) and this still resolves the reported issue.  Thanks Jason!  > > > > Hmm, there's more going on with this patch besides the issues with -ap and -ccw.  While it does indeed resolve the crashes I had been seeing, I just now noticed that I see monotonically increasing iommu group IDs (implying we are not calling iommu_group_release as much as we should be) when running the same testscase that would previously trigger the occasional crash (host device is powered off):

Yeah, I noticed that when writing the other patch, NULLing the
iommu_group quietly broke release. It should be fixed in the followup
by moving the iommu_group_put

> I need to break my next branch anyway to correct a Fixes: sha1, so let
> me know if we should just drop this for now instead.  Thanks,

I suspect other following patches will conflict with dropping it,
maybe better to just fix it.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-04 18:22               ` Matthew Rosato
  2022-10-04 18:56                 ` Eric Farman
@ 2022-10-05 13:46                 ` Matthew Rosato
  2022-10-05 13:57                   ` Jason Gunthorpe
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Rosato @ 2022-10-05 13:46 UTC (permalink / raw)
  To: Christian Borntraeger, Jason Gunthorpe
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390

On 10/4/22 2:22 PM, Matthew Rosato wrote:
> On 10/4/22 1:36 PM, Christian Borntraeger wrote:
>>
>>
>> Am 04.10.22 um 18:28 schrieb Jason Gunthorpe:
>>> On Tue, Oct 04, 2022 at 05:44:53PM +0200, Christian Borntraeger wrote:
>>>
>>>>> Does some userspace have the group FD open when it stucks like this,
>>>>> eg what does fuser say?
>>>>
>>>> /proc/<virtnodedevd>/fd
>>>> 51480 0 dr-x------. 2 root root  0  4. Okt 17:16 .
>>>> 43593 0 dr-xr-xr-x. 9 root root  0  4. Okt 17:16 ..
>>>> 65252 0 lr-x------. 1 root root 64  4. Okt 17:42 0 -> /dev/null
>>>> 65253 0 lrwx------. 1 root root 64  4. Okt 17:42 1 -> 'socket:[51479]'
>>>> 65261 0 lrwx------. 1 root root 64  4. Okt 17:42 10 -> 'anon_inode:[eventfd]'
>>>> 65262 0 lrwx------. 1 root root 64  4. Okt 17:42 11 -> 'socket:[51485]'
>>>> 65263 0 lrwx------. 1 root root 64  4. Okt 17:42 12 -> 'socket:[51487]'
>>>> 65264 0 lrwx------. 1 root root 64  4. Okt 17:42 13 -> 'socket:[51486]'
>>>> 65265 0 lrwx------. 1 root root 64  4. Okt 17:42 14 -> 'anon_inode:[eventfd]'
>>>> 65266 0 lrwx------. 1 root root 64  4. Okt 17:42 15 -> 'socket:[60421]'
>>>> 65267 0 lrwx------. 1 root root 64  4. Okt 17:42 16 -> 'anon_inode:[eventfd]'
>>>> 65268 0 lrwx------. 1 root root 64  4. Okt 17:42 17 -> 'socket:[28008]'
>>>> 65269 0 l-wx------. 1 root root 64  4. Okt 17:42 18 -> /run/libvirt/nodedev/driver.pid
>>>> 65270 0 lrwx------. 1 root root 64  4. Okt 17:42 19 -> 'socket:[28818]'
>>>> 65254 0 lrwx------. 1 root root 64  4. Okt 17:42 2 -> 'socket:[51479]'
>>>> 65271 0 lr-x------. 1 root root 64  4. Okt 17:42 20 -> '/dev/vfio/3 (deleted)'
>>>
>>> Seems like a userspace bug to keep the group FD open after the /dev/
>>> file has been deleted :|
>>>
>>> What do you think about this?
>>>
>>> commit a54a852b1484b1605917a8f4d80691db333b25ed
>>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>>
>>>      vfio: Make the group FD disassociate from the iommu_group
>>>           Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>>      the pointer is NULL the vfio_group users promise not to touch the
>>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>>      keeping the group FD open.
>>>           SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>>      part of its iommu interface through KVM. Due to this we loose control over
>>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>>      layer via vfio_group_detach_container().
>>>           Thus, for SPAPR the group FDs must still be closed before a device can be
>>>      hot unplugged.
>>>           This fixes a userspace regression where we learned that virtnodedevd
>>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>>      and all the drivers for it unplugged.
>>>           Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>
>> Almost :-)
>>
>> drivers/vfio/vfio_main.c: In function 'vfio_file_is_group':
>> drivers/vfio/vfio_main.c:1606:47: error: expected ')' before ';' token
>>  1606 |         return (file->f_op == &vfio_group_fops;
>>       |                ~                              ^
>>       |                                               )
>> drivers/vfio/vfio_main.c:1606:48: error: expected ';' before '}' token
>>  1606 |         return (file->f_op == &vfio_group_fops;
>>       |                                                ^
>>       |                                                ;
>>  1607 | }
>>       | ~
>>
>>
>> With that fixed I get:
>>
>> ERROR: modpost: "vfio_file_is_group" [drivers/vfio/pci/vfio-pci-core.ko] undefined!
>>
>> With that worked around (m -> y)
> 
> 
> Looks like this can be solved with EXPORT_SYMBOL_GPL(vfio_file_is_group);
> 
> Also:
> 
> arch/s390/kvm/../../../virt/kvm/vfio.c:64:28: warning: ‘kvm_vfio_file_iommu_group’ defined but not used [-Wunused-function]
>    64 | static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>       |                            ^~~~~~~~~~~~~~~~~~~~~~~~~
> 
> kvm_vfio_file_iommu_group looks like it is now SPAPR-only
> 
>>
>>
>> Tested-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>
>> At least the vfio-ap part
> 
> Nope, with this s390 vfio-pci at least breaks:
> 
> [  132.943389] kernel BUG at lib/list_debug.c:53!
> [  132.943406] monitor event: 0040 ilc:2 [#1] SMP 
> [  132.943410] Modules linked in: vfio_pci kvm vfio_pci_core irqbypass vfio_virqfd vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defr
> ag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ism smc ib_uverbs ib_core uvdevice s390_trng tape_3590 tape tape_class eadm_sch vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha mlx5_core aes_s390 des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sh
> a256_s390 nvme_core sha1_s390 sha_common zfcp scsi_transport_fc pkey zcrypt rng_core autofs4 [last unloaded: vfio_pci]
> [  132.943457] CPU: 12 PID: 4991 Comm: nose2 Tainted: G        W          6.0.0-rc4 #40
> [  132.943460] Hardware name: IBM 3931 A01 782 (LPAR)
> [  132.943462] Krnl PSW : 0704c00180000000 00000000cbc90568 (__list_del_entry_valid+0xd8/0xf0)
> [  132.943469]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> [  132.943474] Krnl GPRS: 8000000000000001 0000000900000027 000000000000004e 00000000ccc1ffe0
> [  132.943477]            00000000fffeffff 00000009fc290000 0000000000000000 0000000000000080
> [  132.943480]            00000000acc86438 0000000000000000 00000000acc86420 00000000a1492800
> [  132.943483]            00000000922a0000 000003ffb9dce260 00000000cbc90564 0000038004a6b9f8
> [  132.943489] Krnl Code: 00000000cbc90558: c0200045eff3        larl    %r2,00000000cc54e53e
> [  132.943489]            00000000cbc9055e: c0e50022c7d9        brasl   %r14,00000000cc0e9510
> [  132.943489]           #00000000cbc90564: af000000            mc      0,0
> [  132.943489]           >00000000cbc90568: b9040032            lgr     %r3,%r2
> [  132.943489]            00000000cbc9056c: c0200045efd4        larl    %r2,00000000cc54e514
> [  132.943489]            00000000cbc90572: c0e50022c7cf        brasl   %r14,00000000cc0e9510
> [  132.943489]            00000000cbc90578: af000000            mc      0,0
> [  132.943489]            00000000cbc9057c: 0707                bcr     0,%r7
> [  132.943510] Call Trace:
> [  132.943512]  [<00000000cbc90568>] __list_del_entry_valid+0xd8/0xf0 
> [  132.943515] ([<00000000cbc90564>] __list_del_entry_valid+0xd4/0xf0)
> [  132.943518]  [<000003ff8011a1b8>] vfio_group_detach_container+0x88/0x170 [vfio] 
> [  132.943524]  [<000003ff801176c0>] vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
> [  132.943529]  [<000003ff804f9e54>] vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
> [  132.943535]  [<000003ff804ae1c4>] vfio_pci_remove+0x2c/0x40 [vfio_pci] 
> [  132.943539]  [<00000000cbd58c3c>] pci_device_remove+0x3c/0x98 
> [  132.943542]  [<00000000cbdbdbce>] device_release_driver_internal+0x1c6/0x288 
> [  132.943545]  [<00000000cbd4e284>] pci_stop_bus_device+0x94/0xc0 
> [  132.943549]  [<00000000cbd4e570>] pci_stop_and_remove_bus_device_locked+0x30/0x48 
> [  132.943552]  [<00000000cb55d980>] zpci_bus_remove_device+0x68/0xa8 
> [  132.943555]  [<00000000cb556e82>] zpci_deconfigure_device+0x3a/0xe0 
> [  132.943558]  [<00000000cbd65d04>] power_write_file+0x7c/0x130 
> [  132.943561]  [<00000000cb8fbc90>] kernfs_fop_write_iter+0x138/0x210 
> [  132.943565]  [<00000000cb837344>] vfs_write+0x194/0x2e0 "
> [  132.943568]  [<00000000cb8376fa>] ksys_write+0x6a/0xf8 
> [  132.943571]  [<00000000cc0f918c>] __do_syscall+0x1d4/0x200 
> [  132.943575]  [<00000000cc107e42>] system_call+0x82/0xb0 
> [  132.943577] Last Breaking-Event-Address:
> [  132.943579]  [<00000000cc0e955c>] _printk+0x4c/0x58
> [  132.943585] Kernel panic - not syncing: Fatal exception: panic_on_oops

(again, with the follow-up applied) Besides the panic above I just noticed there is also this warning that immediately precedes and is perhaps more useful.  Re: what triggers the WARN, both group->owner and group->owner_cnt are already 0:

[  375.262923] WARNING: CPU: 8 PID: 5182 at drivers/iommu/iommu.c:3211 iommu_group_release_dma_owner+0x38/0x90
[  375.262932] Modules linked in: vhost_vsock vmw_vsock_virtio_transport_common vsock vhost vhost_iotlb macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf
_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs ism smc ib_core uvdevice s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 mlx5_core libdes sha3_512_s390 sha3_256_s390
 nvme sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
[  375.262969] CPU: 8 PID: 5182 Comm: nose2 Not tainted 6.0.0-rc4 #50
[  375.262971] Hardware name: IBM 3931 A01 782 (LPAR)
[  375.262972] Krnl PSW : 0704c00180000000 00000001d1cd8b34 (iommu_group_release_dma_owner+0x3c/0x90)
[  375.262976]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  375.262979] Krnl GPRS: 8000000000000001 0000000000000000 00000000844dd058 00000000ba60c200
[  375.262981]            00000000fffeffff 00000009fc290000 0000000000000000 0000000000000080
[  375.262984]            00000000b1a6d840 00000000b1a6d858 00000000844dd058 00000000844dd000
[  375.262986]            00000000ba60c200 000003ff962ce260 00000001d1cd8b26 0000038003d0f9d8
[  375.262994] Krnl Code: 00000001d1cd8b26: e310b0c00012        lt      %r1,192(%r11)
[  375.262994]            00000001d1cd8b2c: a774000c            brc     7,00000001d1cd8b44
[  375.262994]           #00000001d1cd8b30: af000000            mc      0,0
[  375.262994]           >00000001d1cd8b34: b904002a            lgr     %r2,%r10
[  375.262994]            00000001d1cd8b38: ebaff0a00004        lmg     %r10,%r15,160(%r15)
[  375.262994]            00000001d1cd8b3e: c0f4001aa84d        brcl    15,00000001d202dbd8
[  375.262994]            00000001d1cd8b44: e310b0c80002        ltg     %r1,200(%r11)
[  375.262994]            00000001d1cd8b4a: a784fff3            brc     8,00000001d1cd8b30
[  375.263048] Call Trace:
[  375.263051]  [<00000001d1cd8b34>] iommu_group_release_dma_owner+0x3c/0x90 
[  375.263058]  [<000003ff801431c8>] vfio_group_detach_container+0x98/0x1a0 [vfio] 
[  375.263067]  [<000003ff801406c0>] vfio_device_remove_group.isra.0+0xb0/0x1e0 [vfio] 
[  375.263071]  [<000003ff80540e54>] vfio_pci_core_unregister_device+0x34/0x80 [vfio_pci_core] 
[  375.263079]  [<000003ff804f31c4>] vfio_pci_remove+0x2c/0x40 [vfio_pci] 
[  375.263082]  [<00000001d1c84c3c>] pci_device_remove+0x3c/0x98 
[  375.263085]  [<00000001d1ce9bce>] device_release_driver_internal+0x1c6/0x288 
[  375.263090]  [<00000001d1c7a284>] pci_stop_bus_device+0x94/0xc0 
[  375.263093]  [<00000001d1c7a570>] pci_stop_and_remove_bus_device_locked+0x30/0x48 
[  375.263096]  [<00000001d1489980>] zpci_bus_remove_device+0x68/0xa8 
[  375.263100]  [<00000001d1482e82>] zpci_deconfigure_device+0x3a/0xe0 
[  375.263104]  [<00000001d1c91d04>] power_write_file+0x7c/0x130 
[  375.263108]  [<00000001d1827c90>] kernfs_fop_write_iter+0x138/0x210 
[  375.263114]  [<00000001d1763344>] vfs_write+0x194/0x2e0 
[  375.263119]  [<00000001d17636fa>] ksys_write+0x6a/0xf8 
[  375.263121]  [<00000001d202518c>] __do_syscall+0x1d4/0x200 
[  375.263127]  [<00000001d2033e42>] system_call+0x82/0xb0 
[  375.263132] Last Breaking-Event-Address:
[  375.263132]  [<00000001d202f394>] mutex_lock+0x1c/0x28

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:46                 ` Matthew Rosato
@ 2022-10-05 13:57                   ` Jason Gunthorpe
  2022-10-05 14:00                     ` Christian Borntraeger
                                       ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-10-05 13:57 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:

 
> (again, with the follow-up applied) Besides the panic above I just
> noticed there is also this warning that immediately precedes and is
> perhaps more useful.  Re: what triggers the WARN, both group->owner
> and group->owner_cnt are already 0

And this is after the 2nd try that fixes the locking?

This shows that vfio_group_detach_container() is called twice (which
was my guess), hoever this looks to be impossible as both calls are
protected by 'if (group->container)' and the function NULL's
group->container and it is all under the proper lock.

My guess was that missing locking caused the two cases to race and
trigger WARN, but the locking should fix that.

So I'm at a loss, can you investigate a bit?

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:57                   ` Jason Gunthorpe
@ 2022-10-05 14:00                     ` Christian Borntraeger
  2022-10-05 14:01                     ` Jason Gunthorpe
  2022-10-05 14:01                     ` Matthew Rosato
  2 siblings, 0 replies; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-05 14:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390



Am 05.10.22 um 15:57 schrieb Jason Gunthorpe:
> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
> 
>   
>> (again, with the follow-up applied) Besides the panic above I just
>> noticed there is also this warning that immediately precedes and is
>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>> and group->owner_cnt are already 0
> 
> And this is after the 2nd try that fixes the locking?
> 
> This shows that vfio_group_detach_container() is called twice (which
> was my guess), hoever this looks to be impossible as both calls are
> protected by 'if (group->container)' and the function NULL's
> group->container and it is all under the proper lock.
> 
> My guess was that missing locking caused the two cases to race and
> trigger WARN, but the locking should fix that.
> 
> So I'm at a loss, can you investigate a bit?

So where is your 2nd version (and what was the first). I only saw one fix.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:57                   ` Jason Gunthorpe
  2022-10-05 14:00                     ` Christian Borntraeger
@ 2022-10-05 14:01                     ` Jason Gunthorpe
  2022-10-05 14:19                       ` Christian Borntraeger
  2022-10-05 14:21                       ` Matthew Rosato
  2022-10-05 14:01                     ` Matthew Rosato
  2 siblings, 2 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2022-10-05 14:01 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On Wed, Oct 05, 2022 at 10:57:28AM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
> 
>  
> > (again, with the follow-up applied) Besides the panic above I just
> > noticed there is also this warning that immediately precedes and is
> > perhaps more useful.  Re: what triggers the WARN, both group->owner
> > and group->owner_cnt are already 0
> 
> And this is after the 2nd try that fixes the locking?
> 
> This shows that vfio_group_detach_container() is called twice (which
> was my guess), hoever this looks to be impossible as both calls are
> protected by 'if (group->container)' and the function NULL's
> group->container and it is all under the proper lock.
> 
> My guess was that missing locking caused the two cases to race and
> trigger WARN, but the locking should fix that.
> 
> So I'm at a loss, can you investigate a bit?

Huh, perhaps I'm loosing my mind, but I'm sure I sent this out, but it
is not in the archive. This v2 fixes the missing locking and the rest
of the remarks.

commit f8b993620af72fa5f15bd4c1515868013c1c173d
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Oct 4 13:14:37 2022 -0300

    vfio: Make the group FD disassociate from the iommu_group
    
    Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
    the pointer is NULL the vfio_group users promise not to touch the
    iommu_group. This allows a driver to be hot unplugged while userspace is
    keeping the group FD open.
    
    SPAPR mode is excluded from this behavior because of how it wrongly hacks
    part of its iommu interface through KVM. Due to this we loose control over
    what it is doing and cannot revoke the iommu_group usage in the IOMMU
    layer via vfio_group_detach_container().
    
    Thus, for SPAPR the group FDs must still be closed before a device can be
    hot unplugged.
    
    This fixes a userspace regression where we learned that virtnodedevd
    leaves a group FD open even though the /dev/ node for it has been deleted
    and all the drivers for it unplugged.
    
    Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
    Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 59a28251bb0b97..badc9d828cac20 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
 		}
 
 		/* Ensure the FD is a vfio group FD.*/
-		if (!vfio_file_iommu_group(file)) {
+		if (!vfio_file_is_group(file)) {
 			fput(file);
 			ret = -EINVAL;
 			break;
diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
index 4d2de02f2ced6e..4e10a281420e66 100644
--- a/drivers/vfio/vfio.h
+++ b/drivers/vfio/vfio.h
@@ -59,6 +59,7 @@ struct vfio_group {
 	struct mutex			group_lock;
 	struct kvm			*kvm;
 	struct file			*opened_file;
+	bool				preserve_iommu_group;
 	struct swait_queue_head		opened_file_wait;
 	struct blocking_notifier_head	notifier;
 };
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 9b1e5fd5f7b73c..13d22bd84afc47 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
+	/*
+	 * group->iommu_group from the vfio.group_list cannot be NULL
+	 * under the vfio.group_lock.
+	 */
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
 			refcount_inc(&group->drivers);
@@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
 
 	mutex_destroy(&group->device_lock);
 	mutex_destroy(&group->group_lock);
-	iommu_group_put(group->iommu_group);
+	WARN_ON(group->iommu_group);
 	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
 	kfree(group);
 }
@@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 static void vfio_device_remove_group(struct vfio_device *device)
 {
 	struct vfio_group *group = device->group;
+	struct iommu_group *iommu_group;
 
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
 		iommu_group_remove_device(device->dev);
@@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	 */
 	cdev_device_del(&group->cdev, &group->dev);
 
+	mutex_lock(&group->group_lock);
+	/*
+	 * These data structures all have paired operations that can only be
+	 * undone when the caller holds a live reference on the device. Since
+	 * all pairs must be undone these WARN_ON's indicate some caller did not
+	 * properly hold the group reference.l.
+	 */
+	WARN_ON(!list_empty(&group->device_list));
+	WARN_ON(group->notifier.head);
+
+	/*
+	 * Revoke all users of group->iommu_group. At this point we know there
+	 * are no devices active because we are unplugging the last one. Setting
+	 * iommu_group to NULL blocks all new users.
+	 */
+	if (group->container)
+		vfio_group_detach_container(group);
+	iommu_group = group->iommu_group;
+	group->iommu_group = NULL;
+	mutex_unlock(&group->group_lock);
+
 	/*
-	 * Before we allow the last driver in the group to be unplugged the
-	 * group must be sanitized so nothing else is or can reference it. This
-	 * is because the group->iommu_group pointer should only be used so long
-	 * as a device driver is attached to a device in the group.
+	 * Normally we can set the iommu_group to NULL above and that will
+	 * prevent any users from touching it. However, the SPAPR kvm path takes
+	 * a reference to the iommu_group and keeps using it in arch code out
+	 * side our control. So if this path is triggred we have no choice but
+	 * to wait for the group FD to be closed to be sure everyone has stopped
+	 * touching the group.
 	 */
-	while (group->opened_file) {
+	while (group->preserve_iommu_group && group->opened_file) {
 		mutex_unlock(&vfio.group_lock);
 		swait_event_idle_exclusive(group->opened_file_wait,
 					   !group->opened_file);
@@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
 	}
 	mutex_unlock(&vfio.group_lock);
 
-	/*
-	 * These data structures all have paired operations that can only be
-	 * undone when the caller holds a live reference on the group. Since all
-	 * pairs must be undone these WARN_ON's indicate some caller did not
-	 * properly hold the group reference.
-	 */
-	WARN_ON(!list_empty(&group->device_list));
-	WARN_ON(group->container || group->container_users);
-	WARN_ON(group->notifier.head);
-	group->iommu_group = NULL;
-
+	iommu_group_put(iommu_group);
 	put_device(&group->dev);
 }
 
@@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
 
 	existing_device = vfio_group_get_device(group, device->dev);
 	if (existing_device) {
+		/*
+		 * group->iommu_group is non-NULL because we hold the drivers
+		 * refcount.
+		 */
 		dev_WARN(device->dev, "Device already exists on group %d\n",
 			 iommu_group_id(group->iommu_group));
 		vfio_device_put_registration(existing_device);
@@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
 		ret = -EINVAL;
 		goto out_unlock;
 	}
+	if (!group->iommu_group) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
 	container = vfio_container_from_file(f.file);
 	ret = -EINVAL;
 	if (container) {
@@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
 	status.flags = 0;
 
 	mutex_lock(&group->group_lock);
+	if (!group->iommu_group) {
+		mutex_unlock(&group->group_lock);
+		return -ENODEV;
+	}
+
 	if (group->container)
 		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
 				VFIO_GROUP_FLAGS_VIABLE;
@@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
 	filep->private_data = NULL;
 
 	mutex_lock(&group->group_lock);
-	/*
-	 * Device FDs hold a group file reference, therefore the group release
-	 * is only called when there are no open devices.
-	 */
-	WARN_ON(group->notifier.head);
-	if (group->container)
-		vfio_group_detach_container(group);
 	group->opened_file = NULL;
 	mutex_unlock(&group->group_lock);
 	swake_up_one(&group->opened_file_wait);
@@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
  * @file: VFIO group file
  *
  * The returned iommu_group is valid as long as a ref is held on the file.
+ * This function is deprecated, only the SPAPR path in kvm should call it.
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file)
 {
 	struct vfio_group *group = file->private_data;
+	struct iommu_group *iommu_group = NULL;
+
+	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
+		return NULL;
 
 	if (file->f_op != &vfio_group_fops)
 		return NULL;
-	return group->iommu_group;
+
+	mutex_lock(&vfio.group_lock);
+	mutex_lock(&group->group_lock);
+	if (group->iommu_group) {
+		iommu_group = group->iommu_group;
+		group->preserve_iommu_group = true;
+	}
+	mutex_unlock(&group->group_lock);
+	mutex_unlock(&vfio.group_lock);
+	return iommu_group;
 }
 EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
 
+/**
+ * vfio_file_is_group - True if the file is usable with VFIO aPIS
+ * @file: VFIO group file
+ */
+bool vfio_file_is_group(struct file *file)
+{
+	return file->f_op == &vfio_group_fops;
+}
+EXPORT_SYMBOL_GPL(vfio_file_is_group);
+
 /**
  * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
  *        is always CPU cache coherent
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 73bcb92179a224..bd9faaab85de18 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
  * External user API
  */
 struct iommu_group *vfio_file_iommu_group(struct file *file);
+bool vfio_file_is_group(struct file *file);
 bool vfio_file_enforced_coherent(struct file *file);
 void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
 bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index ce1b01d02c5197..54aec3b0559c70 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
 	return ret;
 }
 
+static bool kvm_vfio_file_is_group(struct file *file)
+{
+	bool (*fn)(struct file *file);
+	bool ret;
+
+	fn = symbol_get(vfio_file_is_group);
+	if (!fn)
+		return false;
+
+	ret = fn(file);
+
+	symbol_put(vfio_file_is_group);
+
+	return ret;
+}
+
+#ifdef CONFIG_SPAPR_TCE_IOMMU
 static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 {
 	struct iommu_group *(*fn)(struct file *file);
@@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 	return ret;
 }
 
-#ifdef CONFIG_SPAPR_TCE_IOMMU
 static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
 					     struct kvm_vfio_group *kvg)
 {
@@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
 		return -EBADF;
 
 	/* Ensure the FD is a vfio group FD.*/
-	if (!kvm_vfio_file_iommu_group(filp)) {
+	if (!kvm_vfio_file_is_group(filp)) {
 		ret = -EINVAL;
 		goto err_fput;
 	}

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 13:57                   ` Jason Gunthorpe
  2022-10-05 14:00                     ` Christian Borntraeger
  2022-10-05 14:01                     ` Jason Gunthorpe
@ 2022-10-05 14:01                     ` Matthew Rosato
  2 siblings, 0 replies; 30+ messages in thread
From: Matthew Rosato @ 2022-10-05 14:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On 10/5/22 9:57 AM, Jason Gunthorpe wrote:
> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
> 
>  
>> (again, with the follow-up applied) Besides the panic above I just
>> noticed there is also this warning that immediately precedes and is
>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>> and group->owner_cnt are already 0
> 
> And this is after the 2nd try that fixes the locking?

Maybe I missed that email?  I've only seen one version, happy to try another.  Please resend?

> 
> This shows that vfio_group_detach_container() is called twice (which
> was my guess), hoever this looks to be impossible as both calls are
> protected by 'if (group->container)' and the function NULL's
> group->container and it is all under the proper lock.
> 
> My guess was that missing locking caused the two cases to race and
> trigger WARN, but the locking should fix that.
> 
> So I'm at a loss, can you investigate a bit?
> 
> Jason


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:01                     ` Jason Gunthorpe
@ 2022-10-05 14:19                       ` Christian Borntraeger
  2022-10-06 11:55                         ` Christian Borntraeger
  2022-10-05 14:21                       ` Matthew Rosato
  1 sibling, 1 reply; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-05 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Matthew Rosato
  Cc: Alex Williamson, Tony Krowiak, Jason J . Herne, Marc Hartmayer,
	Eric Farman, Cornelia Huck, kvm, Qian Cai, Joerg Roedel,
	Marek Szyprowski, linux-s390



Am 05.10.22 um 16:01 schrieb Jason Gunthorpe:
[..]
> commit f8b993620af72fa5f15bd4c1515868013c1c173d
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>      vfio: Make the group FD disassociate from the iommu_group
>      
>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>      the pointer is NULL the vfio_group users promise not to touch the
>      iommu_group. This allows a driver to be hot unplugged while userspace is
>      keeping the group FD open.
>      
>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>      part of its iommu interface through KVM. Due to this we loose control over
>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>      layer via vfio_group_detach_container().
>      
>      Thus, for SPAPR the group FDs must still be closed before a device can be
>      hot unplugged.
>      
>      This fixes a userspace regression where we learned that virtnodedevd
>      leaves a group FD open even though the /dev/ node for it has been deleted
>      and all the drivers for it unplugged.
>      
>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Looks better now (I also did a quick check with vfio-pci on s390)

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 59a28251bb0b97..badc9d828cac20 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>   		}
>   
>   		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_iommu_group(file)) {
> +		if (!vfio_file_is_group(file)) {
>   			fput(file);
>   			ret = -EINVAL;
>   			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 4d2de02f2ced6e..4e10a281420e66 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -59,6 +59,7 @@ struct vfio_group {
>   	struct mutex			group_lock;
>   	struct kvm			*kvm;
>   	struct file			*opened_file;
> +	bool				preserve_iommu_group;
>   	struct swait_queue_head		opened_file_wait;
>   	struct blocking_notifier_head	notifier;
>   };
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9b1e5fd5f7b73c..13d22bd84afc47 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>   {
>   	struct vfio_group *group;
>   
> +	/*
> +	 * group->iommu_group from the vfio.group_list cannot be NULL
> +	 * under the vfio.group_lock.
> +	 */
>   	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>   		if (group->iommu_group == iommu_group) {
>   			refcount_inc(&group->drivers);
> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>   
>   	mutex_destroy(&group->device_lock);
>   	mutex_destroy(&group->group_lock);
> -	iommu_group_put(group->iommu_group);
> +	WARN_ON(group->iommu_group);
>   	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>   	kfree(group);
>   }
> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>   static void vfio_device_remove_group(struct vfio_device *device)
>   {
>   	struct vfio_group *group = device->group;
> +	struct iommu_group *iommu_group;
>   
>   	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>   		iommu_group_remove_device(device->dev);
> @@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	 */
>   	cdev_device_del(&group->cdev, &group->dev);
>   
> +	mutex_lock(&group->group_lock);
> +	/*
> +	 * These data structures all have paired operations that can only be
> +	 * undone when the caller holds a live reference on the device. Since
> +	 * all pairs must be undone these WARN_ON's indicate some caller did not
> +	 * properly hold the group reference.l.
> +	 */
> +	WARN_ON(!list_empty(&group->device_list));
> +	WARN_ON(group->notifier.head);
> +
> +	/*
> +	 * Revoke all users of group->iommu_group. At this point we know there
> +	 * are no devices active because we are unplugging the last one. Setting
> +	 * iommu_group to NULL blocks all new users.
> +	 */
> +	if (group->container)
> +		vfio_group_detach_container(group);
> +	iommu_group = group->iommu_group;
> +	group->iommu_group = NULL;
> +	mutex_unlock(&group->group_lock);
> +
>   	/*
> -	 * Before we allow the last driver in the group to be unplugged the
> -	 * group must be sanitized so nothing else is or can reference it. This
> -	 * is because the group->iommu_group pointer should only be used so long
> -	 * as a device driver is attached to a device in the group.
> +	 * Normally we can set the iommu_group to NULL above and that will
> +	 * prevent any users from touching it. However, the SPAPR kvm path takes
> +	 * a reference to the iommu_group and keeps using it in arch code out
> +	 * side our control. So if this path is triggred we have no choice but
> +	 * to wait for the group FD to be closed to be sure everyone has stopped
> +	 * touching the group.
>   	 */
> -	while (group->opened_file) {
> +	while (group->preserve_iommu_group && group->opened_file) {
>   		mutex_unlock(&vfio.group_lock);
>   		swait_event_idle_exclusive(group->opened_file_wait,
>   					   !group->opened_file);
> @@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
>   	}
>   	mutex_unlock(&vfio.group_lock);
>   
> -	/*
> -	 * These data structures all have paired operations that can only be
> -	 * undone when the caller holds a live reference on the group. Since all
> -	 * pairs must be undone these WARN_ON's indicate some caller did not
> -	 * properly hold the group reference.
> -	 */
> -	WARN_ON(!list_empty(&group->device_list));
> -	WARN_ON(group->container || group->container_users);
> -	WARN_ON(group->notifier.head);
> -	group->iommu_group = NULL;
> -
> +	iommu_group_put(iommu_group);
>   	put_device(&group->dev);
>   }
>   
> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>   
>   	existing_device = vfio_group_get_device(group, device->dev);
>   	if (existing_device) {
> +		/*
> +		 * group->iommu_group is non-NULL because we hold the drivers
> +		 * refcount.
> +		 */
>   		dev_WARN(device->dev, "Device already exists on group %d\n",
>   			 iommu_group_id(group->iommu_group));
>   		vfio_device_put_registration(existing_device);
> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>   		ret = -EINVAL;
>   		goto out_unlock;
>   	}
> +	if (!group->iommu_group) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
>   	container = vfio_container_from_file(f.file);
>   	ret = -EINVAL;
>   	if (container) {
> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>   	status.flags = 0;
>   
>   	mutex_lock(&group->group_lock);
> +	if (!group->iommu_group) {
> +		mutex_unlock(&group->group_lock);
> +		return -ENODEV;
> +	}
> +
>   	if (group->container)
>   		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>   				VFIO_GROUP_FLAGS_VIABLE;
> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>   	filep->private_data = NULL;
>   
>   	mutex_lock(&group->group_lock);
> -	/*
> -	 * Device FDs hold a group file reference, therefore the group release
> -	 * is only called when there are no open devices.
> -	 */
> -	WARN_ON(group->notifier.head);
> -	if (group->container)
> -		vfio_group_detach_container(group);
>   	group->opened_file = NULL;
>   	mutex_unlock(&group->group_lock);
>   	swake_up_one(&group->opened_file_wait);
> @@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
>    * @file: VFIO group file
>    *
>    * The returned iommu_group is valid as long as a ref is held on the file.
> + * This function is deprecated, only the SPAPR path in kvm should call it.
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file)
>   {
>   	struct vfio_group *group = file->private_data;
> +	struct iommu_group *iommu_group = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> +		return NULL;
>   
>   	if (file->f_op != &vfio_group_fops)
>   		return NULL;
> -	return group->iommu_group;
> +
> +	mutex_lock(&vfio.group_lock);
> +	mutex_lock(&group->group_lock);
> +	if (group->iommu_group) {
> +		iommu_group = group->iommu_group;
> +		group->preserve_iommu_group = true;
> +	}
> +	mutex_unlock(&group->group_lock);
> +	mutex_unlock(&vfio.group_lock);
> +	return iommu_group;
>   }
>   EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>   
> +/**
> + * vfio_file_is_group - True if the file is usable with VFIO aPIS
> + * @file: VFIO group file
> + */
> +bool vfio_file_is_group(struct file *file)
> +{
> +	return file->f_op == &vfio_group_fops;
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_is_group);
> +
>   /**
>    * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>    *        is always CPU cache coherent
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 73bcb92179a224..bd9faaab85de18 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>    * External user API
>    */
>   struct iommu_group *vfio_file_iommu_group(struct file *file);
> +bool vfio_file_is_group(struct file *file);
>   bool vfio_file_enforced_coherent(struct file *file);
>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>   bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index ce1b01d02c5197..54aec3b0559c70 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
>   	return ret;
>   }
>   
> +static bool kvm_vfio_file_is_group(struct file *file)
> +{
> +	bool (*fn)(struct file *file);
> +	bool ret;
> +
> +	fn = symbol_get(vfio_file_is_group);
> +	if (!fn)
> +		return false;
> +
> +	ret = fn(file);
> +
> +	symbol_put(vfio_file_is_group);
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>   static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>   {
>   	struct iommu_group *(*fn)(struct file *file);
> @@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>   	return ret;
>   }
>   
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>   static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>   					     struct kvm_vfio_group *kvg)
>   {
> @@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>   		return -EBADF;
>   
>   	/* Ensure the FD is a vfio group FD.*/
> -	if (!kvm_vfio_file_iommu_group(filp)) {
> +	if (!kvm_vfio_file_is_group(filp)) {
>   		ret = -EINVAL;
>   		goto err_fput;
>   	}

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:01                     ` Jason Gunthorpe
  2022-10-05 14:19                       ` Christian Borntraeger
@ 2022-10-05 14:21                       ` Matthew Rosato
  2022-10-05 15:40                         ` Matthew Rosato
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Rosato @ 2022-10-05 14:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On 10/5/22 10:01 AM, Jason Gunthorpe wrote:
> On Wed, Oct 05, 2022 at 10:57:28AM -0300, Jason Gunthorpe wrote:
>> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
>>
>>  
>>> (again, with the follow-up applied) Besides the panic above I just
>>> noticed there is also this warning that immediately precedes and is
>>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>>> and group->owner_cnt are already 0
>>
>> And this is after the 2nd try that fixes the locking?
>>
>> This shows that vfio_group_detach_container() is called twice (which
>> was my guess), hoever this looks to be impossible as both calls are
>> protected by 'if (group->container)' and the function NULL's
>> group->container and it is all under the proper lock.
>>
>> My guess was that missing locking caused the two cases to race and
>> trigger WARN, but the locking should fix that.
>>
>> So I'm at a loss, can you investigate a bit?
> 
> Huh, perhaps I'm loosing my mind, but I'm sure I sent this out, but it
> is not in the archive. This v2 fixes the missing locking and the rest
> of the remarks.

Ah, here we go.  OK, initial testing with vfio-pci on this version and I note that

1) the warning/crash is gone
2) the iommu group ID no longer increments

I next will take it through the longer series of tests that would crash before 'vfio: Follow a strict lifetime for struct iommu_group' but this looks good so far.

> 
> commit f8b993620af72fa5f15bd4c1515868013c1c173d
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Oct 4 13:14:37 2022 -0300
> 
>     vfio: Make the group FD disassociate from the iommu_group
>     
>     Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>     the pointer is NULL the vfio_group users promise not to touch the
>     iommu_group. This allows a driver to be hot unplugged while userspace is
>     keeping the group FD open.
>     
>     SPAPR mode is excluded from this behavior because of how it wrongly hacks
>     part of its iommu interface through KVM. Due to this we loose control over
>     what it is doing and cannot revoke the iommu_group usage in the IOMMU
>     layer via vfio_group_detach_container().
>     
>     Thus, for SPAPR the group FDs must still be closed before a device can be
>     hot unplugged.
>     
>     This fixes a userspace regression where we learned that virtnodedevd
>     leaves a group FD open even though the /dev/ node for it has been deleted
>     and all the drivers for it unplugged.
>     
>     Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>     Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>     Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 59a28251bb0b97..badc9d828cac20 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>  		}
>  
>  		/* Ensure the FD is a vfio group FD.*/
> -		if (!vfio_file_iommu_group(file)) {
> +		if (!vfio_file_is_group(file)) {
>  			fput(file);
>  			ret = -EINVAL;
>  			break;
> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
> index 4d2de02f2ced6e..4e10a281420e66 100644
> --- a/drivers/vfio/vfio.h
> +++ b/drivers/vfio/vfio.h
> @@ -59,6 +59,7 @@ struct vfio_group {
>  	struct mutex			group_lock;
>  	struct kvm			*kvm;
>  	struct file			*opened_file;
> +	bool				preserve_iommu_group;
>  	struct swait_queue_head		opened_file_wait;
>  	struct blocking_notifier_head	notifier;
>  };
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 9b1e5fd5f7b73c..13d22bd84afc47 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>  {
>  	struct vfio_group *group;
>  
> +	/*
> +	 * group->iommu_group from the vfio.group_list cannot be NULL
> +	 * under the vfio.group_lock.
> +	 */
>  	list_for_each_entry(group, &vfio.group_list, vfio_next) {
>  		if (group->iommu_group == iommu_group) {
>  			refcount_inc(&group->drivers);
> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>  
>  	mutex_destroy(&group->device_lock);
>  	mutex_destroy(&group->group_lock);
> -	iommu_group_put(group->iommu_group);
> +	WARN_ON(group->iommu_group);
>  	ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>  	kfree(group);
>  }
> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>  static void vfio_device_remove_group(struct vfio_device *device)
>  {
>  	struct vfio_group *group = device->group;
> +	struct iommu_group *iommu_group;
>  
>  	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>  		iommu_group_remove_device(device->dev);
> @@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
>  	 */
>  	cdev_device_del(&group->cdev, &group->dev);
>  
> +	mutex_lock(&group->group_lock);
> +	/*
> +	 * These data structures all have paired operations that can only be
> +	 * undone when the caller holds a live reference on the device. Since
> +	 * all pairs must be undone these WARN_ON's indicate some caller did not
> +	 * properly hold the group reference.l.
> +	 */
> +	WARN_ON(!list_empty(&group->device_list));
> +	WARN_ON(group->notifier.head);
> +
> +	/*
> +	 * Revoke all users of group->iommu_group. At this point we know there
> +	 * are no devices active because we are unplugging the last one. Setting
> +	 * iommu_group to NULL blocks all new users.
> +	 */
> +	if (group->container)
> +		vfio_group_detach_container(group);
> +	iommu_group = group->iommu_group;
> +	group->iommu_group = NULL;
> +	mutex_unlock(&group->group_lock);
> +
>  	/*
> -	 * Before we allow the last driver in the group to be unplugged the
> -	 * group must be sanitized so nothing else is or can reference it. This
> -	 * is because the group->iommu_group pointer should only be used so long
> -	 * as a device driver is attached to a device in the group.
> +	 * Normally we can set the iommu_group to NULL above and that will
> +	 * prevent any users from touching it. However, the SPAPR kvm path takes
> +	 * a reference to the iommu_group and keeps using it in arch code out
> +	 * side our control. So if this path is triggred we have no choice but
> +	 * to wait for the group FD to be closed to be sure everyone has stopped
> +	 * touching the group.
>  	 */
> -	while (group->opened_file) {
> +	while (group->preserve_iommu_group && group->opened_file) {
>  		mutex_unlock(&vfio.group_lock);
>  		swait_event_idle_exclusive(group->opened_file_wait,
>  					   !group->opened_file);
> @@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
>  	}
>  	mutex_unlock(&vfio.group_lock);
>  
> -	/*
> -	 * These data structures all have paired operations that can only be
> -	 * undone when the caller holds a live reference on the group. Since all
> -	 * pairs must be undone these WARN_ON's indicate some caller did not
> -	 * properly hold the group reference.
> -	 */
> -	WARN_ON(!list_empty(&group->device_list));
> -	WARN_ON(group->container || group->container_users);
> -	WARN_ON(group->notifier.head);
> -	group->iommu_group = NULL;
> -
> +	iommu_group_put(iommu_group);
>  	put_device(&group->dev);
>  }
>  
> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>  
>  	existing_device = vfio_group_get_device(group, device->dev);
>  	if (existing_device) {
> +		/*
> +		 * group->iommu_group is non-NULL because we hold the drivers
> +		 * refcount.
> +		 */
>  		dev_WARN(device->dev, "Device already exists on group %d\n",
>  			 iommu_group_id(group->iommu_group));
>  		vfio_device_put_registration(existing_device);
> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>  		ret = -EINVAL;
>  		goto out_unlock;
>  	}
> +	if (!group->iommu_group) {
> +		ret = -ENODEV;
> +		goto out_unlock;
> +	}
> +
>  	container = vfio_container_from_file(f.file);
>  	ret = -EINVAL;
>  	if (container) {
> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>  	status.flags = 0;
>  
>  	mutex_lock(&group->group_lock);
> +	if (!group->iommu_group) {
> +		mutex_unlock(&group->group_lock);
> +		return -ENODEV;
> +	}
> +
>  	if (group->container)
>  		status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>  				VFIO_GROUP_FLAGS_VIABLE;
> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>  	filep->private_data = NULL;
>  
>  	mutex_lock(&group->group_lock);
> -	/*
> -	 * Device FDs hold a group file reference, therefore the group release
> -	 * is only called when there are no open devices.
> -	 */
> -	WARN_ON(group->notifier.head);
> -	if (group->container)
> -		vfio_group_detach_container(group);
>  	group->opened_file = NULL;
>  	mutex_unlock(&group->group_lock);
>  	swake_up_one(&group->opened_file_wait);
> @@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
>   * @file: VFIO group file
>   *
>   * The returned iommu_group is valid as long as a ref is held on the file.
> + * This function is deprecated, only the SPAPR path in kvm should call it.
>   */
>  struct iommu_group *vfio_file_iommu_group(struct file *file)
>  {
>  	struct vfio_group *group = file->private_data;
> +	struct iommu_group *iommu_group = NULL;
> +
> +	if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
> +		return NULL;
>  
>  	if (file->f_op != &vfio_group_fops)
>  		return NULL;
> -	return group->iommu_group;
> +
> +	mutex_lock(&vfio.group_lock);
> +	mutex_lock(&group->group_lock);
> +	if (group->iommu_group) {
> +		iommu_group = group->iommu_group;
> +		group->preserve_iommu_group = true;
> +	}
> +	mutex_unlock(&group->group_lock);
> +	mutex_unlock(&vfio.group_lock);
> +	return iommu_group;
>  }
>  EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>  
> +/**
> + * vfio_file_is_group - True if the file is usable with VFIO aPIS
> + * @file: VFIO group file
> + */
> +bool vfio_file_is_group(struct file *file)
> +{
> +	return file->f_op == &vfio_group_fops;
> +}
> +EXPORT_SYMBOL_GPL(vfio_file_is_group);
> +
>  /**
>   * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>   *        is always CPU cache coherent
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 73bcb92179a224..bd9faaab85de18 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>   * External user API
>   */
>  struct iommu_group *vfio_file_iommu_group(struct file *file);
> +bool vfio_file_is_group(struct file *file);
>  bool vfio_file_enforced_coherent(struct file *file);
>  void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>  bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index ce1b01d02c5197..54aec3b0559c70 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> @@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
>  	return ret;
>  }
>  
> +static bool kvm_vfio_file_is_group(struct file *file)
> +{
> +	bool (*fn)(struct file *file);
> +	bool ret;
> +
> +	fn = symbol_get(vfio_file_is_group);
> +	if (!fn)
> +		return false;
> +
> +	ret = fn(file);
> +
> +	symbol_put(vfio_file_is_group);
> +
> +	return ret;
> +}
> +
> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>  static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>  {
>  	struct iommu_group *(*fn)(struct file *file);
> @@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>  	return ret;
>  }
>  
> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>  static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>  					     struct kvm_vfio_group *kvg)
>  {
> @@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>  		return -EBADF;
>  
>  	/* Ensure the FD is a vfio group FD.*/
> -	if (!kvm_vfio_file_iommu_group(filp)) {
> +	if (!kvm_vfio_file_is_group(filp)) {
>  		ret = -EINVAL;
>  		goto err_fput;
>  	}


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:21                       ` Matthew Rosato
@ 2022-10-05 15:40                         ` Matthew Rosato
  0 siblings, 0 replies; 30+ messages in thread
From: Matthew Rosato @ 2022-10-05 15:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Borntraeger, Alex Williamson, Tony Krowiak,
	Jason J . Herne, Marc Hartmayer, Eric Farman, Cornelia Huck, kvm,
	Qian Cai, Joerg Roedel, Marek Szyprowski, linux-s390

On 10/5/22 10:21 AM, Matthew Rosato wrote:
> On 10/5/22 10:01 AM, Jason Gunthorpe wrote:
>> On Wed, Oct 05, 2022 at 10:57:28AM -0300, Jason Gunthorpe wrote:
>>> On Wed, Oct 05, 2022 at 09:46:45AM -0400, Matthew Rosato wrote:
>>>
>>>  
>>>> (again, with the follow-up applied) Besides the panic above I just
>>>> noticed there is also this warning that immediately precedes and is
>>>> perhaps more useful.  Re: what triggers the WARN, both group->owner
>>>> and group->owner_cnt are already 0
>>>
>>> And this is after the 2nd try that fixes the locking?
>>>
>>> This shows that vfio_group_detach_container() is called twice (which
>>> was my guess), hoever this looks to be impossible as both calls are
>>> protected by 'if (group->container)' and the function NULL's
>>> group->container and it is all under the proper lock.
>>>
>>> My guess was that missing locking caused the two cases to race and
>>> trigger WARN, but the locking should fix that.
>>>
>>> So I'm at a loss, can you investigate a bit?
>>
>> Huh, perhaps I'm loosing my mind, but I'm sure I sent this out, but it
>> is not in the archive. This v2 fixes the missing locking and the rest
>> of the remarks.
> 
> Ah, here we go.  OK, initial testing with vfio-pci on this version and I note that
> 
> 1) the warning/crash is gone
> 2) the iommu group ID no longer increments
> 
> I next will take it through the longer series of tests that would crash before 'vfio: Follow a strict lifetime for struct iommu_group' but this looks good so far.
> 

OK, this also looks good - thanks!  Besides the vfio-pci testing on s390 I also ran some brief tests against both vfio-ccw and vfio-ap.

Tested-by: Matthew Rosato <mjrosato@linux.ibm.com>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group
  2022-10-05 14:19                       ` Christian Borntraeger
@ 2022-10-06 11:55                         ` Christian Borntraeger
  0 siblings, 0 replies; 30+ messages in thread
From: Christian Borntraeger @ 2022-10-06 11:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tony Krowiak, Jason J . Herne, Marc Hartmayer, Eric Farman,
	Cornelia Huck, kvm, Qian Cai, Joerg Roedel, Marek Szyprowski,
	linux-s390, Matthew Rosato, Jason Gunthorpe

> Am 05.10.22 um 16:01 schrieb Jason Gunthorpe:
> [..]
>> commit f8b993620af72fa5f15bd4c1515868013c1c173d
>> Author: Jason Gunthorpe <jgg@ziepe.ca>
>> Date:   Tue Oct 4 13:14:37 2022 -0300
>>
>>      vfio: Make the group FD disassociate from the iommu_group
>>      Allow the vfio_group struct to exist with a NULL iommu_group pointer. When
>>      the pointer is NULL the vfio_group users promise not to touch the
>>      iommu_group. This allows a driver to be hot unplugged while userspace is
>>      keeping the group FD open.
>>      SPAPR mode is excluded from this behavior because of how it wrongly hacks
>>      part of its iommu interface through KVM. Due to this we loose control over
>>      what it is doing and cannot revoke the iommu_group usage in the IOMMU
>>      layer via vfio_group_detach_container().
>>      Thus, for SPAPR the group FDs must still be closed before a device can be
>>      hot unplugged.
>>      This fixes a userspace regression where we learned that virtnodedevd
>>      leaves a group FD open even though the /dev/ node for it has been deleted
>>      and all the drivers for it unplugged.
>>      Fixes: ca5f21b25749 ("vfio: Follow a strict lifetime for struct iommu_group")
>>      Reported-by: Christian Borntraeger <borntraeger@linux.ibm.com>
>>      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Looks better now (I also did a quick check with vfio-pci on s390)
> 
> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>

Alex I think it would be good to schedule this for vfio-next as well. Do you want Jason
to send this patch as a proper patch outline of this mail thread?

> 
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 59a28251bb0b97..badc9d828cac20 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -1313,7 +1313,7 @@ static int vfio_pci_ioctl_pci_hot_reset(struct vfio_pci_core_device *vdev,
>>           }
>>           /* Ensure the FD is a vfio group FD.*/
>> -        if (!vfio_file_iommu_group(file)) {
>> +        if (!vfio_file_is_group(file)) {
>>               fput(file);
>>               ret = -EINVAL;
>>               break;
>> diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h
>> index 4d2de02f2ced6e..4e10a281420e66 100644
>> --- a/drivers/vfio/vfio.h
>> +++ b/drivers/vfio/vfio.h
>> @@ -59,6 +59,7 @@ struct vfio_group {
>>       struct mutex            group_lock;
>>       struct kvm            *kvm;
>>       struct file            *opened_file;
>> +    bool                preserve_iommu_group;
>>       struct swait_queue_head        opened_file_wait;
>>       struct blocking_notifier_head    notifier;
>>   };
>> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
>> index 9b1e5fd5f7b73c..13d22bd84afc47 100644
>> --- a/drivers/vfio/vfio_main.c
>> +++ b/drivers/vfio/vfio_main.c
>> @@ -133,6 +133,10 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>>   {
>>       struct vfio_group *group;
>> +    /*
>> +     * group->iommu_group from the vfio.group_list cannot be NULL
>> +     * under the vfio.group_lock.
>> +     */
>>       list_for_each_entry(group, &vfio.group_list, vfio_next) {
>>           if (group->iommu_group == iommu_group) {
>>               refcount_inc(&group->drivers);
>> @@ -159,7 +163,7 @@ static void vfio_group_release(struct device *dev)
>>       mutex_destroy(&group->device_lock);
>>       mutex_destroy(&group->group_lock);
>> -    iommu_group_put(group->iommu_group);
>> +    WARN_ON(group->iommu_group);
>>       ida_free(&vfio.group_ida, MINOR(group->dev.devt));
>>       kfree(group);
>>   }
>> @@ -248,6 +252,7 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>>   static void vfio_device_remove_group(struct vfio_device *device)
>>   {
>>       struct vfio_group *group = device->group;
>> +    struct iommu_group *iommu_group;
>>       if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
>>           iommu_group_remove_device(device->dev);
>> @@ -265,13 +270,36 @@ static void vfio_device_remove_group(struct vfio_device *device)
>>        */
>>       cdev_device_del(&group->cdev, &group->dev);
>> +    mutex_lock(&group->group_lock);
>> +    /*
>> +     * These data structures all have paired operations that can only be
>> +     * undone when the caller holds a live reference on the device. Since
>> +     * all pairs must be undone these WARN_ON's indicate some caller did not
>> +     * properly hold the group reference.l.
>> +     */
>> +    WARN_ON(!list_empty(&group->device_list));
>> +    WARN_ON(group->notifier.head);
>> +
>> +    /*
>> +     * Revoke all users of group->iommu_group. At this point we know there
>> +     * are no devices active because we are unplugging the last one. Setting
>> +     * iommu_group to NULL blocks all new users.
>> +     */
>> +    if (group->container)
>> +        vfio_group_detach_container(group);
>> +    iommu_group = group->iommu_group;
>> +    group->iommu_group = NULL;
>> +    mutex_unlock(&group->group_lock);
>> +
>>       /*
>> -     * Before we allow the last driver in the group to be unplugged the
>> -     * group must be sanitized so nothing else is or can reference it. This
>> -     * is because the group->iommu_group pointer should only be used so long
>> -     * as a device driver is attached to a device in the group.
>> +     * Normally we can set the iommu_group to NULL above and that will
>> +     * prevent any users from touching it. However, the SPAPR kvm path takes
>> +     * a reference to the iommu_group and keeps using it in arch code out
>> +     * side our control. So if this path is triggred we have no choice but
>> +     * to wait for the group FD to be closed to be sure everyone has stopped
>> +     * touching the group.
>>        */
>> -    while (group->opened_file) {
>> +    while (group->preserve_iommu_group && group->opened_file) {
>>           mutex_unlock(&vfio.group_lock);
>>           swait_event_idle_exclusive(group->opened_file_wait,
>>                          !group->opened_file);
>> @@ -279,17 +307,7 @@ static void vfio_device_remove_group(struct vfio_device *device)
>>       }
>>       mutex_unlock(&vfio.group_lock);
>> -    /*
>> -     * These data structures all have paired operations that can only be
>> -     * undone when the caller holds a live reference on the group. Since all
>> -     * pairs must be undone these WARN_ON's indicate some caller did not
>> -     * properly hold the group reference.
>> -     */
>> -    WARN_ON(!list_empty(&group->device_list));
>> -    WARN_ON(group->container || group->container_users);
>> -    WARN_ON(group->notifier.head);
>> -    group->iommu_group = NULL;
>> -
>> +    iommu_group_put(iommu_group);
>>       put_device(&group->dev);
>>   }
>> @@ -531,6 +549,10 @@ static int __vfio_register_dev(struct vfio_device *device,
>>       existing_device = vfio_group_get_device(group, device->dev);
>>       if (existing_device) {
>> +        /*
>> +         * group->iommu_group is non-NULL because we hold the drivers
>> +         * refcount.
>> +         */
>>           dev_WARN(device->dev, "Device already exists on group %d\n",
>>                iommu_group_id(group->iommu_group));
>>           vfio_device_put_registration(existing_device);
>> @@ -702,6 +724,11 @@ static int vfio_group_ioctl_set_container(struct vfio_group *group,
>>           ret = -EINVAL;
>>           goto out_unlock;
>>       }
>> +    if (!group->iommu_group) {
>> +        ret = -ENODEV;
>> +        goto out_unlock;
>> +    }
>> +
>>       container = vfio_container_from_file(f.file);
>>       ret = -EINVAL;
>>       if (container) {
>> @@ -862,6 +889,11 @@ static int vfio_group_ioctl_get_status(struct vfio_group *group,
>>       status.flags = 0;
>>       mutex_lock(&group->group_lock);
>> +    if (!group->iommu_group) {
>> +        mutex_unlock(&group->group_lock);
>> +        return -ENODEV;
>> +    }
>> +
>>       if (group->container)
>>           status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET |
>>                   VFIO_GROUP_FLAGS_VIABLE;
>> @@ -938,13 +970,6 @@ static int vfio_group_fops_release(struct inode *inode, struct file *filep)
>>       filep->private_data = NULL;
>>       mutex_lock(&group->group_lock);
>> -    /*
>> -     * Device FDs hold a group file reference, therefore the group release
>> -     * is only called when there are no open devices.
>> -     */
>> -    WARN_ON(group->notifier.head);
>> -    if (group->container)
>> -        vfio_group_detach_container(group);
>>       group->opened_file = NULL;
>>       mutex_unlock(&group->group_lock);
>>       swake_up_one(&group->opened_file_wait);
>> @@ -1553,17 +1578,41 @@ static const struct file_operations vfio_device_fops = {
>>    * @file: VFIO group file
>>    *
>>    * The returned iommu_group is valid as long as a ref is held on the file.
>> + * This function is deprecated, only the SPAPR path in kvm should call it.
>>    */
>>   struct iommu_group *vfio_file_iommu_group(struct file *file)
>>   {
>>       struct vfio_group *group = file->private_data;
>> +    struct iommu_group *iommu_group = NULL;
>> +
>> +    if (!IS_ENABLED(CONFIG_SPAPR_TCE_IOMMU))
>> +        return NULL;
>>       if (file->f_op != &vfio_group_fops)
>>           return NULL;
>> -    return group->iommu_group;
>> +
>> +    mutex_lock(&vfio.group_lock);
>> +    mutex_lock(&group->group_lock);
>> +    if (group->iommu_group) {
>> +        iommu_group = group->iommu_group;
>> +        group->preserve_iommu_group = true;
>> +    }
>> +    mutex_unlock(&group->group_lock);
>> +    mutex_unlock(&vfio.group_lock);
>> +    return iommu_group;
>>   }
>>   EXPORT_SYMBOL_GPL(vfio_file_iommu_group);
>> +/**
>> + * vfio_file_is_group - True if the file is usable with VFIO aPIS
>> + * @file: VFIO group file
>> + */
>> +bool vfio_file_is_group(struct file *file)
>> +{
>> +    return file->f_op == &vfio_group_fops;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_file_is_group);
>> +
>>   /**
>>    * vfio_file_enforced_coherent - True if the DMA associated with the VFIO file
>>    *        is always CPU cache coherent
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 73bcb92179a224..bd9faaab85de18 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -199,6 +199,7 @@ int vfio_mig_get_next_state(struct vfio_device *device,
>>    * External user API
>>    */
>>   struct iommu_group *vfio_file_iommu_group(struct file *file);
>> +bool vfio_file_is_group(struct file *file);
>>   bool vfio_file_enforced_coherent(struct file *file);
>>   void vfio_file_set_kvm(struct file *file, struct kvm *kvm);
>>   bool vfio_file_has_dev(struct file *file, struct vfio_device *device);
>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
>> index ce1b01d02c5197..54aec3b0559c70 100644
>> --- a/virt/kvm/vfio.c
>> +++ b/virt/kvm/vfio.c
>> @@ -61,6 +61,23 @@ static bool kvm_vfio_file_enforced_coherent(struct file *file)
>>       return ret;
>>   }
>> +static bool kvm_vfio_file_is_group(struct file *file)
>> +{
>> +    bool (*fn)(struct file *file);
>> +    bool ret;
>> +
>> +    fn = symbol_get(vfio_file_is_group);
>> +    if (!fn)
>> +        return false;
>> +
>> +    ret = fn(file);
>> +
>> +    symbol_put(vfio_file_is_group);
>> +
>> +    return ret;
>> +}
>> +
>> +#ifdef CONFIG_SPAPR_TCE_IOMMU
>>   static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>>   {
>>       struct iommu_group *(*fn)(struct file *file);
>> @@ -77,7 +94,6 @@ static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
>>       return ret;
>>   }
>> -#ifdef CONFIG_SPAPR_TCE_IOMMU
>>   static void kvm_spapr_tce_release_vfio_group(struct kvm *kvm,
>>                            struct kvm_vfio_group *kvg)
>>   {
>> @@ -136,7 +152,7 @@ static int kvm_vfio_group_add(struct kvm_device *dev, unsigned int fd)
>>           return -EBADF;
>>       /* Ensure the FD is a vfio group FD.*/
>> -    if (!kvm_vfio_file_iommu_group(filp)) {
>> +    if (!kvm_vfio_file_is_group(filp)) {
>>           ret = -EINVAL;
>>           goto err_fput;
>>       }

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-10-06 11:56 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-23  0:06 [PATCH v2] vfio: Follow a strict lifetime for struct iommu_group Jason Gunthorpe
2022-09-26 17:03 ` Matthew Rosato
2022-09-27 20:05   ` Alex Williamson
2022-10-04 15:19     ` Christian Borntraeger
2022-10-04 15:40       ` Jason Gunthorpe
2022-10-04 15:44         ` Christian Borntraeger
2022-10-04 16:28           ` Jason Gunthorpe
2022-10-04 17:15             ` Christian Borntraeger
2022-10-04 17:22               ` Jason Gunthorpe
2022-10-04 17:36             ` Christian Borntraeger
2022-10-04 17:48               ` Christian Borntraeger
2022-10-04 18:22               ` Matthew Rosato
2022-10-04 18:56                 ` Eric Farman
2022-10-05 13:46                 ` Matthew Rosato
2022-10-05 13:57                   ` Jason Gunthorpe
2022-10-05 14:00                     ` Christian Borntraeger
2022-10-05 14:01                     ` Jason Gunthorpe
2022-10-05 14:19                       ` Christian Borntraeger
2022-10-06 11:55                         ` Christian Borntraeger
2022-10-05 14:21                       ` Matthew Rosato
2022-10-05 15:40                         ` Matthew Rosato
2022-10-05 14:01                     ` Matthew Rosato
2022-10-04 19:59   ` Matthew Rosato
2022-10-04 20:19     ` Alex Williamson
2022-10-04 22:30       ` Jason Gunthorpe
2022-09-27  6:34 ` Yi Liu
2022-09-27 13:12   ` Jason Gunthorpe
2022-09-28  3:51 ` Tian, Kevin
2022-09-28 15:26   ` Jason Gunthorpe
2022-09-28 23:54     ` Tian, Kevin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).