iommu.lists.linux-foundation.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices
@ 2022-09-08 18:44 Jason Gunthorpe
  2022-09-08 18:44 ` [PATCH 1/4] vfio: Simplify vfio_create_group() Jason Gunthorpe
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-08 18:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

The basic issue is that the iommu_group is being used by VFIO after all
the device drivers have been removed.

In part this is caused by bad logic inside the iommu core that doesn't
sequence removing the device from the group properly, and in another part
this is bad logic in VFIO continuing to use device->iommu_group after all
VFIO device drivers have been removed.

Fix both situations. Either fix alone should fix the bug reported, but
both together bring a nice robust design to this area.

This is a followup from this thread:

https://lore.kernel.org/kvm/20220831201236.77595-1-mjrosato@linux.ibm.com/

Matthew confirmed an earlier version of the series solved the issue, it
would be best if he would test this as well to confirm the various changes
are still OK.

The iommu patch is independent of the other patches, it can go through the
iommu rc tree.

Jason Gunthorpe (4):
  vfio: Simplify vfio_create_group()
  vfio: Move the sanity check of the group to vfio_create_group()
  vfio: Follow a strict lifetime for struct iommu_group *
  iommu: Fix ordering of iommu_release_device()

 drivers/iommu/iommu.c    |  36 ++++++--
 drivers/vfio/vfio_main.c | 172 +++++++++++++++++++++------------------
 2 files changed, 120 insertions(+), 88 deletions(-)


base-commit: 245898eb9275ce31942cff95d0bdc7412ad3d589
-- 
2.37.3


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/4] vfio: Simplify vfio_create_group()
  2022-09-08 18:44 [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Jason Gunthorpe
@ 2022-09-08 18:44 ` Jason Gunthorpe
  2022-09-20 19:45   ` Matthew Rosato
  2022-09-08 18:44 ` [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group() Jason Gunthorpe
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-08 18:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

The vfio.group_lock is now only used to serialize vfio_group creation
and destruction, we don't need a micro-optimization of searching,
unlocking, then allocating and searching again. Just hold the lock
the whole time.

Rename the function to 'vfio_get_group()' to reflect that it doesn't
always create something.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/vfio_main.c | 48 ++++++++++++++++------------------------
 1 file changed, 19 insertions(+), 29 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 77264d836d5200..4ab13808b536e1 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -319,17 +319,6 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 	return NULL;
 }
 
-static struct vfio_group *
-vfio_group_get_from_iommu(struct iommu_group *iommu_group)
-{
-	struct vfio_group *group;
-
-	mutex_lock(&vfio.group_lock);
-	group = __vfio_group_get_from_iommu(iommu_group);
-	mutex_unlock(&vfio.group_lock);
-	return group;
-}
-
 static void vfio_group_release(struct device *dev)
 {
 	struct vfio_group *group = container_of(dev, struct vfio_group, dev);
@@ -376,16 +365,26 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
 	return group;
 }
 
-static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
-		enum vfio_group_type type)
+/*
+ * Return a struct vfio_group * for the given iommu_group. If no vfio_group
+ * already exists then create a new one.
+ */
+static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
+					 enum vfio_group_type type)
 {
 	struct vfio_group *group;
 	struct vfio_group *ret;
 	int err;
 
-	group = vfio_group_alloc(iommu_group, type);
-	if (IS_ERR(group))
-		return group;
+	mutex_lock(&vfio.group_lock);
+
+	ret = __vfio_group_get_from_iommu(iommu_group);
+	if (ret)
+		goto err_unlock;
+
+	group = ret = vfio_group_alloc(iommu_group, type);
+	if (IS_ERR(ret))
+		goto err_unlock;
 
 	err = dev_set_name(&group->dev, "%s%d",
 			   group->type == VFIO_NO_IOMMU ? "noiommu-" : "",
@@ -395,13 +394,6 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 		goto err_put;
 	}
 
-	mutex_lock(&vfio.group_lock);
-
-	/* Did we race creating this group? */
-	ret = __vfio_group_get_from_iommu(iommu_group);
-	if (ret)
-		goto err_unlock;
-
 	err = cdev_device_add(&group->cdev, &group->dev);
 	if (err) {
 		ret = ERR_PTR(err);
@@ -413,10 +405,10 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 	mutex_unlock(&vfio.group_lock);
 	return group;
 
-err_unlock:
-	mutex_unlock(&vfio.group_lock);
 err_put:
 	put_device(&group->dev);
+err_unlock:
+	mutex_unlock(&vfio.group_lock);
 	return ret;
 }
 
@@ -514,7 +506,7 @@ static struct vfio_group *vfio_noiommu_group_alloc(struct device *dev,
 	if (ret)
 		goto out_put_group;
 
-	group = vfio_create_group(iommu_group, type);
+	group = vfio_get_group(iommu_group, type);
 	if (IS_ERR(group)) {
 		ret = PTR_ERR(group);
 		goto out_remove_device;
@@ -564,9 +556,7 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 		return ERR_PTR(-EINVAL);
 	}
 
-	group = vfio_group_get_from_iommu(iommu_group);
-	if (!group)
-		group = vfio_create_group(iommu_group, VFIO_IOMMU);
+	group = vfio_get_group(iommu_group, VFIO_IOMMU);
 
 	/* The vfio_group holds a reference to the iommu_group */
 	iommu_group_put(iommu_group);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group()
  2022-09-08 18:44 [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Jason Gunthorpe
  2022-09-08 18:44 ` [PATCH 1/4] vfio: Simplify vfio_create_group() Jason Gunthorpe
@ 2022-09-08 18:44 ` Jason Gunthorpe
  2022-09-22 19:10   ` Alex Williamson
  2022-09-08 18:45 ` [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group * Jason Gunthorpe
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-08 18:44 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

__vfio_register_dev() has a bit of code to sanity check if an (existing)
group is not corrupted by having two copies of the same struct device in
it. This should be impossible.

It then has some complicated error unwind to uncreate the group.

Instead check if the existing group is sane at the same time we locate
it. If a bug is found then there is no error unwind, just simply fail
allocation.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/vfio_main.c | 79 ++++++++++++++++++----------------------
 1 file changed, 36 insertions(+), 43 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 4ab13808b536e1..ba8b6bed12c7e7 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -306,15 +306,15 @@ static void vfio_container_put(struct vfio_container *container)
  * Group objects - create, release, get, put, search
  */
 static struct vfio_group *
-__vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+vfio_group_find_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
+	lockdep_assert_held(&vfio.group_lock);
+
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
-		if (group->iommu_group == iommu_group) {
-			vfio_group_get(group);
+		if (group->iommu_group == iommu_group)
 			return group;
-		}
 	}
 	return NULL;
 }
@@ -365,11 +365,27 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
 	return group;
 }
 
+static bool vfio_group_has_device(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (device->dev == dev) {
+			mutex_unlock(&group->device_lock);
+			return true;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+	return false;
+}
+
 /*
  * Return a struct vfio_group * for the given iommu_group. If no vfio_group
  * already exists then create a new one.
  */
-static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
+static struct vfio_group *vfio_get_group(struct device *dev,
+					 struct iommu_group *iommu_group,
 					 enum vfio_group_type type)
 {
 	struct vfio_group *group;
@@ -378,13 +394,20 @@ static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
 
 	mutex_lock(&vfio.group_lock);
 
-	ret = __vfio_group_get_from_iommu(iommu_group);
-	if (ret)
-		goto err_unlock;
+	ret = vfio_group_find_from_iommu(iommu_group);
+	if (ret) {
+		if (WARN_ON(vfio_group_has_device(ret, dev))) {
+			ret = ERR_PTR(-EINVAL);
+			goto out_unlock;
+		}
+		/* Found an existing group */
+		vfio_group_get(ret);
+		goto out_unlock;
+	}
 
 	group = ret = vfio_group_alloc(iommu_group, type);
 	if (IS_ERR(ret))
-		goto err_unlock;
+		goto out_unlock;
 
 	err = dev_set_name(&group->dev, "%s%d",
 			   group->type == VFIO_NO_IOMMU ? "noiommu-" : "",
@@ -397,7 +420,7 @@ static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
 	err = cdev_device_add(&group->cdev, &group->dev);
 	if (err) {
 		ret = ERR_PTR(err);
-		goto err_unlock;
+		goto out_unlock;
 	}
 
 	list_add(&group->vfio_next, &vfio.group_list);
@@ -407,7 +430,7 @@ static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
 
 err_put:
 	put_device(&group->dev);
-err_unlock:
+out_unlock:
 	mutex_unlock(&vfio.group_lock);
 	return ret;
 }
@@ -454,22 +477,6 @@ static bool vfio_device_try_get(struct vfio_device *device)
 	return refcount_inc_not_zero(&device->refcount);
 }
 
-static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
-						 struct device *dev)
-{
-	struct vfio_device *device;
-
-	mutex_lock(&group->device_lock);
-	list_for_each_entry(device, &group->device_list, group_next) {
-		if (device->dev == dev && vfio_device_try_get(device)) {
-			mutex_unlock(&group->device_lock);
-			return device;
-		}
-	}
-	mutex_unlock(&group->device_lock);
-	return NULL;
-}
-
 /*
  * VFIO driver API
  */
@@ -506,7 +513,7 @@ static struct vfio_group *vfio_noiommu_group_alloc(struct device *dev,
 	if (ret)
 		goto out_put_group;
 
-	group = vfio_get_group(iommu_group, type);
+	group = vfio_get_group(dev, iommu_group, type);
 	if (IS_ERR(group)) {
 		ret = PTR_ERR(group);
 		goto out_remove_device;
@@ -556,7 +563,7 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 		return ERR_PTR(-EINVAL);
 	}
 
-	group = vfio_get_group(iommu_group, VFIO_IOMMU);
+	group = vfio_get_group(dev, iommu_group, VFIO_IOMMU);
 
 	/* The vfio_group holds a reference to the iommu_group */
 	iommu_group_put(iommu_group);
@@ -566,8 +573,6 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 static int __vfio_register_dev(struct vfio_device *device,
 		struct vfio_group *group)
 {
-	struct vfio_device *existing_device;
-
 	if (IS_ERR(group))
 		return PTR_ERR(group);
 
@@ -578,18 +583,6 @@ static int __vfio_register_dev(struct vfio_device *device,
 	if (!device->dev_set)
 		vfio_assign_device_set(device, device);
 
-	existing_device = vfio_group_get_device(group, device->dev);
-	if (existing_device) {
-		dev_WARN(device->dev, "Device already exists on group %d\n",
-			 iommu_group_id(group->iommu_group));
-		vfio_device_put(existing_device);
-		if (group->type == VFIO_NO_IOMMU ||
-		    group->type == VFIO_EMULATED_IOMMU)
-			iommu_group_remove_device(device->dev);
-		vfio_group_put(group);
-		return -EBUSY;
-	}
-
 	/* Our reference on group is moved to the device */
 	device->group = group;
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group *
  2022-09-08 18:44 [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Jason Gunthorpe
  2022-09-08 18:44 ` [PATCH 1/4] vfio: Simplify vfio_create_group() Jason Gunthorpe
  2022-09-08 18:44 ` [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group() Jason Gunthorpe
@ 2022-09-08 18:45 ` Jason Gunthorpe
  2022-09-20 19:32   ` Matthew Rosato
  2022-09-08 18:45 ` [PATCH 4/4] iommu: Fix ordering of iommu_release_device() Jason Gunthorpe
  2022-09-09 12:49 ` [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Matthew Rosato
  4 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-08 18:45 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

The iommu_group comes from the struct device that a driver has been bound
to and then created a struct vfio_device against. To keep the iommu layer
sane we want to have a simple rule that only an attached driver should be
using the iommu API. Particularly only an attached driver should hold
ownership.

In VFIO's case since it uses the group APIs and it shares between
different drivers it is a bit more complicated, but the principle still
holds.

Solve this by waiting for all users of the vfio_group to stop before
allowing vfio_unregister_group_dev() to complete. This is done with a new
completion to know when the users go away and an additional refcount to
keep track of how many device drivers are sharing the vfio group. The last
driver to be unregistered will clean up the group.

This solves crashes in the S390 iommu driver that come because VFIO ends
up racing releasing ownership (which attaches the default iommu_domain to
the device) with the removal of that same device from the iommu
driver. This is a side case that iommu drivers should not have to cope
with.

   iommu driver failed to attach the default/blocking domain
   WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
   Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
   CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
   Hardware name: IBM 3931 A01 782 (LPAR)
   Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
              R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
   Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
              00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
              00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
              00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
   Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
                          000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
                         #000000095bb10d24: af000000            mc      0,0
                         >000000095bb10d28: b904002a            lgr     %r2,%r10
                          000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
                          000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
                          000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
                          000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
   Call Trace:
    [<000000095bb10d28>] iommu_detach_group+0x70/0x80
   ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
    [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
    [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
    [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
   pci 0004:00:00.0: Removing from iommu group 4
    [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
    [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
    [<000000095be6c072>] system_call+0x82/0xb0
   Last Breaking-Event-Address:
    [<000000095be4bf80>] __warn_printk+0x60/0x68

It reflects that domain->ops->attach_dev() failed because the driver has
already passed the point of destructing the device.

Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/vfio/vfio_main.c | 63 ++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index ba8b6bed12c7e7..3bd6ec4cdd5b26 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -66,7 +66,15 @@ struct vfio_container {
 struct vfio_group {
 	struct device 			dev;
 	struct cdev			cdev;
+	/*
+	 * When drivers is non-zero a driver is attached to the struct device
+	 * that provided the iommu_group and thus the iommu_group is a valid
+	 * pointer. When drivers is 0 the driver is being detached. Once users
+	 * reaches 0 then the iommu_group is invalid.
+	 */
+	refcount_t			drivers;
 	refcount_t			users;
+	struct completion		users_comp;
 	unsigned int			container_users;
 	struct iommu_group		*iommu_group;
 	struct vfio_container		*container;
@@ -276,8 +284,6 @@ void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
 }
 EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
 
-static void vfio_group_get(struct vfio_group *group);
-
 /*
  * Container objects - containers are created when /dev/vfio/vfio is
  * opened, but their lifecycle extends until the last user is done, so
@@ -353,6 +359,8 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
 	group->cdev.owner = THIS_MODULE;
 
 	refcount_set(&group->users, 1);
+	refcount_set(&group->drivers, 1);
+	init_completion(&group->users_comp);
 	init_rwsem(&group->group_rwsem);
 	INIT_LIST_HEAD(&group->device_list);
 	mutex_init(&group->device_lock);
@@ -401,7 +409,7 @@ static struct vfio_group *vfio_get_group(struct device *dev,
 			goto out_unlock;
 		}
 		/* Found an existing group */
-		vfio_group_get(ret);
+		refcount_inc(&ret->drivers);
 		goto out_unlock;
 	}
 
@@ -437,8 +445,36 @@ static struct vfio_group *vfio_get_group(struct device *dev,
 
 static void vfio_group_put(struct vfio_group *group)
 {
-	if (!refcount_dec_and_mutex_lock(&group->users, &vfio.group_lock))
+	if (refcount_dec_and_test(&group->users))
+		complete(&group->users_comp);
+}
+
+static void vfio_group_remove(struct vfio_group *group)
+{
+	/* Pairs with vfio_create_group() */
+	if (!refcount_dec_and_mutex_lock(&group->drivers, &vfio.group_lock))
 		return;
+	list_del(&group->vfio_next);
+
+	/*
+	 * We could concurrently probe another driver in the group that might
+	 * race vfio_group_remove() with vfio_get_group(), so we have to ensure
+	 * that the sysfs is all cleaned up under lock otherwise the
+	 * cdev_device_add() will fail due to the name aready existing.
+	 */
+	cdev_device_del(&group->cdev, &group->dev);
+	mutex_unlock(&vfio.group_lock);
+
+	/* Matches the get from vfio_group_alloc() */
+	vfio_group_put(group);
+
+	/*
+	 * Before we allow the last driver in the group to be unplugged the
+	 * group must be sanitized so nothing else is or can reference it. This
+	 * is because the group->iommu_group pointer should only be used so long
+	 * as a device driver is attached to a device in the group.
+	 */
+	wait_for_completion(&group->users_comp);
 
 	/*
 	 * These data structures all have paired operations that can only be
@@ -449,19 +485,11 @@ static void vfio_group_put(struct vfio_group *group)
 	WARN_ON(!list_empty(&group->device_list));
 	WARN_ON(group->container || group->container_users);
 	WARN_ON(group->notifier.head);
-
-	list_del(&group->vfio_next);
-	cdev_device_del(&group->cdev, &group->dev);
-	mutex_unlock(&vfio.group_lock);
+	group->iommu_group = NULL;
 
 	put_device(&group->dev);
 }
 
-static void vfio_group_get(struct vfio_group *group)
-{
-	refcount_inc(&group->users);
-}
-
 /*
  * Device objects - create, release, get, put, search
  */
@@ -573,6 +601,10 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 static int __vfio_register_dev(struct vfio_device *device,
 		struct vfio_group *group)
 {
+	/*
+	 * In all cases group is the output of one of the group allocation
+	 * functions and we have group->drivers incremented for us.
+	 */
 	if (IS_ERR(group))
 		return PTR_ERR(group);
 
@@ -683,8 +715,7 @@ void vfio_unregister_group_dev(struct vfio_device *device)
 	if (group->type == VFIO_NO_IOMMU || group->type == VFIO_EMULATED_IOMMU)
 		iommu_group_remove_device(device->dev);
 
-	/* Matches the get in vfio_register_group_dev() */
-	vfio_group_put(group);
+	vfio_group_remove(group);
 }
 EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
 
@@ -1272,7 +1303,7 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 
 	down_write(&group->group_rwsem);
 
-	/* users can be zero if this races with vfio_group_put() */
+	/* users can be zero if this races with vfio_group_remove() */
 	if (!refcount_inc_not_zero(&group->users)) {
 		ret = -ENODEV;
 		goto err_unlock;
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-08 18:44 [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2022-09-08 18:45 ` [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group * Jason Gunthorpe
@ 2022-09-08 18:45 ` Jason Gunthorpe
  2022-09-08 21:05   ` Robin Murphy
  2022-09-09 12:49 ` [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Matthew Rosato
  4 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-08 18:45 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

default domains created a situation where the device is always connected
to a domain of some kind. When the device is idle it is connected to one
of the two pre-existing domains in the group, blocking_domain or
default_domain. In this way we have a continuous assertion of what state
the transation is in.

When this is all destructed then we need to remove all the devices from
their domains via the ops->release_device() call before the domain can be
freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix
use-after-free in iommu_release_device").

However, we must also stop any concurrent access to the iommu driver for
this device before we destroy it. This is done by:

 1) Drivers only using the iommu API while they have a device driver
    attached to the device. This directly prevents release from happening.

 2) Removing the device from the group list so any lingering group
    references no longer refer to the device. This is done by
    iommu_group_remove_device()

Since iommu_group_remove_device() has been moved this breaks #2 and
triggers an WARN when VFIO races group activities with the release of the
device:

   iommu driver failed to attach the default/blocking domain
   WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
   Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6>
   CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
   Hardware name: IBM 3931 A01 782 (LPAR)
   Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
	      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
   Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
	      00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
	      00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
	      00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
   Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
			  000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
			 #000000095bb10d24: af000000            mc      0,0
			 >000000095bb10d28: b904002a            lgr     %r2,%r10
			  000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
			  000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
			  000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
			  000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
   Call Trace:
    [<000000095bb10d28>] iommu_detach_group+0x70/0x80
   ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
    [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
    [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
    [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
   pci 0004:00:00.0: Removing from iommu group 4
    [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
    [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
    [<000000095be6c072>] system_call+0x82/0xb0
   Last Breaking-Event-Address:
    [<000000095be4bf80>] __warn_printk+0x60/0x68

So, put things in the right order:
 - Remove the device from the group's list
 - Release the device from the iommu driver to drop all domain references
 - Free the domains

This is done by splitting out the kobject_put(), which triggers
iommu_group_release(), from the rest of iommu_group_remove_device() and
placing it after release is called.

Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..c451bf715182ac 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -90,6 +90,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static void __iommu_group_remove_device(struct device *dev);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev)
 
 void iommu_release_device(struct device *dev)
 {
+	struct iommu_group *group = dev->iommu_group;
 	const struct iommu_ops *ops;
 
 	if (!dev->iommu)
@@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev)
 
 	iommu_device_unlink(dev->iommu->iommu_dev, dev);
 
+	__iommu_group_remove_device(dev);
 	ops = dev_iommu_ops(dev);
 	if (ops->release_device)
 		ops->release_device(dev);
 
-	iommu_group_remove_device(dev);
+	/*
+	 * This will eventually call iommu_group_release() which will free the
+	 * iommu_domains. Up until the release_device() above the iommu_domains
+	 * may still have been associated with the device, and we cannot free
+	 * them until the have been detached. release_device() is expected to
+	 * detach all domains connected to the dev.
+	 */
+	kobject_put(group->devices_kobj);
+
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 }
@@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
-/**
- * iommu_group_remove_device - remove a device from it's current group
- * @dev: device to be removed
- *
- * This function is called by an iommu driver to remove the device from
- * it's current group.  This decrements the iommu group reference count.
- */
-void iommu_group_remove_device(struct device *dev)
+static void __iommu_group_remove_device(struct device *dev)
 {
 	struct iommu_group *group = dev->iommu_group;
 	struct group_device *tmp_device, *device = NULL;
@@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev)
 	kfree(device->name);
 	kfree(device);
 	dev->iommu_group = NULL;
+}
+
+/**
+ * iommu_group_remove_device - remove a device from it's current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * it's current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+
+	__iommu_group_remove_device(dev);
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-08 18:45 ` [PATCH 4/4] iommu: Fix ordering of iommu_release_device() Jason Gunthorpe
@ 2022-09-08 21:05   ` Robin Murphy
  2022-09-08 21:27     ` Robin Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2022-09-08 21:05 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, iommu,
	Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato

On 2022-09-08 19:45, Jason Gunthorpe wrote:
> default domains created a situation where the device is always connected
> to a domain of some kind. When the device is idle it is connected to one
> of the two pre-existing domains in the group, blocking_domain or
> default_domain. In this way we have a continuous assertion of what state
> the transation is in.
> 
> When this is all destructed then we need to remove all the devices from
> their domains via the ops->release_device() call before the domain can be
> freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix
> use-after-free in iommu_release_device").
> 
> However, we must also stop any concurrent access to the iommu driver for
> this device before we destroy it. This is done by:
> 
>   1) Drivers only using the iommu API while they have a device driver
>      attached to the device. This directly prevents release from happening.
> 
>   2) Removing the device from the group list so any lingering group
>      references no longer refer to the device. This is done by
>      iommu_group_remove_device()
> 
> Since iommu_group_remove_device() has been moved this breaks #2 and
> triggers an WARN when VFIO races group activities with the release of the
> device:
> 
>     iommu driver failed to attach the default/blocking domain
>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6>
>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>     Hardware name: IBM 3931 A01 782 (LPAR)
>     Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
> 	      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
> 	      00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
> 	      00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
> 	      00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
> 			  000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
> 			 #000000095bb10d24: af000000            mc      0,0
> 			 >000000095bb10d28: b904002a            lgr     %r2,%r10
> 			  000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
> 			  000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
> 			  000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
> 			  000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>     Call Trace:
>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>     pci 0004:00:00.0: Removing from iommu group 4
>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>      [<000000095be6c072>] system_call+0x82/0xb0
>     Last Breaking-Event-Address:
>      [<000000095be4bf80>] __warn_printk+0x60/0x68
> 
> So, put things in the right order:
>   - Remove the device from the group's list
>   - Release the device from the iommu driver to drop all domain references
>   - Free the domains
> 
> This is done by splitting out the kobject_put(), which triggers
> iommu_group_release(), from the rest of iommu_group_remove_device() and
> placing it after release is called.

So simple... now how did I fail to think of that? :)

> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++---------
>   1 file changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 780fb70715770d..c451bf715182ac 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -90,6 +90,7 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
>   static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>   static ssize_t iommu_group_store_type(struct iommu_group *group,
>   				      const char *buf, size_t count);
> +static void __iommu_group_remove_device(struct device *dev);
>   
>   #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
>   struct iommu_group_attribute iommu_group_attr_##_name =		\
> @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev)
>   
>   void iommu_release_device(struct device *dev)
>   {
> +	struct iommu_group *group = dev->iommu_group;
>   	const struct iommu_ops *ops;
>   
>   	if (!dev->iommu)
> @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev)
>   
>   	iommu_device_unlink(dev->iommu->iommu_dev, dev);
>   

In fact, now that you've made it obvious, could we not simply do an 
extra kobject_get() here before calling regular 
iommu_group_remove_device(), and avoid having to split that up at all? 
That should delay any default domain teardown just as definitively as 
holding the original reference for longer, no?

Thanks,
Robin.

> +	__iommu_group_remove_device(dev);
>   	ops = dev_iommu_ops(dev);
>   	if (ops->release_device)
>   		ops->release_device(dev);
>   
> -	iommu_group_remove_device(dev);
> +	/*
> +	 * This will eventually call iommu_group_release() which will free the
> +	 * iommu_domains. Up until the release_device() above the iommu_domains
> +	 * may still have been associated with the device, and we cannot free
> +	 * them until the have been detached. release_device() is expected to
> +	 * detach all domains connected to the dev.
> +	 */
> +	kobject_put(group->devices_kobj);
> +
>   	module_put(ops->owner);
>   	dev_iommu_free(dev);
>   }
> @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_add_device);
>   
> -/**
> - * iommu_group_remove_device - remove a device from it's current group
> - * @dev: device to be removed
> - *
> - * This function is called by an iommu driver to remove the device from
> - * it's current group.  This decrements the iommu group reference count.
> - */
> -void iommu_group_remove_device(struct device *dev)
> +static void __iommu_group_remove_device(struct device *dev)
>   {
>   	struct iommu_group *group = dev->iommu_group;
>   	struct group_device *tmp_device, *device = NULL;
> @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev)
>   	kfree(device->name);
>   	kfree(device);
>   	dev->iommu_group = NULL;
> +}
> +
> +/**
> + * iommu_group_remove_device - remove a device from it's current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * it's current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +
> +	__iommu_group_remove_device(dev);
>   	kobject_put(group->devices_kobj);
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_remove_device);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-08 21:05   ` Robin Murphy
@ 2022-09-08 21:27     ` Robin Murphy
  2022-09-08 21:43       ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2022-09-08 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, iommu,
	Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Matthew Rosato

On 2022-09-08 22:05, Robin Murphy wrote:
> On 2022-09-08 19:45, Jason Gunthorpe wrote:
>> default domains created a situation where the device is always connected
>> to a domain of some kind. When the device is idle it is connected to one
>> of the two pre-existing domains in the group, blocking_domain or
>> default_domain. In this way we have a continuous assertion of what state
>> the transation is in.
>>
>> When this is all destructed then we need to remove all the devices from
>> their domains via the ops->release_device() call before the domain can be
>> freed. This is the bug recognized in commit 9ac8545199a1 ("iommu: Fix
>> use-after-free in iommu_release_device").
>>
>> However, we must also stop any concurrent access to the iommu driver for
>> this device before we destroy it. This is done by:
>>
>>   1) Drivers only using the iommu API while they have a device driver
>>      attached to the device. This directly prevents release from 
>> happening.
>>
>>   2) Removing the device from the group list so any lingering group
>>      references no longer refer to the device. This is done by
>>      iommu_group_remove_device()
>>
>> Since iommu_group_remove_device() has been moved this breaks #2 and
>> triggers an WARN when VFIO races group activities with the release of the
>> device:
>>
>>     iommu driver failed to attach the default/blocking domain
>>     WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 
>> iommu_detach_group+0x6c/0x80
>>     Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core 
>> irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6>
>>     CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        
>> W          6.0.0-rc3 #5
>>     Hardware name: IBM 3931 A01 782 (LPAR)
>>     Krnl PSW : 0704c00180000000 000000095bb10d28 
>> (iommu_detach_group+0x70/0x80)
>>           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>     Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 
>> 000000095c97ffe0
>>           00000000fffeffff 00000009fc290000 00000000af1fda50 
>> 00000000af590b58
>>           00000000af1fdaf0 0000000135c7a320 0000000135e52258 
>> 0000000135e52200
>>           00000000a29e8000 00000000af590b40 000000095bb10d24 
>> 0000038004b13c98
>>     Krnl Code: 000000095bb10d18: c020003d56fc        larl    
>> %r2,000000095c2bbb10
>>               000000095bb10d1e: c0e50019d901        brasl   
>> %r14,000000095be4bf20
>>              #000000095bb10d24: af000000            mc      0,0
>>              >000000095bb10d28: b904002a            lgr     %r2,%r10
>>               000000095bb10d2c: ebaff0a00004        lmg     
>> %r10,%r15,160(%r15)
>>               000000095bb10d32: c0f4001aa867        brcl    
>> 15,000000095be65e00
>>               000000095bb10d38: c004002168e0        brcl    
>> 0,000000095bf3def8
>>               000000095bb10d3e: eb6ff0480024        stmg    
>> %r6,%r15,72(%r15)
>>     Call Trace:
>>      [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>>     ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>>      [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 
>> [vfio_iommu_type1]
>>      [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>>      [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>>     pci 0004:00:00.0: Removing from iommu group 4
>>      [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>>      [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>>      [<000000095be6c072>] system_call+0x82/0xb0
>>     Last Breaking-Event-Address:
>>      [<000000095be4bf80>] __warn_printk+0x60/0x68
>>
>> So, put things in the right order:
>>   - Remove the device from the group's list
>>   - Release the device from the iommu driver to drop all domain 
>> references
>>   - Free the domains
>>
>> This is done by splitting out the kobject_put(), which triggers
>> iommu_group_release(), from the rest of iommu_group_remove_device() and
>> placing it after release is called.
> 
> So simple... now how did I fail to think of that? :)

Oh, because s390 is using iommu_get_domain_for_dev() in its 
release_device callback, which needs to dereference the group to work, 
and the current domain may also be a non-default one which we can't 
prevent from disappearing racily, that was why :(

I think we may be back to looking at s390 having to keep track of 
domains internally like SMMUv3 does, and both drivers synchronising 
between their domain_free and release_device to to do their internal 
detach from either place... unless possibly the core code starts 
refcounting domains as well?

Robin.

>> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
>> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
>> Cc: Robin Murphy <robin.murphy@arm.com>
>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>> ---
>>   drivers/iommu/iommu.c | 36 +++++++++++++++++++++++++++---------
>>   1 file changed, 27 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 780fb70715770d..c451bf715182ac 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -90,6 +90,7 @@ static int 
>> iommu_create_device_direct_mappings(struct iommu_group *group,
>>   static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
>>   static ssize_t iommu_group_store_type(struct iommu_group *group,
>>                         const char *buf, size_t count);
>> +static void __iommu_group_remove_device(struct device *dev);
>>   #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)        \
>>   struct iommu_group_attribute iommu_group_attr_##_name =        \
>> @@ -330,6 +331,7 @@ int iommu_probe_device(struct device *dev)
>>   void iommu_release_device(struct device *dev)
>>   {
>> +    struct iommu_group *group = dev->iommu_group;
>>       const struct iommu_ops *ops;
>>       if (!dev->iommu)
>> @@ -337,11 +339,20 @@ void iommu_release_device(struct device *dev)
>>       iommu_device_unlink(dev->iommu->iommu_dev, dev);
> 
> In fact, now that you've made it obvious, could we not simply do an 
> extra kobject_get() here before calling regular 
> iommu_group_remove_device(), and avoid having to split that up at all? 
> That should delay any default domain teardown just as definitively as 
> holding the original reference for longer, no?
> 
> Thanks,
> Robin.
> 
>> +    __iommu_group_remove_device(dev);
>>       ops = dev_iommu_ops(dev);
>>       if (ops->release_device)
>>           ops->release_device(dev);
>> -    iommu_group_remove_device(dev);
>> +    /*
>> +     * This will eventually call iommu_group_release() which will 
>> free the
>> +     * iommu_domains. Up until the release_device() above the 
>> iommu_domains
>> +     * may still have been associated with the device, and we cannot 
>> free
>> +     * them until the have been detached. release_device() is 
>> expected to
>> +     * detach all domains connected to the dev.
>> +     */
>> +    kobject_put(group->devices_kobj);
>> +
>>       module_put(ops->owner);
>>       dev_iommu_free(dev);
>>   }
>> @@ -939,14 +950,7 @@ int iommu_group_add_device(struct iommu_group 
>> *group, struct device *dev)
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_add_device);
>> -/**
>> - * iommu_group_remove_device - remove a device from it's current group
>> - * @dev: device to be removed
>> - *
>> - * This function is called by an iommu driver to remove the device from
>> - * it's current group.  This decrements the iommu group reference count.
>> - */
>> -void iommu_group_remove_device(struct device *dev)
>> +static void __iommu_group_remove_device(struct device *dev)
>>   {
>>       struct iommu_group *group = dev->iommu_group;
>>       struct group_device *tmp_device, *device = NULL;
>> @@ -977,6 +981,20 @@ void iommu_group_remove_device(struct device *dev)
>>       kfree(device->name);
>>       kfree(device);
>>       dev->iommu_group = NULL;
>> +}
>> +
>> +/**
>> + * iommu_group_remove_device - remove a device from it's current group
>> + * @dev: device to be removed
>> + *
>> + * This function is called by an iommu driver to remove the device from
>> + * it's current group.  This decrements the iommu group reference count.
>> + */
>> +void iommu_group_remove_device(struct device *dev)
>> +{
>> +    struct iommu_group *group = dev->iommu_group;
>> +
>> +    __iommu_group_remove_device(dev);
>>       kobject_put(group->devices_kobj);
>>   }
>>   EXPORT_SYMBOL_GPL(iommu_group_remove_device);
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-08 21:27     ` Robin Murphy
@ 2022-09-08 21:43       ` Jason Gunthorpe
  2022-09-09  9:05         ` Robin Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-08 21:43 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:

> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
> callback, which needs to dereference the group to work, and the current
> domain may also be a non-default one which we can't prevent from
> disappearing racily, that was why :(

Hum, the issue there is the use of device->iommu_group - but that just
means I didn't split properly. How about this incremental:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index c451bf715182ac..99ef799f3fe6b5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -351,6 +351,7 @@ void iommu_release_device(struct device *dev)
 	 * them until the have been detached. release_device() is expected to
 	 * detach all domains connected to the dev.
 	 */
+	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
 
 	module_put(ops->owner);
@@ -980,7 +981,6 @@ static void __iommu_group_remove_device(struct device *dev)
 
 	kfree(device->name);
 	kfree(device);
-	dev->iommu_group = NULL;
 }
 
 /**
@@ -995,6 +995,7 @@ void iommu_group_remove_device(struct device *dev)
 	struct iommu_group *group = dev->iommu_group;
 
 	__iommu_group_remove_device(dev);
+	dev->iommu_group = NULL;
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);

To me it makes sense that the driver should be able to continue to
query the iommu_group during release anyhow..

And to your other question, the reason I split the function is because
I couldn't really say WTF iommu_group_remove_device() was supposed to
do. The __ version make ssense as part of the remove_device, due to
the sequencing with ops->release()

But the other one doesn't have that. So I want to put in a:

   WARN_ON(group->blocking_domain || group->default_domain);

Because calling it after those domains are allocated looks broken to
me.

Jason

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-08 21:43       ` Jason Gunthorpe
@ 2022-09-09  9:05         ` Robin Murphy
  2022-09-09 13:25           ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2022-09-09  9:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On 2022-09-08 22:43, Jason Gunthorpe wrote:
> On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
> 
>> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
>> callback, which needs to dereference the group to work, and the current
>> domain may also be a non-default one which we can't prevent from
>> disappearing racily, that was why :(
> 
> Hum, the issue there is the use of device->iommu_group - but that just
> means I didn't split properly. How about this incremental:

That did cross my mind, but it's a bit grim. In the light of the 
morning, I'm not sure s390 actually *needs* the group anyway - AFAICS if 
iommu_group_remove_device() has been processed first, that will have 
synchronised against any concurrent attach/detach, so zdev->s390_domain 
can be assumed to be up to date and used directly without the round trip 
through iommu_get_domain_for_dev(). That then only leaves the issue that 
that domain may still become invalid at any point after the group mutex 
has been dropped.

> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index c451bf715182ac..99ef799f3fe6b5 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -351,6 +351,7 @@ void iommu_release_device(struct device *dev)
>   	 * them until the have been detached. release_device() is expected to
>   	 * detach all domains connected to the dev.
>   	 */
> +	dev->iommu_group = NULL;
>   	kobject_put(group->devices_kobj);
>   
>   	module_put(ops->owner);
> @@ -980,7 +981,6 @@ static void __iommu_group_remove_device(struct device *dev)
>   
>   	kfree(device->name);
>   	kfree(device);
> -	dev->iommu_group = NULL;
>   }
>   
>   /**
> @@ -995,6 +995,7 @@ void iommu_group_remove_device(struct device *dev)
>   	struct iommu_group *group = dev->iommu_group;
>   
>   	__iommu_group_remove_device(dev);
> +	dev->iommu_group = NULL;
>   	kobject_put(group->devices_kobj);
>   }
>   EXPORT_SYMBOL_GPL(iommu_group_remove_device);
> 
> To me it makes sense that the driver should be able to continue to
> query the iommu_group during release anyhow..

I'm not so sure, release shouldn't be depending on a group since there 
may never have been one anyway. Perhaps the answer is an extra 
pre-release step to balance probe_finalize?

> And to your other question, the reason I split the function is because
> I couldn't really say WTF iommu_group_remove_device() was supposed to
> do. The __ version make ssense as part of the remove_device, due to
> the sequencing with ops->release()
> 
> But the other one doesn't have that. So I want to put in a:
> 
>     WARN_ON(group->blocking_domain || group->default_domain);
> 
> Because calling it after those domains are allocated looks broken to
> me.

I might be misunderstanding, but that sounds backwards - if a real 
device is being hotplugged out, we absolutely expect that to happen 
*after* its default domain has been set up. The external callers are 
using fake groups where default domains aren't relevant, and I have no 
idea what PAMU is doing but it's been doing it for long enough that it 
most likely isn't a problem. Thus wherever that check would be it would 
seem either wrong or unnecessary.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices
  2022-09-08 18:44 [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2022-09-08 18:45 ` [PATCH 4/4] iommu: Fix ordering of iommu_release_device() Jason Gunthorpe
@ 2022-09-09 12:49 ` Matthew Rosato
  2022-09-09 16:24   ` Jason Gunthorpe
  4 siblings, 1 reply; 24+ messages in thread
From: Matthew Rosato @ 2022-09-09 12:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, iommu,
	Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Robin Murphy

On 9/8/22 2:44 PM, Jason Gunthorpe wrote:
> The basic issue is that the iommu_group is being used by VFIO after all
> the device drivers have been removed.
> 
> In part this is caused by bad logic inside the iommu core that doesn't
> sequence removing the device from the group properly, and in another part
> this is bad logic in VFIO continuing to use device->iommu_group after all
> VFIO device drivers have been removed.
> 
> Fix both situations. Either fix alone should fix the bug reported, but
> both together bring a nice robust design to this area.
> 
> This is a followup from this thread:
> 
> https://lore.kernel.org/kvm/20220831201236.77595-1-mjrosato@linux.ibm.com/
> 
> Matthew confirmed an earlier version of the series solved the issue, it
> would be best if he would test this as well to confirm the various changes
> are still OK.

FYI I've been running this series (+ the incremental to patch 4 you mentioned) against my original repro scenario in a loop overnight, looks good.

> 
> The iommu patch is independent of the other patches, it can go through the
> iommu rc tree.
> 
> Jason Gunthorpe (4):
>   vfio: Simplify vfio_create_group()
>   vfio: Move the sanity check of the group to vfio_create_group()
>   vfio: Follow a strict lifetime for struct iommu_group *
>   iommu: Fix ordering of iommu_release_device()
> 
>  drivers/iommu/iommu.c    |  36 ++++++--
>  drivers/vfio/vfio_main.c | 172 +++++++++++++++++++++------------------
>  2 files changed, 120 insertions(+), 88 deletions(-)
> 
> 
> base-commit: 245898eb9275ce31942cff95d0bdc7412ad3d589


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-09  9:05         ` Robin Murphy
@ 2022-09-09 13:25           ` Jason Gunthorpe
  2022-09-09 17:57             ` Robin Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-09 13:25 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote:
> On 2022-09-08 22:43, Jason Gunthorpe wrote:
> > On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
> > 
> > > Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
> > > callback, which needs to dereference the group to work, and the current
> > > domain may also be a non-default one which we can't prevent from
> > > disappearing racily, that was why :(
> > 
> > Hum, the issue there is the use of device->iommu_group - but that just
> > means I didn't split properly. How about this incremental:
> 
> That did cross my mind, but it's a bit grim.

Actually, also in my morning, I think it may not even be necessary.

Keep in mind the start of the series fixes VFIO.

The bug that S390 is trying to fix is that VFIO didn't put back the
group ownership, it just left its own iommu_domain attached and called
release().

But now, at least for single device groups, VFIO will put owenership
back and zdev->s390_domain == NULL when we get to release_device()

> That then only leaves the issue that that domain may still become
> invalid at any point after the group mutex has been dropped.

So that is this race:

        CPU0                         CPU1 
   iommu_release_device(a)
      __iommu_group_remove_device(a)
			         iommu_device_use_default_domain(b)
                                 iommu_domain_free(domain)
                                 iommu_release_device(b)
                                      ops->release_device(b)
      ops->release_device(a) 
        // Boom, a is still attached to domain :(

I can't think of how to solve this other than holding the group mutex
across release_device. See below.

> > And to your other question, the reason I split the function is because
> > I couldn't really say WTF iommu_group_remove_device() was supposed to
> > do. The __ version make ssense as part of the remove_device, due to
> > the sequencing with ops->release()
> > 
> > But the other one doesn't have that. So I want to put in a:
> > 
> >     WARN_ON(group->blocking_domain || group->default_domain);
> > 
> > Because calling it after those domains are allocated looks broken to
> > me.
> 
> I might be misunderstanding, but that sounds backwards - if a real device is
> being hotplugged out, we absolutely expect that to happen *after* its
> default domain has been set up.

See below for what I mean

iommu_group_remove_device() doesn't work as an API because it has no
way to tell the device to stop using the domain we are about to free.

So it should assert that there is no domain to worry about. For the
vfio and power case there is no domain because they don't use iommu
drivers

For FSL it does not use default domains so it will also be OK.

Also, I think FSL is broken and it should not be trying to remove the
"PCI controller" from a group. Every PCI device behind an IOMMU should
be linked to a group.

The only reason I can think someone would wanted to do this is because
they ran into trouble with the VFIO viability checks. But we have a
robust solution to that now that doesn't require abusing the group
like this.

Jason

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..cb83576b1877d5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev);
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev)
 
 void iommu_release_device(struct device *dev)
 {
+	struct iommu_group *group = dev->iommu_group;
 	const struct iommu_ops *ops;
+	struct group_device *device;
 
 	if (!dev->iommu)
 		return;
 
 	iommu_device_unlink(dev->iommu->iommu_dev, dev);
 
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
 	ops = dev_iommu_ops(dev);
+
+	/*
+	 * If the group has become empty then ownership must have been released,
+	 * and the current domain must be set back to NULL or the default
+	 * domain.
+	 */
+	if (list_empty(&group->devices))
+		WARN_ON(group->owner_cnt ||
+			group->domain != group->default_domain);
+
+	/*
+	 * release_device() must stop using any attached domain on the device.
+	 * If there are still other devices in the group they are not effected
+	 * by this callback.
+	 *
+	 * The IOMMU driver must set the device to either an identity or
+	 * blocking translation and stop using any domain pointer, as it is
+	 * going to be freed.
+	 */
 	if (ops->release_device)
 		ops->release_device(dev);
+	mutex_unlock(&group->mutex);
+
+	__iommu_group_remove_device_sysfs(group, device);
+
+	/*
+	 * This will eventually call iommu_group_release() which will free the
+	 * iommu_domains.
+	 */
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
 
-	iommu_group_remove_device(dev);
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 }
@@ -939,44 +975,69 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
-/**
- * iommu_group_remove_device - remove a device from it's current group
- * @dev: device to be removed
- *
- * This function is called by an iommu driver to remove the device from
- * it's current group.  This decrements the iommu group reference count.
- */
-void iommu_group_remove_device(struct device *dev)
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev)
 {
-	struct iommu_group *group = dev->iommu_group;
-	struct group_device *tmp_device, *device = NULL;
+	struct group_device *device;
+
+	lockdep_assert_held(&group->mutex);
 
 	if (!group)
-		return;
+		return NULL;
 
 	dev_info(dev, "Removing from iommu group %d\n", group->id);
 
-	mutex_lock(&group->mutex);
-	list_for_each_entry(tmp_device, &group->devices, list) {
-		if (tmp_device->dev == dev) {
-			device = tmp_device;
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
 			list_del(&device->list);
-			break;
+			return device;
 		}
 	}
-	mutex_unlock(&group->mutex);
+	return NULL;
+}
 
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device)
+{
 	if (!device)
 		return;
 
 	sysfs_remove_link(group->devices_kobj, device->name);
-	sysfs_remove_link(&dev->kobj, "iommu_group");
+	sysfs_remove_link(&device->dev->kobj, "iommu_group");
 
-	trace_remove_device_from_group(group->id, dev);
+	trace_remove_device_from_group(group->id, device->dev);
 
 	kfree(device->name);
 	kfree(device);
-	dev->iommu_group = NULL;
+}
+
+/**
+ * iommu_group_remove_device - remove a device from it's current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * it's current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct group_device *device;
+
+	/*
+	 * Since we don't do ops->release_device() there is no way for the
+	 * driver to stop using any attached domain before we free it. If a
+	 * domain is attached while this function is called it will eventually
+	 * UAF.
+	 *
+	 * Thus it is only useful for cases like VFIO/SPAPR that don't use an
+	 * iommu driver, or for cases like FSL that don't use default domains.
+	 */
+	WARN_ON(group->domain);
+
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
+	mutex_unlock(&group->mutex);
+	__iommu_group_remove_device_sysfs(group, device);
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices
  2022-09-09 12:49 ` [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Matthew Rosato
@ 2022-09-09 16:24   ` Jason Gunthorpe
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-09 16:24 UTC (permalink / raw)
  To: Matthew Rosato
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Robin Murphy

On Fri, Sep 09, 2022 at 08:49:40AM -0400, Matthew Rosato wrote:
> On 9/8/22 2:44 PM, Jason Gunthorpe wrote:
> > The basic issue is that the iommu_group is being used by VFIO after all
> > the device drivers have been removed.
> > 
> > In part this is caused by bad logic inside the iommu core that doesn't
> > sequence removing the device from the group properly, and in another part
> > this is bad logic in VFIO continuing to use device->iommu_group after all
> > VFIO device drivers have been removed.
> > 
> > Fix both situations. Either fix alone should fix the bug reported, but
> > both together bring a nice robust design to this area.
> > 
> > This is a followup from this thread:
> > 
> > https://lore.kernel.org/kvm/20220831201236.77595-1-mjrosato@linux.ibm.com/
> > 
> > Matthew confirmed an earlier version of the series solved the issue, it
> > would be best if he would test this as well to confirm the various changes
> > are still OK.
> 
> FYI I've been running this series (+ the incremental to patch 4 you
> mentioned) against my original repro scenario in a loop overnight,
> looks good.

Thanks Matthew, looks like we need some more time on the last patch
but I think the VFIO ones are OK if Alex wants to pick them before LPC
is over.

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-09 13:25           ` Jason Gunthorpe
@ 2022-09-09 17:57             ` Robin Murphy
  2022-09-09 18:30               ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2022-09-09 17:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On 2022-09-09 14:25, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote:
>> On 2022-09-08 22:43, Jason Gunthorpe wrote:
>>> On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
>>>
>>>> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
>>>> callback, which needs to dereference the group to work, and the current
>>>> domain may also be a non-default one which we can't prevent from
>>>> disappearing racily, that was why :(
>>>
>>> Hum, the issue there is the use of device->iommu_group - but that just
>>> means I didn't split properly. How about this incremental:
>>
>> That did cross my mind, but it's a bit grim.
> 
> Actually, also in my morning, I think it may not even be necessary.
> 
> Keep in mind the start of the series fixes VFIO.
> 
> The bug that S390 is trying to fix is that VFIO didn't put back the
> group ownership, it just left its own iommu_domain attached and called
> release().
> 
> But now, at least for single device groups, VFIO will put owenership
> back and zdev->s390_domain == NULL when we get to release_device()
> 
>> That then only leaves the issue that that domain may still become
>> invalid at any point after the group mutex has been dropped.
> 
> So that is this race:
> 
>          CPU0                         CPU1
>     iommu_release_device(a)
>        __iommu_group_remove_device(a)
> 			         iommu_device_use_default_domain(b)
>                                   iommu_domain_free(domain)
>                                   iommu_release_device(b)
>                                        ops->release_device(b)
>        ops->release_device(a)
>          // Boom, a is still attached to domain :(
> 
> I can't think of how to solve this other than holding the group mutex
> across release_device. See below.

I see a few possibilities:

- Backtrack slightly on its removal, and instead repurpose detach_dev
into a specialised domain cleanup callback, called before or during
iommu_group_remove_device(), with the group mutex held.

- Drivers that hold any kind of internal per-device references to
domains - which is generally the root of this issue in the first place -
can implement proper reference counting, so even if a domain is "freed"
with a device still attached as above, it doesn't actually go away until
release_device(a) cleans up the final dangling reference. I suggested
the core doing this generically, but on reflection I think it's actually
a lot more straightforward as a driver-internal thing.

- Drivers that basically just keep a list of devices in the domain and
need to do a list_del() in release_device, can also list_del_init() any
still-attached devices in domain_free, with a simple per-instance or
global lock to serialise the two.

>>> And to your other question, the reason I split the function is because
>>> I couldn't really say WTF iommu_group_remove_device() was supposed to
>>> do. The __ version make ssense as part of the remove_device, due to
>>> the sequencing with ops->release()
>>>
>>> But the other one doesn't have that. So I want to put in a:
>>>
>>>      WARN_ON(group->blocking_domain || group->default_domain);
>>>
>>> Because calling it after those domains are allocated looks broken to
>>> me.
>>
>> I might be misunderstanding, but that sounds backwards - if a real device is
>> being hotplugged out, we absolutely expect that to happen *after* its
>> default domain has been set up.
> 
> See below for what I mean
> 
> iommu_group_remove_device() doesn't work as an API because it has no
> way to tell the device to stop using the domain we are about to free.
> 
> So it should assert that there is no domain to worry about. For the
> vfio and power case there is no domain because they don't use iommu
> drivers

Ah, I see it now - if we think it's a usage error for any current API
user to allow a device to be removed while still attached to a non-
default domain, then we can just throw our hands up at that, and
mitigate for the default domain case that we *can* control. I'm not 100%
convinced there might not be some niche non-uAPI case for skipping a
detach because you know you're tearing down your device and domain at the
same time, but I'm inclined to agree that we can worry about that if and
when it does ever come up.

If so, I reckon it should be about as as easy as this (untested).

Cheers,
Robin.

----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9fbe5d067473..760d9bd3ad66 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -396,17 +396,25 @@ int iommu_probe_device(struct device *dev)
  void iommu_release_device(struct device *dev)
  {
  	const struct iommu_ops *ops;
+	struct iommu_group *group;
  
  	if (!dev->iommu)
  		return;
  
  	iommu_device_unlink(dev->iommu->iommu_dev, dev);
  
+	/*
+	 * Some drivers track a device's current domain internally and may
+	 * dereference it to clean up in release_device. If a default domain
+	 * exists, hold a reference to ensure it stays around long enough.
+	 */
+	group = iommu_group_get(dev);
+	iommu_group_remove_device(dev);
  	ops = dev_iommu_ops(dev);
  	if (ops->release_device)
  		ops->release_device(dev);
  
-	iommu_group_remove_device(dev);
+	iommu_group_put(group);
  	module_put(ops->owner);
  	dev_iommu_free(dev);
  }
@@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
  	dev_info(dev, "Removing from iommu group %d\n", group->id);
  
  	mutex_lock(&group->mutex);
+	if (WARN_ON(group->domain != group->default_domain &&
+		    group->domain != group->blocking_domain)) {
+		if (group->default_domain)
+			__iommu_attach_device(group->default_domain, dev);
+		else
+			__iommu_detach_device(group->domain, dev);
+	}
+
  	list_for_each_entry(tmp_device, &group->devices, list) {
  		if (tmp_device->dev == dev) {
  			device = tmp_device;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-09 17:57             ` Robin Murphy
@ 2022-09-09 18:30               ` Jason Gunthorpe
  2022-09-09 19:55                 ` Robin Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-09 18:30 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On Fri, Sep 09, 2022 at 06:57:59PM +0100, Robin Murphy wrote:

> > > That then only leaves the issue that that domain may still become
> > > invalid at any point after the group mutex has been dropped.
> > 
> > So that is this race:
> > 
> >          CPU0                         CPU1
> >     iommu_release_device(a)
> >        __iommu_group_remove_device(a)
> > 			         iommu_device_use_default_domain(b)
> >                                   iommu_domain_free(domain)
> >                                   iommu_release_device(b)
> >                                        ops->release_device(b)
> >        ops->release_device(a)
> >          // Boom, a is still attached to domain :(
> > 
> > I can't think of how to solve this other than holding the group mutex
> > across release_device. See below.
> 
> I see a few possibilities:
> 
> - Backtrack slightly on its removal, and instead repurpose detach_dev
> into a specialised domain cleanup callback, called before or during
> iommu_group_remove_device(), with the group mutex held.

See below for why that is somewhat troublesome..
 
> - Drivers that hold any kind of internal per-device references to
> domains - which is generally the root of this issue in the first place -
> can implement proper reference counting, so even if a domain is "freed"
> with a device still attached as above, it doesn't actually go away until
> release_device(a) cleans up the final dangling reference. I suggested
> the core doing this generically, but on reflection I think it's actually
> a lot more straightforward as a driver-internal thing.

Isn't this every driver though? Like every single driver implementing
an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
iommu_domain - minimally to point the HW to the IOPTEs it stores.

> - Drivers that basically just keep a list of devices in the domain and
> need to do a list_del() in release_device, can also list_del_init() any
> still-attached devices in domain_free, with a simple per-instance or
> global lock to serialise the two.

Compared to just locking ops->release_device() these all seem more
complicated?

IMHO the core code should try to protect the driver from racing
release with anything else.

Do you know a reason not to hold the group mutex across
release_device? I think that is the most straightforward and
future proof.

Arguably all the device ops should be serialized under the group
mutex.

> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>  	dev_info(dev, "Removing from iommu group %d\n", group->id);
>  	mutex_lock(&group->mutex);
> +	if (WARN_ON(group->domain != group->default_domain &&
> +		    group->domain != group->blocking_domain)) {

This will false trigger, if there are two VFIO devices then the group
will remained owned when we unplug one just of them, but the group's domain
will be a VFIO owned domain. 

It is why I put the list_empty() protection, as the test only works if
it is the last device.

> +		if (group->default_domain)
> +			__iommu_attach_device(group->default_domain, dev);
> +		else
> +			__iommu_detach_device(group->domain, dev);
> +	}

This was very appealing, but I rejected it because it is too difficult
to support multi-device groups that share the RID.

In that case we expect that the first attach/detach of a device on the
shared RID will reconfigure the iommu and the attach/deatch of all the
other devices on the group with the same parameters will be a NOP.

So in a VFIO configuration where two drivers are bound to a single
group with shared RID and we unplug one device, this will rebind the
shared RID and thus the entire group to blocking/default and break the
still running VFIO on the second device.

The device centric interface only works if we always apply the
operation to every device in the group..

This is why I kept it as ops->release_device() with an implicit detach
of the current domain inside the driver. release_device() has that
special meaning of 'detach the domain but do not change a shared RID'

And it misses the logic to WARN_ON if a domain is set and an external
entity wrongly uses iommu_group_remove_device()..

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-09 18:30               ` Jason Gunthorpe
@ 2022-09-09 19:55                 ` Robin Murphy
  2022-09-09 23:45                   ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2022-09-09 19:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On 2022-09-09 19:30, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 06:57:59PM +0100, Robin Murphy wrote:
> 
>>>> That then only leaves the issue that that domain may still become
>>>> invalid at any point after the group mutex has been dropped.
>>>
>>> So that is this race:
>>>
>>>           CPU0                         CPU1
>>>      iommu_release_device(a)
>>>         __iommu_group_remove_device(a)
>>> 			         iommu_device_use_default_domain(b)
>>>                                    iommu_domain_free(domain)
>>>                                    iommu_release_device(b)
>>>                                         ops->release_device(b)
>>>         ops->release_device(a)
>>>           // Boom, a is still attached to domain :(
>>>
>>> I can't think of how to solve this other than holding the group mutex
>>> across release_device. See below.
>>
>> I see a few possibilities:
>>
>> - Backtrack slightly on its removal, and instead repurpose detach_dev
>> into a specialised domain cleanup callback, called before or during
>> iommu_group_remove_device(), with the group mutex held.
> 
> See below for why that is somewhat troublesome..
>   
>> - Drivers that hold any kind of internal per-device references to
>> domains - which is generally the root of this issue in the first place -
>> can implement proper reference counting, so even if a domain is "freed"
>> with a device still attached as above, it doesn't actually go away until
>> release_device(a) cleans up the final dangling reference. I suggested
>> the core doing this generically, but on reflection I think it's actually
>> a lot more straightforward as a driver-internal thing.
> 
> Isn't this every driver though? Like every single driver implementing
> an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
> iommu_domain - minimally to point the HW to the IOPTEs it stores.

Um, no? Domain ops get the domain passed in as an argument, which is far 
from hidden, and if any driver implemented them to ignore that argument 
and operate on something else it would be stupid and broken. Note I said 
"per-device reference", meaning things like s390's zpci_dev->s390_domain 
and SMMUv3's dev->iommu->priv->domain. It's only those references that 
are reachable from release_device - outside the normal domain lifecycle 
- which are problematic.

>> - Drivers that basically just keep a list of devices in the domain and
>> need to do a list_del() in release_device, can also list_del_init() any
>> still-attached devices in domain_free, with a simple per-instance or
>> global lock to serialise the two.
> 
> Compared to just locking ops->release_device() these all seem more
> complicated?

Well, yes, but they are still potentially-viable examples of the 
alternative solutions you said you couldn't think of ;)

> IMHO the core code should try to protect the driver from racing
> release with anything else.
> 
> Do you know a reason not to hold the group mutex across
> release_device? I think that is the most straightforward and
> future proof.

Yes, the ones documented in the code and already discussed here. The 
current functional ones aren't particularly *good* reasons, but unless 
and until they can all be cleaned up they are what they are.

> Arguably all the device ops should be serialized under the group
> mutex.

Maybe once groups and default domains are used consistently everywhere. 
And notwithstanding that half the ops have no association with a group, 
are needed before or as part of obtaining a group, or were explicitly 
intended to allow calling back into other APIs that might lock the 
relevant group :P

>> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>>   	dev_info(dev, "Removing from iommu group %d\n", group->id);
>>   	mutex_lock(&group->mutex);
>> +	if (WARN_ON(group->domain != group->default_domain &&
>> +		    group->domain != group->blocking_domain)) {
> 
> This will false trigger, if there are two VFIO devices then the group
> will remained owned when we unplug one just of them, but the group's domain
> will be a VFIO owned domain.

As opposed to currently, where most drivers' release_device will blindly 
detach/disable the RID in some fashion so the other device would 
suddenly blow up anyway? A warning of the impending disaster might be 
quite informative, I reckon.

> It is why I put the list_empty() protection, as the test only works if
> it is the last device.
> 
>> +		if (group->default_domain)
>> +			__iommu_attach_device(group->default_domain, dev);
>> +		else
>> +			__iommu_detach_device(group->domain, dev);
>> +	}
> 
> This was very appealing, but I rejected it because it is too difficult
> to support multi-device groups that share the RID.
> 
> In that case we expect that the first attach/detach of a device on the
> shared RID will reconfigure the iommu and the attach/deatch of all the
> other devices on the group with the same parameters will be a NOP.
> 
> So in a VFIO configuration where two drivers are bound to a single
> group with shared RID and we unplug one device, this will rebind the
> shared RID and thus the entire group to blocking/default and break the
> still running VFIO on the second device.

As above, I am supremely confident that nobody has ever done that 
because it is already broken on everything that matters.

(It *will* actually work on SMMUv2 because SMMUv2 comprehensively 
handles StreamID-level aliasing beyond what pci_device_group() covers, 
which I remain rather proud of)

> The device centric interface only works if we always apply the
> operation to every device in the group..
> 
> This is why I kept it as ops->release_device() with an implicit detach
> of the current domain inside the driver. release_device() has that
> special meaning of 'detach the domain but do not change a shared RID'

If you want to rely on that notion, you'll need to tell all the major 
drivers about it first, I'm afraid.

> And it misses the logic to WARN_ON if a domain is set and an external
> entity wrongly uses iommu_group_remove_device()..

Huh? An external fake group couldn't have a default domain or blocking 
domain, thus any non-NULL domain will not compare equal to either, so if 
that could happen it will warn, and then most likely crash. I did think 
briefly about trying to make it not crash, but then I remembered that 
fake groups from external callers also aren't backed by IOMMU API 
drivers so have no way to allocate or attach domains either, so in fact 
it cannot happen at all under any circumstances that are worth reasoning 
about.

Yes, some nefarious driver could call iommu_group_remove_device() on 
random devices. It could also call kfree(dev->iommu_group). Or 
kfree(iommu_group_remove_device). Fundamentally-broken kernel code can 
crash the kernel, whoop de do.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-09 19:55                 ` Robin Murphy
@ 2022-09-09 23:45                   ` Jason Gunthorpe
  2022-09-12 11:13                     ` Robin Murphy
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-09 23:45 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:

> > Isn't this every driver though? Like every single driver implementing
> > an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
> > iommu_domain - minimally to point the HW to the IOPTEs it stores.
> 
> Um, no? Domain ops get the domain passed in as an argument, which is far
> from hidden, and if any driver implemented them to ignore that argument and
> operate on something else it would be stupid and broken. Note I said
> "per-device reference", meaning things like s390's zpci_dev->s390_domain and
> SMMUv3's dev->iommu->priv->domain. It's only those references that are
> reachable from release_device - outside the normal domain lifecycle - which
> are problematic.

If the plan is to make the domain refcounted and then allow a 'put' on
it before we reach release_device() then it means every driver needs
to hold a 'get' on the domain while it is programmed into the HW.

Because the hw will still be touching memory that could be freed by an
iommu_domain_free(). By "hidden" reference I mean the HW walkers are
touching memory that would be freed - ie kasn won't see it.

> > Do you know a reason not to hold the group mutex across
> > release_device? I think that is the most straightforward and
> > future proof.
> 
> Yes, the ones documented in the code and already discussed here. The current
> functional ones aren't particularly *good* reasons, but unless and until
> they can all be cleaned up they are what they are.

Uh, I feel like I missed part of the conversation - I don't know what
this list is..

I did look. I'm looking for a call chain that goes from
release_device() into a core function that grabs the group->mutex.

There is a comment in iommu_group_store_type() that suggest there is a
recursion but doesn't say what it is. It was an Intel person who wrote
the comment so I more carefully checked the intel driver and didn't
find a call path, but it is big and complicated..

There is a commment in iommu_change_dev_def_domain() about recursion
on probe_finalize(), not relevant here, AFAICT.

So, I did two approaches, one I checked quickly through every
release_device looking for something.

Then I looked across the entire exported driver facing API and focused
on callchains going back toward the core from APIs that might be
trouble and audited them almost completely.

These APIs do not take the lock, so completely safe:
 iommu_group_alloc
 iommu_group_set_iommudata
 iommu_group_set_name
 iommu_group_get
 iommu_group_ref_get
 iommu_group_put
 iommu_get_domain_for_dev
 iommu_fwspec_free
 iommu_fwspec_init
 iommu_fwspec_add_ids
 iommu_put_resv_regions (called from release_device)

Does take the lock. Checked all of these in all the drivers, didn't
find an obvious path to release_device:
 iommu_group_remove_device
 iommu_group_for_each_dev
 iommu_attach_device
 iommu_detach_device
 iommu_attach_group
 iommu_detach_group
 bus_set_iommu

Can't tell if these take lock due to driver callbacks, but couldn't
find them in release, and doesn't make sense they would be there:
 iommu_device_register
 iommu_device_unregister
 iommu_domain_alloc
 iommu_domain_free

Rest of the exported interface touching the drivers - didn't carefully
check if they are using the lock - however by name seems unlikely
these are in release_device():
 iommu_register_device_fault_handler
 iommu_unregister_device_fault_handler
 iommu_report_device_fault
 iommu_page_response
 report_iommu_fault
 iommu_iova_to_phys
 iommu_map
 iommu_unmap
 iommu_alloc_resv_region
 iommu_present
 iommu_capable
 iommu_default_passthrough

It is big and complicated, so I wouldn't stake my life on it, but it
seems worth investigating further.

Could the recursion have been cleaned up already?

> > > @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
> > >   	dev_info(dev, "Removing from iommu group %d\n", group->id);
> > >   	mutex_lock(&group->mutex);
> > > +	if (WARN_ON(group->domain != group->default_domain &&
> > > +		    group->domain != group->blocking_domain)) {
> > 
> > This will false trigger, if there are two VFIO devices then the group
> > will remained owned when we unplug one just of them, but the group's domain
> > will be a VFIO owned domain.
> 
> As opposed to currently, where most drivers' release_device will blindly
> detach/disable the RID in some fashion so the other device would suddenly
> blow up anyway? 

Er, I think it is OK today, in the non-shared case. If the RID isn't
shared then each device in the group is independent, so most drivers,
most of the time, should only effect the RID release_device() is
called on, while this warning will always trigger for any multi-device
group.

> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
> StreamID-level aliasing beyond what pci_device_group() covers, which I
> remain rather proud of)

This is why I prefered not to explicitly change the domain, because at
least if someone did write a non-buggy driver it doesn't get wrecked -
and making a non-buggy driver is at least allowed by the API.

> > And it misses the logic to WARN_ON if a domain is set and an external
> > entity wrongly uses iommu_group_remove_device()..
> 
> Huh? An external fake group couldn't have a default domain or blocking
> domain, thus any non-NULL domain will not compare equal to either, so if
> that could happen it will warn, and then most likely crash. I did think
> briefly about trying to make it not crash, but then I remembered that fake
> groups from external callers also aren't backed by IOMMU API drivers so have
> no way to allocate or attach domains either, so in fact it cannot happen at
> all under any circumstances that are worth reasoning about.

I mean specificaly thing like FSL is doing where it is a real driver
calling this API and the test of 'group->domain == NULL' is the more
robust precondition.

So, IDK, I would perfer to understand where we hit a group mutex
recursion before rejecting that path... If you know specifics please
share, otherwise maybe we should stick in a lockdep check there and
see if anything hits?

But I'm off to LPC so I probably won't write anything more thoughtful
on this for a while.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-09 23:45                   ` Jason Gunthorpe
@ 2022-09-12 11:13                     ` Robin Murphy
  2022-09-22 16:56                       ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Robin Murphy @ 2022-09-12 11:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On 2022-09-10 00:45, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:
> 
>>> Isn't this every driver though? Like every single driver implementing
>>> an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
>>> iommu_domain - minimally to point the HW to the IOPTEs it stores.
>>
>> Um, no? Domain ops get the domain passed in as an argument, which is far
>> from hidden, and if any driver implemented them to ignore that argument and
>> operate on something else it would be stupid and broken. Note I said
>> "per-device reference", meaning things like s390's zpci_dev->s390_domain and
>> SMMUv3's dev->iommu->priv->domain. It's only those references that are
>> reachable from release_device - outside the normal domain lifecycle - which
>> are problematic.
> 
> If the plan is to make the domain refcounted and then allow a 'put' on
> it before we reach release_device() then it means every driver needs
> to hold a 'get' on the domain while it is programmed into the HW.
> 
> Because the hw will still be touching memory that could be freed by an
> iommu_domain_free(). By "hidden" reference I mean the HW walkers are
> touching memory that would be freed - ie kasn won't see it.

As far as I'm concerned we're dealing purely with the case where 
release_device races with attaching back to the default domain *and* the 
driver has some reason for release_device to poke at what it thinks the 
currently-attached domain is. Anyone who frees a domain while it's still 
actually live deserves whatever they get; it would be thoroughly 
impractical to attempt to mitigate for that kind of silliness.

>>> Do you know a reason not to hold the group mutex across
>>> release_device? I think that is the most straightforward and
>>> future proof.
>>
>> Yes, the ones documented in the code and already discussed here. The current
>> functional ones aren't particularly *good* reasons, but unless and until
>> they can all be cleaned up they are what they are.
> 
> Uh, I feel like I missed part of the conversation - I don't know what
> this list is..

s390 (remember how we got here?) calls iommu_get_domain_for_dev(). 
ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps 
technically should), to undo the corresponding attach from 
probe_finalize - apologies for misremembering which way round the 
comments were.

>>>> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>>>>    	dev_info(dev, "Removing from iommu group %d\n", group->id);
>>>>    	mutex_lock(&group->mutex);
>>>> +	if (WARN_ON(group->domain != group->default_domain &&
>>>> +		    group->domain != group->blocking_domain)) {
>>>
>>> This will false trigger, if there are two VFIO devices then the group
>>> will remained owned when we unplug one just of them, but the group's domain
>>> will be a VFIO owned domain.
>>
>> As opposed to currently, where most drivers' release_device will blindly
>> detach/disable the RID in some fashion so the other device would suddenly
>> blow up anyway?
> 
> Er, I think it is OK today, in the non-shared case. If the RID isn't
> shared then each device in the group is independent, so most drivers,
> most of the time, should only effect the RID release_device() is
> called on, while this warning will always trigger for any multi-device
> group.

Oh, apparently I managed to misinterpret this as the two *aliasing* 
devices case, sorry. Indeed it is overly conservative for that. I think 
the robust way to detect bad usage is actually not via the group at all, 
but for iommu_device_{un}use_default_domain() to also maintain a 
per-device ownership flag, then we warn if a device is released with 
that still set.

>> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
>> StreamID-level aliasing beyond what pci_device_group() covers, which I
>> remain rather proud of)
> 
> This is why I prefered not to explicitly change the domain, because at
> least if someone did write a non-buggy driver it doesn't get wrecked -
> and making a non-buggy driver is at least allowed by the API.

Detaching back to the default domain still seems like it's *always* the 
right thing to do at this point, even when it should not be warned about 
as above. As I say it *does* work on non-buggy drivers, and making this 
whole domain use-after-free race a fundamental non-issue is attractive.

>>> And it misses the logic to WARN_ON if a domain is set and an external
>>> entity wrongly uses iommu_group_remove_device()..
>>
>> Huh? An external fake group couldn't have a default domain or blocking
>> domain, thus any non-NULL domain will not compare equal to either, so if
>> that could happen it will warn, and then most likely crash. I did think
>> briefly about trying to make it not crash, but then I remembered that fake
>> groups from external callers also aren't backed by IOMMU API drivers so have
>> no way to allocate or attach domains either, so in fact it cannot happen at
>> all under any circumstances that are worth reasoning about.
> 
> I mean specificaly thing like FSL is doing where it is a real driver
> calling this API and the test of 'group->domain == NULL' is the more
> robust precondition.

Having looked a bit closer, I think I get what PAMU is doing - kind of 
impedance-matching between pci_device_group() and its own non-ACS form 
of isolation, and possibly also a rather roundabout way of propagating 
DT data from the PCI controller node up into the PCI hierarchy. Both 
could quite likely be done in a more straightforward manner these days 
(and TBH I'm not convinced it works at all since it doesn't appear to 
match the actual DT binding), but either way I'm fairly confident we 
needn't worry about it.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group *
  2022-09-08 18:45 ` [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group * Jason Gunthorpe
@ 2022-09-20 19:32   ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2022-09-20 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, iommu,
	Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Robin Murphy

On 9/8/22 2:45 PM, Jason Gunthorpe wrote:
> The iommu_group comes from the struct device that a driver has been bound
> to and then created a struct vfio_device against. To keep the iommu layer
> sane we want to have a simple rule that only an attached driver should be
> using the iommu API. Particularly only an attached driver should hold
> ownership.
> 
> In VFIO's case since it uses the group APIs and it shares between
> different drivers it is a bit more complicated, but the principle still
> holds.
> 
> Solve this by waiting for all users of the vfio_group to stop before
> allowing vfio_unregister_group_dev() to complete. This is done with a new
> completion to know when the users go away and an additional refcount to
> keep track of how many device drivers are sharing the vfio group. The last
> driver to be unregistered will clean up the group.
> 
> This solves crashes in the S390 iommu driver that come because VFIO ends
> up racing releasing ownership (which attaches the default iommu_domain to
> the device) with the removal of that same device from the iommu
> driver. This is a side case that iommu drivers should not have to cope
> with.
> 
>    iommu driver failed to attach the default/blocking domain
>    WARNING: CPU: 0 PID: 5082 at drivers/iommu/iommu.c:1961 iommu_detach_group+0x6c/0x80
>    Modules linked in: macvtap macvlan tap vfio_pci vfio_pci_core irqbypass vfio_virqfd kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib sunrpc ib_uverbs ism smc uvdevice ib_core s390_trng eadm_sch tape_3590 tape tape_class vfio_ccw mdev vfio_iommu_type1 vfio zcrypt_cex4 sch_fq_codel configfs ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 nvme sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common nvme_core zfcp scsi_transport_fc pkey zcrypt rng_core autofs4
>    CPU: 0 PID: 5082 Comm: qemu-system-s39 Tainted: G        W          6.0.0-rc3 #5
>    Hardware name: IBM 3931 A01 782 (LPAR)
>    Krnl PSW : 0704c00180000000 000000095bb10d28 (iommu_detach_group+0x70/0x80)
>               R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>    Krnl GPRS: 0000000000000001 0000000900000027 0000000000000039 000000095c97ffe0
>               00000000fffeffff 00000009fc290000 00000000af1fda50 00000000af590b58
>               00000000af1fdaf0 0000000135c7a320 0000000135e52258 0000000135e52200
>               00000000a29e8000 00000000af590b40 000000095bb10d24 0000038004b13c98
>    Krnl Code: 000000095bb10d18: c020003d56fc        larl    %r2,000000095c2bbb10
>                           000000095bb10d1e: c0e50019d901        brasl   %r14,000000095be4bf20
>                          #000000095bb10d24: af000000            mc      0,0
>                          >000000095bb10d28: b904002a            lgr     %r2,%r10
>                           000000095bb10d2c: ebaff0a00004        lmg     %r10,%r15,160(%r15)
>                           000000095bb10d32: c0f4001aa867        brcl    15,000000095be65e00
>                           000000095bb10d38: c004002168e0        brcl    0,000000095bf3def8
>                           000000095bb10d3e: eb6ff0480024        stmg    %r6,%r15,72(%r15)
>    Call Trace:
>     [<000000095bb10d28>] iommu_detach_group+0x70/0x80
>    ([<000000095bb10d24>] iommu_detach_group+0x6c/0x80)
>     [<000003ff80243b0e>] vfio_iommu_type1_detach_group+0x136/0x6c8 [vfio_iommu_type1]
>     [<000003ff80137780>] __vfio_group_unset_container+0x58/0x158 [vfio]
>     [<000003ff80138a16>] vfio_group_fops_unl_ioctl+0x1b6/0x210 [vfio]
>    pci 0004:00:00.0: Removing from iommu group 4
>     [<000000095b5b62e8>] __s390x_sys_ioctl+0xc0/0x100
>     [<000000095be5d3b4>] __do_syscall+0x1d4/0x200
>     [<000000095be6c072>] system_call+0x82/0xb0
>    Last Breaking-Event-Address:
>     [<000000095be4bf80>] __warn_printk+0x60/0x68
> 
> It reflects that domain->ops->attach_dev() failed because the driver has
> already passed the point of destructing the device.
> 
> Fixes: 9ac8545199a1 ("iommu: Fix use-after-free in iommu_release_device")
> Reported-by: Matthew Rosato <mjrosato@linux.ibm.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

I've been running with only the first 3 patches in this series (the vfio changes) and can confirm that they resolve the reported issue for me.

Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> # s390

...

> +static void vfio_group_remove(struct vfio_group *group)
> +{
> +	/* Pairs with vfio_create_group() */

Nit: vfio_create_group() no longer exists as of patch 1



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] vfio: Simplify vfio_create_group()
  2022-09-08 18:44 ` [PATCH 1/4] vfio: Simplify vfio_create_group() Jason Gunthorpe
@ 2022-09-20 19:45   ` Matthew Rosato
  0 siblings, 0 replies; 24+ messages in thread
From: Matthew Rosato @ 2022-09-20 19:45 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson, Cornelia Huck, iommu,
	Joerg Roedel, kvm, Will Deacon
  Cc: Qian Cai, Joerg Roedel, Marek Szyprowski, Robin Murphy

On 9/8/22 2:44 PM, Jason Gunthorpe wrote:
> The vfio.group_lock is now only used to serialize vfio_group creation
> and destruction, we don't need a micro-optimization of searching,
> unlocking, then allocating and searching again. Just hold the lock
> the whole time.
> 
> Rename the function to 'vfio_get_group()' to reflect that it doesn't
> always create something.
> move
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>

> ---
>  drivers/vfio/vfio_main.c | 48 ++++++++++++++++------------------------
>  1 file changed, 19 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 77264d836d5200..4ab13808b536e1 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -319,17 +319,6 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
>  	return NULL;
>  }
>  
> -static struct vfio_group *
> -vfio_group_get_from_iommu(struct iommu_group *iommu_group)
> -{
> -	struct vfio_group *group;
> -
> -	mutex_lock(&vfio.group_lock);
> -	group = __vfio_group_get_from_iommu(iommu_group);
> -	mutex_unlock(&vfio.group_lock);
> -	return group;
> -}
> -
>  static void vfio_group_release(struct device *dev)
>  {
>  	struct vfio_group *group = container_of(dev, struct vfio_group, dev);
> @@ -376,16 +365,26 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
>  	return group;
>  }
>  
> -static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
> -		enum vfio_group_type type)
> +/*
> + * Return a struct vfio_group * for the given iommu_group. If no vfio_group
> + * already exists then create a new one.
> + */
> +static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
> +					 enum vfio_group_type type)
>  {
>  	struct vfio_group *group;
>  	struct vfio_group *ret;
>  	int err;
>  
> -	group = vfio_group_alloc(iommu_group, type);
> -	if (IS_ERR(group))
> -		return group;
> +	mutex_lock(&vfio.group_lock);
> +
> +	ret = __vfio_group_get_from_iommu(iommu_group);
> +	if (ret)
> +		goto err_unlock;
> +
> +	group = ret = vfio_group_alloc(iommu_group, type);
> +	if (IS_ERR(ret))
> +		goto err_unlock;
>  
>  	err = dev_set_name(&group->dev, "%s%d",
>  			   group->type == VFIO_NO_IOMMU ? "noiommu-" : "",
> @@ -395,13 +394,6 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>  		goto err_put;
>  	}
>  
> -	mutex_lock(&vfio.group_lock);
> -
> -	/* Did we race creating this group? */
> -	ret = __vfio_group_get_from_iommu(iommu_group);
> -	if (ret)
> -		goto err_unlock;
> -
>  	err = cdev_device_add(&group->cdev, &group->dev);
>  	if (err) {
>  		ret = ERR_PTR(err);
> @@ -413,10 +405,10 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
>  	mutex_unlock(&vfio.group_lock);
>  	return group;
>  
> -err_unlock:
> -	mutex_unlock(&vfio.group_lock);
>  err_put:
>  	put_device(&group->dev);
> +err_unlock:
> +	mutex_unlock(&vfio.group_lock);
>  	return ret;
>  }
>  
> @@ -514,7 +506,7 @@ static struct vfio_group *vfio_noiommu_group_alloc(struct device *dev,
>  	if (ret)
>  		goto out_put_group;
>  
> -	group = vfio_create_group(iommu_group, type);
> +	group = vfio_get_group(iommu_group, type);
>  	if (IS_ERR(group)) {
>  		ret = PTR_ERR(group);
>  		goto out_remove_device;
> @@ -564,9 +556,7 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
>  		return ERR_PTR(-EINVAL);
>  	}
>  
> -	group = vfio_group_get_from_iommu(iommu_group);
> -	if (!group)
> -		group = vfio_create_group(iommu_group, VFIO_IOMMU);
> +	group = vfio_get_group(iommu_group, VFIO_IOMMU);
>  
>  	/* The vfio_group holds a reference to the iommu_group */
>  	iommu_group_put(iommu_group);


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
  2022-09-12 11:13                     ` Robin Murphy
@ 2022-09-22 16:56                       ` Jason Gunthorpe
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 16:56 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Alex Williamson, Cornelia Huck, iommu, Joerg Roedel, kvm,
	Will Deacon, Qian Cai, Joerg Roedel, Marek Szyprowski,
	Matthew Rosato

On Mon, Sep 12, 2022 at 12:13:25PM +0100, Robin Murphy wrote:

> > > Yes, the ones documented in the code and already discussed here. The current
> > > functional ones aren't particularly *good* reasons, but unless and until
> > > they can all be cleaned up they are what they are.
> > 
> > Uh, I feel like I missed part of the conversation - I don't know what
> > this list is..
> 
> s390 (remember how we got here?) calls iommu_get_domain_for_dev().

iommu_get_domain_for_dev() doesn't take a lock, and the last try I
showed (see below) ordered things so that it would succeed when called
under release(). AFIACT it is already not a problem.

> ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps
> technically should), to undo the corresponding attach from probe_finalize -
> apologies for misremembering which way round the comments were.

This does eventually deadlock on the group->mutex, so yes this is a
problem!

And I agree mtk_v1 does looks like it has a memory leak.

But, this looks easy enough to fix up. See below

> Oh, apparently I managed to misinterpret this as the two *aliasing* devices
> case, sorry. Indeed it is overly conservative for that. I think the robust
> way to detect bad usage is actually not via the group at all, but for
> iommu_device_{un}use_default_domain() to also maintain a per-device
> ownership flag, then we warn if a device is released with that still set.

It seems a bit hard to implement a per-device flag with VFIO?

> > > (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
> > > StreamID-level aliasing beyond what pci_device_group() covers, which I
> > > remain rather proud of)
> > 
> > This is why I prefered not to explicitly change the domain, because at
> > least if someone did write a non-buggy driver it doesn't get wrecked -
> > and making a non-buggy driver is at least allowed by the API.
> 
> Detaching back to the default domain still seems like it's *always* the
> right thing to do at this point, 

If the RID is aliased it is the wrong thing to do. Calling attach will
wreck the entire alias set.

release is not supposed to do that, buggy drivers excepted.

> Having looked a bit closer, I think I get what PAMU is doing - kind of
> impedance-matching between pci_device_group() and its own non-ACS form of
> isolation, and possibly also a rather roundabout way of propagating DT data
> from the PCI controller node up into the PCI hierarchy. Both could quite
> likely be done in a more straightforward manner these days (and TBH I'm not
> convinced it works at all since it doesn't appear to match the actual DT
> binding), but either way I'm fairly confident we needn't worry about it.

Yes, this is what I thought too. Arguably it is wrong to try and
rework the groups once created, it should be creating the groups to
cover what it wants from the start, and it shouldn't leave the
controller without a group.

So something like the below is what I'm thinking

Thanks,
Jason

diff --git a/arch/arm/include/asm/dma-iommu.h b/arch/arm/include/asm/dma-iommu.h
index fe9ef6f79e9cfe..ea7198a1786186 100644
--- a/arch/arm/include/asm/dma-iommu.h
+++ b/arch/arm/include/asm/dma-iommu.h
@@ -31,6 +31,7 @@ void arm_iommu_release_mapping(struct dma_iommu_mapping *mapping);
 int arm_iommu_attach_device(struct device *dev,
 					struct dma_iommu_mapping *mapping);
 void arm_iommu_detach_device(struct device *dev);
+void arm_iommu_release_device(struct device *dev);
 
 #endif /* __KERNEL__ */
 #endif
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 089c9c644cce2a..1694bebb3aa4dc 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1697,13 +1697,15 @@ int arm_iommu_attach_device(struct device *dev,
 EXPORT_SYMBOL_GPL(arm_iommu_attach_device);
 
 /**
- * arm_iommu_detach_device
+ * arm_iommu_release_device
  * @dev: valid struct device pointer
  *
- * Detaches the provided device from a previously attached map.
- * This overwrites the dma_ops pointer with appropriate non-IOMMU ops.
+ * This is like arm_iommu_detach_device() except it handles the special
+ * case of being called under an iommu driver's release operation. In this
+ * case the driver must have already detached the device from any attached
+ * domain before calling this function.
  */
-void arm_iommu_detach_device(struct device *dev)
+void arm_iommu_release_device(struct device *dev)
 {
 	struct dma_iommu_mapping *mapping;
 
@@ -1713,13 +1715,34 @@ void arm_iommu_detach_device(struct device *dev)
 		return;
 	}
 
-	iommu_detach_device(mapping->domain, dev);
 	kref_put(&mapping->kref, release_iommu_mapping);
 	to_dma_iommu_mapping(dev) = NULL;
 	set_dma_ops(dev, NULL);
 
 	pr_debug("Detached IOMMU controller from %s device.\n", dev_name(dev));
 }
+EXPORT_SYMBOL_GPL(arm_iommu_release_device);
+
+/**
+ * arm_iommu_detach_device
+ * @dev: valid struct device pointer
+ *
+ * Detaches the provided device from a previously attached map.
+ * This overwrites the dma_ops pointer with appropriate non-IOMMU ops.
+ */
+void arm_iommu_detach_device(struct device *dev)
+{
+	struct dma_iommu_mapping *mapping;
+
+	mapping = to_dma_iommu_mapping(dev);
+	if (!mapping) {
+		dev_warn(dev, "Not attached\n");
+		return;
+	}
+
+	iommu_detach_device(mapping->domain, dev);
+	arm_iommu_release_device(dev);
+}
 EXPORT_SYMBOL_GPL(arm_iommu_detach_device);
 
 static void arm_setup_iommu_dma_ops(struct device *dev, u64 dma_base, u64 size,
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 780fb70715770d..61444b2a11e217 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -90,6 +90,10 @@ static int iommu_create_device_direct_mappings(struct iommu_group *group,
 static struct iommu_group *iommu_group_get_for_dev(struct device *dev);
 static ssize_t iommu_group_store_type(struct iommu_group *group,
 				      const char *buf, size_t count);
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev);
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device);
 
 #define IOMMU_GROUP_ATTR(_name, _mode, _show, _store)		\
 struct iommu_group_attribute iommu_group_attr_##_name =		\
@@ -330,18 +334,50 @@ int iommu_probe_device(struct device *dev)
 
 void iommu_release_device(struct device *dev)
 {
+	struct iommu_group *group = dev->iommu_group;
 	const struct iommu_ops *ops;
+	struct group_device *device;
 
 	if (!dev->iommu)
 		return;
 
 	iommu_device_unlink(dev->iommu->iommu_dev, dev);
 
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
 	ops = dev_iommu_ops(dev);
+
+	/*
+	 * If the group has become empty then ownership must have been released,
+	 * and the current domain must be set back to NULL or the default
+	 * domain.
+	 */
+	if (list_empty(&group->devices))
+		WARN_ON(group->owner_cnt ||
+			group->domain != group->default_domain);
+
+	/*
+	 * release_device() must stop using any attached domain on the device.
+	 * If there are still other devices in the group they are not effected
+	 * by this callback.
+	 *
+	 * The IOMMU driver must set the device to either an identity or
+	 * blocking translation and stop using any domain pointer, as it is
+	 * going to be freed.
+	 */
 	if (ops->release_device)
 		ops->release_device(dev);
+	mutex_unlock(&group->mutex);
+
+	__iommu_group_remove_device_sysfs(group, device);
+
+	/*
+	 * This will eventually call iommu_group_release() which will free the
+	 * iommu_domains.
+	 */
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
 
-	iommu_group_remove_device(dev);
 	module_put(ops->owner);
 	dev_iommu_free(dev);
 }
@@ -939,44 +975,71 @@ int iommu_group_add_device(struct iommu_group *group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_group_add_device);
 
-/**
- * iommu_group_remove_device - remove a device from it's current group
- * @dev: device to be removed
- *
- * This function is called by an iommu driver to remove the device from
- * it's current group.  This decrements the iommu group reference count.
- */
-void iommu_group_remove_device(struct device *dev)
+static struct group_device *
+__iommu_group_remove_device(struct iommu_group *group, struct device *dev)
 {
-	struct iommu_group *group = dev->iommu_group;
-	struct group_device *tmp_device, *device = NULL;
+	struct group_device *device;
+
+	lockdep_assert_held(&group->mutex);
 
 	if (!group)
-		return;
+		return NULL;
 
 	dev_info(dev, "Removing from iommu group %d\n", group->id);
 
-	mutex_lock(&group->mutex);
-	list_for_each_entry(tmp_device, &group->devices, list) {
-		if (tmp_device->dev == dev) {
-			device = tmp_device;
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
 			list_del(&device->list);
-			break;
+			return device;
 		}
 	}
-	mutex_unlock(&group->mutex);
+	return NULL;
+}
 
+static void __iommu_group_remove_device_sysfs(struct iommu_group *group,
+					      struct group_device *device)
+{
 	if (!device)
 		return;
 
 	sysfs_remove_link(group->devices_kobj, device->name);
-	sysfs_remove_link(&dev->kobj, "iommu_group");
+	sysfs_remove_link(&device->dev->kobj, "iommu_group");
 
-	trace_remove_device_from_group(group->id, dev);
+	trace_remove_device_from_group(group->id, device->dev);
 
 	kfree(device->name);
 	kfree(device);
-	dev->iommu_group = NULL;
+}
+
+/**
+ * iommu_group_remove_device - remove a device from it's current group
+ * @dev: device to be removed
+ *
+ * This function is used by non-iommu drivers to create non-iommu subystem
+ * groups (eg VFIO mdev, SPAPR) Using this from inside an iommu driver is an
+ * abuse. Instead the driver should return the correct group from
+ * ops->device_group()
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct group_device *device;
+
+	/*
+	 * Since we don't do ops->release_device() there is no way for the
+	 * driver to stop using any attached domain before we free it. If a
+	 * domain is attached while this function is called it will eventually
+	 * UAF.
+	 *
+	 * Thus it is only useful for cases like VFIO/SPAPR that don't use an
+	 * iommu driver, or for cases like FSL that don't use default domains.
+	 */
+	WARN_ON(group->domain);
+
+	mutex_lock(&group->mutex);
+	device = __iommu_group_remove_device(group, dev);
+	mutex_unlock(&group->mutex);
+	__iommu_group_remove_device_sysfs(group, device);
 	kobject_put(group->devices_kobj);
 }
 EXPORT_SYMBOL_GPL(iommu_group_remove_device);
diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 1d42084d02767e..f5b9787d22420c 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -302,11 +302,8 @@ static void ipmmu_utlb_enable(struct ipmmu_vmsa_domain *domain,
 /*
  * Disable MMU translation for the microTLB.
  */
-static void ipmmu_utlb_disable(struct ipmmu_vmsa_domain *domain,
-			       unsigned int utlb)
+static void ipmmu_utlb_disable(struct ipmmu_vmsa_device *mmu, unsigned int utlb)
 {
-	struct ipmmu_vmsa_device *mmu = domain->mmu;
-
 	ipmmu_imuctr_write(mmu, utlb, 0);
 	mmu->utlb_ctx[utlb] = IPMMU_CTX_INVALID;
 }
@@ -649,11 +646,11 @@ static void ipmmu_detach_device(struct iommu_domain *io_domain,
 				struct device *dev)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
-	struct ipmmu_vmsa_domain *domain = to_vmsa_domain(io_domain);
+	struct ipmmu_vmsa_device *mmu = to_ipmmu(dev);
 	unsigned int i;
 
 	for (i = 0; i < fwspec->num_ids; ++i)
-		ipmmu_utlb_disable(domain, fwspec->ids[i]);
+		ipmmu_utlb_disable(mmu, fwspec->ids[i]);
 
 	/*
 	 * TODO: Optimize by disabling the context when no device is attached.
@@ -849,7 +846,8 @@ static void ipmmu_probe_finalize(struct device *dev)
 
 static void ipmmu_release_device(struct device *dev)
 {
-	arm_iommu_detach_device(dev);
+	ipmmu_detach_device(NULL, dev);
+	arm_iommu_release_device(dev);
 }
 
 static struct iommu_group *ipmmu_find_group(struct device *dev)

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group()
  2022-09-08 18:44 ` [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group() Jason Gunthorpe
@ 2022-09-22 19:10   ` Alex Williamson
  2022-09-22 19:36     ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2022-09-22 19:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon, Qian Cai,
	Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

On Thu,  8 Sep 2022 15:44:59 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> __vfio_register_dev() has a bit of code to sanity check if an (existing)
> group is not corrupted by having two copies of the same struct device in
> it. This should be impossible.
> 
> It then has some complicated error unwind to uncreate the group.
> 
> Instead check if the existing group is sane at the same time we locate
> it. If a bug is found then there is no error unwind, just simply fail
> allocation.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/vfio/vfio_main.c | 79 ++++++++++++++++++----------------------
>  1 file changed, 36 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 4ab13808b536e1..ba8b6bed12c7e7 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -306,15 +306,15 @@ static void vfio_container_put(struct vfio_container *container)
>   * Group objects - create, release, get, put, search
>   */
>  static struct vfio_group *
> -__vfio_group_get_from_iommu(struct iommu_group *iommu_group)
> +vfio_group_find_from_iommu(struct iommu_group *iommu_group)
>  {
>  	struct vfio_group *group;
>  
> +	lockdep_assert_held(&vfio.group_lock);
> +
>  	list_for_each_entry(group, &vfio.group_list, vfio_next) {
> -		if (group->iommu_group == iommu_group) {
> -			vfio_group_get(group);
> +		if (group->iommu_group == iommu_group)
>  			return group;
> -		}
>  	}
>  	return NULL;
>  }
> @@ -365,11 +365,27 @@ static struct vfio_group *vfio_group_alloc(struct iommu_group *iommu_group,
>  	return group;
>  }
>  
> +static bool vfio_group_has_device(struct vfio_group *group, struct device *dev)
> +{
> +	struct vfio_device *device;
> +
> +	mutex_lock(&group->device_lock);
> +	list_for_each_entry(device, &group->device_list, group_next) {
> +		if (device->dev == dev) {
> +			mutex_unlock(&group->device_lock);
> +			return true;
> +		}
> +	}
> +	mutex_unlock(&group->device_lock);
> +	return false;
> +}
> +
>  /*
>   * Return a struct vfio_group * for the given iommu_group. If no vfio_group
>   * already exists then create a new one.
>   */
> -static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
> +static struct vfio_group *vfio_get_group(struct device *dev,
> +					 struct iommu_group *iommu_group,
>  					 enum vfio_group_type type)
>  {
>  	struct vfio_group *group;
> @@ -378,13 +394,20 @@ static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
>  
>  	mutex_lock(&vfio.group_lock);
>  
> -	ret = __vfio_group_get_from_iommu(iommu_group);
> -	if (ret)
> -		goto err_unlock;
> +	ret = vfio_group_find_from_iommu(iommu_group);
> +	if (ret) {
> +		if (WARN_ON(vfio_group_has_device(ret, dev))) {
> +			ret = ERR_PTR(-EINVAL);
> +			goto out_unlock;
> +		}

This still looks racy.  We only know within vfio_group_has_device() that
the device is not present in the group, what prevents a race between
here and when we finally do add it to group->device_list?  We can't
make any guarantees if we drop group->device_lock between test and add.

The semantics of vfio_get_group() are also rather strange, 'return a
vfio_group for this iommu_group, but make sure it doesn't include this
device' :-\  Thanks,

Alex


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group()
  2022-09-22 19:10   ` Alex Williamson
@ 2022-09-22 19:36     ` Jason Gunthorpe
  2022-09-22 21:23       ` Alex Williamson
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 19:36 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon, Qian Cai,
	Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

On Thu, Sep 22, 2022 at 01:10:50PM -0600, Alex Williamson wrote:
> > @@ -378,13 +394,20 @@ static struct vfio_group *vfio_get_group(struct iommu_group *iommu_group,
> >  
> >  	mutex_lock(&vfio.group_lock);
> >  
> > -	ret = __vfio_group_get_from_iommu(iommu_group);
> > -	if (ret)
> > -		goto err_unlock;
> > +	ret = vfio_group_find_from_iommu(iommu_group);
> > +	if (ret) {
> > +		if (WARN_ON(vfio_group_has_device(ret, dev))) {
> > +			ret = ERR_PTR(-EINVAL);
> > +			goto out_unlock;
> > +		}
> 
> This still looks racy.  We only know within vfio_group_has_device() that
> the device is not present in the group, what prevents a race between
> here and when we finally do add it to group->device_list?

This is a condition which is defined to be impossible and by
auditing I've checked it can't happen.

There is no race in the sense that this can't actually happen, if it
does happen it means the group is corrupted. At that point reasoning
about locks and such goes out the window too - eg it might be
corrupted because of bad locking.

When it comes to self-debugging tests we often have these
"races", eg in the destroy path we do:

	WARN_ON(!list_empty(&group->device_list));

Which doesn't hold the appropriate locks either.

The point is just to detect that group has been corrupted at a point
in time in hopes of guarding against a future kernel bug.

> The semantics of vfio_get_group() are also rather strange, 'return a
> vfio_group for this iommu_group, but make sure it doesn't include this
> device' :-\  Thanks,

I think of it as "return a group and also do sanity checks that the
returned group has not been corrupted"

I don't like the name of this function but couldn't figure a better
one. It is something like "find or create a group for a device which
we know doesn't already have a group"

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group()
  2022-09-22 19:36     ` Jason Gunthorpe
@ 2022-09-22 21:23       ` Alex Williamson
  2022-09-22 23:12         ` Jason Gunthorpe
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2022-09-22 21:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon, Qian Cai,
	Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

On Thu, 22 Sep 2022 16:36:14 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Sep 22, 2022 at 01:10:50PM -0600, Alex Williamson wrote:
> > The semantics of vfio_get_group() are also rather strange, 'return a
> > vfio_group for this iommu_group, but make sure it doesn't include this
> > device' :-\  Thanks,  
> 
> I think of it as "return a group and also do sanity checks that the
> returned group has not been corrupted"
> 
> I don't like the name of this function but couldn't figure a better
> one. It is something like "find or create a group for a device which
> we know doesn't already have a group"

Well, we don't really need to have this behavior, we could choose to
implement the first two patches with the caller holding the
group_lock.  Only one of the callers needs the duplicate test, no-iommu
creates its own iommu_group and therefore cannot have an existing
device.  I think patches 1 & 2 would look like below*, with patch 3
simply moving the change from vfio_group_get() to refcount_inc() into
the equivalent place in vfio_group_find_of_alloc().  Thanks,

Alex

*only compile tested

---
 drivers/vfio/vfio_main.c |   33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index f9d10dbcf3e6..aa33944cb759 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -310,10 +310,12 @@ static void vfio_container_put(struct vfio_container *container)
  * Group objects - create, release, get, put, search
  */
 static struct vfio_group *
-__vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
+	lockdep_assert_held(&vfio.group_lock);
+
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
 		if (group->iommu_group == iommu_group) {
 			vfio_group_get(group);
@@ -323,17 +325,6 @@ __vfio_group_get_from_iommu(struct iommu_group *iommu_group)
 	return NULL;
 }
 
-static struct vfio_group *
-vfio_group_get_from_iommu(struct iommu_group *iommu_group)
-{
-	struct vfio_group *group;
-
-	mutex_lock(&vfio.group_lock);
-	group = __vfio_group_get_from_iommu(iommu_group);
-	mutex_unlock(&vfio.group_lock);
-	return group;
-}
-
 static void vfio_group_release(struct device *dev)
 {
 	struct vfio_group *group = container_of(dev, struct vfio_group, dev);
@@ -387,6 +378,8 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 	struct vfio_group *ret;
 	int err;
 
+	lockdep_assert_held(&vfio.group_lock);
+
 	group = vfio_group_alloc(iommu_group, type);
 	if (IS_ERR(group))
 		return group;
@@ -399,26 +392,16 @@ static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group,
 		goto err_put;
 	}
 
-	mutex_lock(&vfio.group_lock);
-
-	/* Did we race creating this group? */
-	ret = __vfio_group_get_from_iommu(iommu_group);
-	if (ret)
-		goto err_unlock;
-
 	err = cdev_device_add(&group->cdev, &group->dev);
 	if (err) {
 		ret = ERR_PTR(err);
-		goto err_unlock;
+		goto err_put;
 	}
 
 	list_add(&group->vfio_next, &vfio.group_list);
 
-	mutex_unlock(&vfio.group_lock);
 	return group;
 
-err_unlock:
-	mutex_unlock(&vfio.group_lock);
 err_put:
 	put_device(&group->dev);
 	return ret;
@@ -609,7 +592,9 @@ static struct vfio_group *vfio_noiommu_group_alloc(struct device *dev,
 	if (ret)
 		goto out_put_group;
 
+	mutex_lock(&vfio.group_lock);
 	group = vfio_create_group(iommu_group, type);
+	mutex_unlock(&vfio.group_lock);
 	if (IS_ERR(group)) {
 		ret = PTR_ERR(group);
 		goto out_remove_device;
@@ -659,9 +644,11 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 		return ERR_PTR(-EINVAL);
 	}
 
+	mutex_lock(&vfio.group_lock);
 	group = vfio_group_get_from_iommu(iommu_group);
 	if (!group)
 		group = vfio_create_group(iommu_group, VFIO_IOMMU);
+	mutex_unlock(&vfio.group_lock);
 
 	/* The vfio_group holds a reference to the iommu_group */
 	iommu_group_put(iommu_group);


---
 drivers/vfio/vfio_main.c |   58 ++++++++++++++++++++--------------------------
 1 file changed, 25 insertions(+), 33 deletions(-)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index aa33944cb759..4692493d386a 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -310,17 +310,15 @@ static void vfio_container_put(struct vfio_container *container)
  * Group objects - create, release, get, put, search
  */
 static struct vfio_group *
-vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+vfio_group_find_from_iommu(struct iommu_group *iommu_group)
 {
 	struct vfio_group *group;
 
 	lockdep_assert_held(&vfio.group_lock);
 
 	list_for_each_entry(group, &vfio.group_list, vfio_next) {
-		if (group->iommu_group == iommu_group) {
-			vfio_group_get(group);
+		if (group->iommu_group == iommu_group)
 			return group;
-		}
 	}
 	return NULL;
 }
@@ -449,23 +447,6 @@ static bool vfio_device_try_get_registration(struct vfio_device *device)
 	return refcount_inc_not_zero(&device->refcount);
 }
 
-static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
-						 struct device *dev)
-{
-	struct vfio_device *device;
-
-	mutex_lock(&group->device_lock);
-	list_for_each_entry(device, &group->device_list, group_next) {
-		if (device->dev == dev &&
-		    vfio_device_try_get_registration(device)) {
-			mutex_unlock(&group->device_lock);
-			return device;
-		}
-	}
-	mutex_unlock(&group->device_lock);
-	return NULL;
-}
-
 /*
  * VFIO driver API
  */
@@ -609,6 +590,21 @@ static struct vfio_group *vfio_noiommu_group_alloc(struct device *dev,
 	return ERR_PTR(ret);
 }
 
+static bool vfio_group_has_device(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (device->dev == dev) {
+			mutex_unlock(&group->device_lock);
+			return true;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+	return false;
+}
+
 static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 {
 	struct iommu_group *iommu_group;
@@ -645,9 +641,15 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 	}
 
 	mutex_lock(&vfio.group_lock);
-	group = vfio_group_get_from_iommu(iommu_group);
-	if (!group)
+	group = vfio_group_find_from_iommu(iommu_group);
+	if (group) {
+		if (WARN_ON(vfio_group_has_device(group, dev)))
+			group = ERR_PTR(-EINVAL);
+		else
+			vfio_group_get(group);
+	} else {
 		group = vfio_create_group(iommu_group, VFIO_IOMMU);
+	}
 	mutex_unlock(&vfio.group_lock);
 
 	/* The vfio_group holds a reference to the iommu_group */
@@ -658,7 +660,6 @@ static struct vfio_group *vfio_group_find_or_alloc(struct device *dev)
 static int __vfio_register_dev(struct vfio_device *device,
 		struct vfio_group *group)
 {
-	struct vfio_device *existing_device;
 	int ret;
 
 	if (IS_ERR(group))
@@ -671,15 +672,6 @@ static int __vfio_register_dev(struct vfio_device *device,
 	if (!device->dev_set)
 		vfio_assign_device_set(device, device);
 
-	existing_device = vfio_group_get_device(group, device->dev);
-	if (existing_device) {
-		dev_WARN(device->dev, "Device already exists on group %d\n",
-			 iommu_group_id(group->iommu_group));
-		vfio_device_put_registration(existing_device);
-		ret = -EBUSY;
-		goto err_out;
-	}
-
 	/* Our reference on group is moved to the device */
 	device->group = group;
 


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group()
  2022-09-22 21:23       ` Alex Williamson
@ 2022-09-22 23:12         ` Jason Gunthorpe
  0 siblings, 0 replies; 24+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 23:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, iommu, Joerg Roedel, kvm, Will Deacon, Qian Cai,
	Joerg Roedel, Marek Szyprowski, Matthew Rosato, Robin Murphy

On Thu, Sep 22, 2022 at 03:23:38PM -0600, Alex Williamson wrote:

> Well, we don't really need to have this behavior, we could choose to
> implement the first two patches with the caller holding the
> group_lock.  Only one of the callers needs the duplicate test, no-iommu
> creates its own iommu_group and therefore cannot have an existing
> device.  I think patches 1 & 2 would look like below*, with patch 3
> simply moving the change from vfio_group_get() to refcount_inc() into
> the equivalent place in vfio_group_find_of_alloc().  Thanks,

Yeah, this is nice

I just rebased it onto Kevins series and the reason I did this patch
has evaporated so let me check it again, maybe we don't need it.

Jason

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-09-22 23:12 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 18:44 [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Jason Gunthorpe
2022-09-08 18:44 ` [PATCH 1/4] vfio: Simplify vfio_create_group() Jason Gunthorpe
2022-09-20 19:45   ` Matthew Rosato
2022-09-08 18:44 ` [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group() Jason Gunthorpe
2022-09-22 19:10   ` Alex Williamson
2022-09-22 19:36     ` Jason Gunthorpe
2022-09-22 21:23       ` Alex Williamson
2022-09-22 23:12         ` Jason Gunthorpe
2022-09-08 18:45 ` [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group * Jason Gunthorpe
2022-09-20 19:32   ` Matthew Rosato
2022-09-08 18:45 ` [PATCH 4/4] iommu: Fix ordering of iommu_release_device() Jason Gunthorpe
2022-09-08 21:05   ` Robin Murphy
2022-09-08 21:27     ` Robin Murphy
2022-09-08 21:43       ` Jason Gunthorpe
2022-09-09  9:05         ` Robin Murphy
2022-09-09 13:25           ` Jason Gunthorpe
2022-09-09 17:57             ` Robin Murphy
2022-09-09 18:30               ` Jason Gunthorpe
2022-09-09 19:55                 ` Robin Murphy
2022-09-09 23:45                   ` Jason Gunthorpe
2022-09-12 11:13                     ` Robin Murphy
2022-09-22 16:56                       ` Jason Gunthorpe
2022-09-09 12:49 ` [PATCH 0/4] Fix splats releated to using the iommu_group after destroying devices Matthew Rosato
2022-09-09 16:24   ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).