Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()

From: Robin Murphy <robin.murphy@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, Will Deacon <will@kernel.org>,
	Qian Cai <cai@lca.pw>, Joerg Roedel <jroedel@suse.de>,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	Matthew Rosato <mjrosato@linux.ibm.com>
Subject: Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
Date: Fri, 9 Sep 2022 18:57:59 +0100	[thread overview]
Message-ID: <e0ff6dc1-91b3-2e41-212c-c83a2bf2b3a8@arm.com> (raw)
In-Reply-To: <Yxs+1s+MPENLTUpG@nvidia.com>

On 2022-09-09 14:25, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 10:05:58AM +0100, Robin Murphy wrote:
>> On 2022-09-08 22:43, Jason Gunthorpe wrote:
>>> On Thu, Sep 08, 2022 at 10:27:06PM +0100, Robin Murphy wrote:
>>>
>>>> Oh, because s390 is using iommu_get_domain_for_dev() in its release_device
>>>> callback, which needs to dereference the group to work, and the current
>>>> domain may also be a non-default one which we can't prevent from
>>>> disappearing racily, that was why :(
>>>
>>> Hum, the issue there is the use of device->iommu_group - but that just
>>> means I didn't split properly. How about this incremental:
>>
>> That did cross my mind, but it's a bit grim.
> 
> Actually, also in my morning, I think it may not even be necessary.
> 
> Keep in mind the start of the series fixes VFIO.
> 
> The bug that S390 is trying to fix is that VFIO didn't put back the
> group ownership, it just left its own iommu_domain attached and called
> release().
> 
> But now, at least for single device groups, VFIO will put owenership
> back and zdev->s390_domain == NULL when we get to release_device()
> 
>> That then only leaves the issue that that domain may still become
>> invalid at any point after the group mutex has been dropped.
> 
> So that is this race:
> 
>          CPU0                         CPU1
>     iommu_release_device(a)
>        __iommu_group_remove_device(a)
> 			         iommu_device_use_default_domain(b)
>                                   iommu_domain_free(domain)
>                                   iommu_release_device(b)
>                                        ops->release_device(b)
>        ops->release_device(a)
>          // Boom, a is still attached to domain :(
> 
> I can't think of how to solve this other than holding the group mutex
> across release_device. See below.

I see a few possibilities:

- Backtrack slightly on its removal, and instead repurpose detach_dev
into a specialised domain cleanup callback, called before or during
iommu_group_remove_device(), with the group mutex held.

- Drivers that hold any kind of internal per-device references to
domains - which is generally the root of this issue in the first place -
can implement proper reference counting, so even if a domain is "freed"
with a device still attached as above, it doesn't actually go away until
release_device(a) cleans up the final dangling reference. I suggested
the core doing this generically, but on reflection I think it's actually
a lot more straightforward as a driver-internal thing.

- Drivers that basically just keep a list of devices in the domain and
need to do a list_del() in release_device, can also list_del_init() any
still-attached devices in domain_free, with a simple per-instance or
global lock to serialise the two.

>>> And to your other question, the reason I split the function is because
>>> I couldn't really say WTF iommu_group_remove_device() was supposed to
>>> do. The __ version make ssense as part of the remove_device, due to
>>> the sequencing with ops->release()
>>>
>>> But the other one doesn't have that. So I want to put in a:
>>>
>>>      WARN_ON(group->blocking_domain || group->default_domain);
>>>
>>> Because calling it after those domains are allocated looks broken to
>>> me.
>>
>> I might be misunderstanding, but that sounds backwards - if a real device is
>> being hotplugged out, we absolutely expect that to happen *after* its
>> default domain has been set up.
> 
> See below for what I mean
> 
> iommu_group_remove_device() doesn't work as an API because it has no
> way to tell the device to stop using the domain we are about to free.
> 
> So it should assert that there is no domain to worry about. For the
> vfio and power case there is no domain because they don't use iommu
> drivers

Ah, I see it now - if we think it's a usage error for any current API
user to allow a device to be removed while still attached to a non-
default domain, then we can just throw our hands up at that, and
mitigate for the default domain case that we *can* control. I'm not 100%
convinced there might not be some niche non-uAPI case for skipping a
detach because you know you're tearing down your device and domain at the
same time, but I'm inclined to agree that we can worry about that if and
when it does ever come up.

If so, I reckon it should be about as as easy as this (untested).

Cheers,
Robin.

----->8-----

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 9fbe5d067473..760d9bd3ad66 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -396,17 +396,25 @@ int iommu_probe_device(struct device *dev)
  void iommu_release_device(struct device *dev)
  {
  	const struct iommu_ops *ops;
+	struct iommu_group *group;
  
  	if (!dev->iommu)
  		return;
  
  	iommu_device_unlink(dev->iommu->iommu_dev, dev);
  
+	/*
+	 * Some drivers track a device's current domain internally and may
+	 * dereference it to clean up in release_device. If a default domain
+	 * exists, hold a reference to ensure it stays around long enough.
+	 */
+	group = iommu_group_get(dev);
+	iommu_group_remove_device(dev);
  	ops = dev_iommu_ops(dev);
  	if (ops->release_device)
  		ops->release_device(dev);
  
-	iommu_group_remove_device(dev);
+	iommu_group_put(group);
  	module_put(ops->owner);
  	dev_iommu_free(dev);
  }
@@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
  	dev_info(dev, "Removing from iommu group %d\n", group->id);
  
  	mutex_lock(&group->mutex);
+	if (WARN_ON(group->domain != group->default_domain &&
+		    group->domain != group->blocking_domain)) {
+		if (group->default_domain)
+			__iommu_attach_device(group->default_domain, dev);
+		else
+			__iommu_detach_device(group->domain, dev);
+	}
+
  	list_for_each_entry(tmp_device, &group->devices, list) {
  		if (tmp_device->dev == dev) {
  			device = tmp_device;