From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, Will Deacon <will@kernel.org>,
	Qian Cai <cai@lca.pw>, Joerg Roedel <jroedel@suse.de>,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	Matthew Rosato <mjrosato@linux.ibm.com>
Subject: Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
Date: Fri, 9 Sep 2022 20:45:57 -0300	[thread overview]
Message-ID: <YxvQNTD1U4bs5TZD@nvidia.com> (raw)
In-Reply-To: <b753aecb-ee2a-2cd0-1df2-0c3e977b4cb9@arm.com>

On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:

> > Isn't this every driver though? Like every single driver implementing
> > an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
> > iommu_domain - minimally to point the HW to the IOPTEs it stores.
> 
> Um, no? Domain ops get the domain passed in as an argument, which is far
> from hidden, and if any driver implemented them to ignore that argument and
> operate on something else it would be stupid and broken. Note I said
> "per-device reference", meaning things like s390's zpci_dev->s390_domain and
> SMMUv3's dev->iommu->priv->domain. It's only those references that are
> reachable from release_device - outside the normal domain lifecycle - which
> are problematic.

If the plan is to make the domain refcounted and then allow a 'put' on
it before we reach release_device() then it means every driver needs
to hold a 'get' on the domain while it is programmed into the HW.

Because the HW will still be touching memory that could be freed by
iommu_domain_free(). By "hidden" reference I mean the HW walkers are
touching memory that would be freed - i.e. KASAN won't see it.
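
A minimal sketch of that obligation (assuming hypothetical
iommu_domain_get()/iommu_domain_put() refcount helpers, plus made-up
drv_* callbacks - none of these exist today):

    /* Hypothetical sketch only - the refcount helpers and the drv_*
     * functions are invented for illustration.
     */
    static int drv_attach_dev(struct iommu_domain *domain,
                              struct device *dev)
    {
            struct drv_domain *dd = to_drv_domain(domain);

            /* Pin the domain for as long as the HW can walk its IOPTEs */
            iommu_domain_get(domain);
            drv_hw_set_pgtable(dev, dd->pgd);
            return 0;
    }

    static void drv_detach_dev(struct iommu_domain *domain,
                               struct device *dev)
    {
            /* Only once the HW can no longer touch the IOPTEs... */
            drv_hw_clear_pgtable(dev);
            /* ...is it safe to drop the reference */
            iommu_domain_put(domain);
    }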

> > Do you know a reason not to hold the group mutex across
> > release_device? I think that is the most straightforward and
> > future proof.
> 
> Yes, the ones documented in the code and already discussed here. The current
> functional ones aren't particularly *good* reasons, but unless and until
> they can all be cleaned up they are what they are.

Uh, I feel like I missed part of the conversation - I don't know what
this list is..

I did look. I'm looking for a call chain that goes from
release_device() into a core function that grabs the group->mutex.

There is a comment in iommu_group_store_type() that suggests there is
a recursion but doesn't say what it is. It was an Intel person who
wrote the comment, so I checked the Intel driver more carefully and
didn't find a call path, but it is big and complicated..

There is a comment in iommu_change_dev_def_domain() about recursion
on probe_finalize(), but that's not relevant here, AFAICT.

So, I took two approaches. First, I checked quickly through every
release_device implementation looking for something.

Then I looked across the entire exported driver-facing API, focused
on call chains going back toward the core from APIs that might be
trouble, and audited them almost completely.

These APIs do not take the lock, so they are completely safe:
 iommu_group_alloc
 iommu_group_set_iommudata
 iommu_group_set_name
 iommu_group_get
 iommu_group_ref_get
 iommu_group_put
 iommu_get_domain_for_dev
 iommu_fwspec_free
 iommu_fwspec_init
 iommu_fwspec_add_ids
 iommu_put_resv_regions (called from release_device)

These do take the lock. I checked all of them in all the drivers and
didn't find an obvious path to release_device (see the paraphrased
example after the list):
 iommu_group_remove_device
 iommu_group_for_each_dev
 iommu_attach_device
 iommu_detach_device
 iommu_attach_group
 iommu_detach_group
 bus_set_iommu
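
For reference, the pattern the audit is looking for is a core entry
point taking the group lock around the operation, e.g.
iommu_attach_group() is roughly this (paraphrased from
drivers/iommu/iommu.c):

    int iommu_attach_group(struct iommu_domain *domain,
                           struct iommu_group *group)
    {
            int ret;

            mutex_lock(&group->mutex);
            ret = __iommu_attach_group(domain, group);
            mutex_unlock(&group->mutex);
            return ret;
    }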

Can't tell if these take the lock due to driver callbacks, but I
couldn't find them in release, and it doesn't make sense that they
would be there:
 iommu_device_register
 iommu_device_unregister
 iommu_domain_alloc
 iommu_domain_free

The rest of the exported interface touching the drivers - I didn't
carefully check whether they use the lock - but by name it seems
unlikely these are in release_device():
 iommu_register_device_fault_handler
 iommu_unregister_device_fault_handler
 iommu_report_device_fault
 iommu_page_response
 report_iommu_fault
 iommu_iova_to_phys
 iommu_map
 iommu_unmap
 iommu_alloc_resv_region
 iommu_present
 iommu_capable
 iommu_default_passthrough

It is big and complicated, so I wouldn't stake my life on it, but it
seems worth investigating further.

Could the recursion have been cleaned up already?

> > > @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
> > >   	dev_info(dev, "Removing from iommu group %d\n", group->id);
> > >   	mutex_lock(&group->mutex);
> > > +	if (WARN_ON(group->domain != group->default_domain &&
> > > +		    group->domain != group->blocking_domain)) {
> > 
> > This will false-trigger: if there are two VFIO devices then the group
> > will remain owned when we unplug just one of them, but the group's
> > domain will be a VFIO-owned domain.
> 
> As opposed to currently, where most drivers' release_device will blindly
> detach/disable the RID in some fashion so the other device would suddenly
> blow up anyway? 

Er, I think it is OK today, in the non-shared case. If the RID isn't
shared then each device in the group is independent, so most drivers,
most of the time, should only affect the RID that release_device() is
called on, while this warning will always trigger for any multi-device
group.
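
If the check has to tolerate that case, one possibility (my sketch,
not the patch - it assumes the group->owner field set by
iommu_group_claim_dma_owner() identifies an externally owned group) is
to also exempt owned groups:

    /* Sketch only: skip the warning while an external entity
     * (e.g. VFIO) owns the group and so legitimately has its own
     * domain attached.
     */
    WARN_ON(group->domain != group->default_domain &&
            group->domain != group->blocking_domain &&
            !group->owner);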

> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
> StreamID-level aliasing beyond what pci_device_group() covers, which I
> remain rather proud of)

This is why I preferred not to explicitly change the domain: at least
if someone did write a non-buggy driver it doesn't get wrecked - and
writing a non-buggy driver is at least allowed by the API.

> > And it misses the logic to WARN_ON if a domain is set and an external
> > entity wrongly uses iommu_group_remove_device()..
> 
> Huh? An external fake group couldn't have a default domain or blocking
> domain, thus any non-NULL domain will not compare equal to either, so if
> that could happen it will warn, and then most likely crash. I did think
> briefly about trying to make it not crash, but then I remembered that fake
> groups from external callers also aren't backed by IOMMU API drivers so have
> no way to allocate or attach domains either, so in fact it cannot happen at
> all under any circumstances that are worth reasoning about.

I mean specifically things like what FSL is doing, where a real
driver calls this API and the test of 'group->domain == NULL' is the
more robust precondition.
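
Concretely, the sketch for that case would just be:

    /* Robust precondition for a driver like fsl_pamu calling
     * iommu_group_remove_device() directly: nothing may still be
     * attached at this point.
     */
    WARN_ON(group->domain);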

So, IDK, I would prefer to understand where we hit a group mutex
recursion before rejecting that path... If you know specifics please
share; otherwise maybe we should stick a lockdep check in there and
see if anything hits?
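
The experiment I have in mind is roughly this (a sketch, assuming the
core ends up holding the mutex across the callback):

    /* Hold group->mutex across release_device(); lockdep reports
     * an AA deadlock at any recursive mutex_lock(&group->mutex)
     * reached from inside the callback.
     */
    mutex_lock(&group->mutex);
    if (ops->release_device)
            ops->release_device(dev);
    mutex_unlock(&group->mutex);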

But I'm off to LPC so I probably won't write anything more thoughtful
on this for a while.

Thanks,
Jason
