iommu.lists.linux-foundation.org archive mirror
From: Robin Murphy <robin.murphy@arm.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, Will Deacon <will@kernel.org>,
	Qian Cai <cai@lca.pw>, Joerg Roedel <jroedel@suse.de>,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	Matthew Rosato <mjrosato@linux.ibm.com>
Subject: Re: [PATCH 4/4] iommu: Fix ordering of iommu_release_device()
Date: Mon, 12 Sep 2022 12:13:25 +0100	[thread overview]
Message-ID: <2c66a30e-f96f-f11c-bb05-a5f4a756ab30@arm.com> (raw)
In-Reply-To: <YxvQNTD1U4bs5TZD@nvidia.com>

On 2022-09-10 00:45, Jason Gunthorpe wrote:
> On Fri, Sep 09, 2022 at 08:55:07PM +0100, Robin Murphy wrote:
> 
>>> Isn't this every driver though? Like every single driver implementing
>>> an UNMANAGED/DMA/DMA_FQ domain has a hidden reference to the
>>> iommu_domain - minimally to point the HW to the IOPTEs it stores.
>>
>> Um, no? Domain ops get the domain passed in as an argument, which is far
>> from hidden, and if any driver implemented them to ignore that argument and
>> operate on something else it would be stupid and broken. Note I said
>> "per-device reference", meaning things like s390's zpci_dev->s390_domain and
>> SMMUv3's dev->iommu->priv->domain. It's only those references that are
>> reachable from release_device - outside the normal domain lifecycle - which
>> are problematic.
> 
> If the plan is to make the domain refcounted and then allow a 'put' on
> it before we reach release_device() then it means every driver needs
> to hold a 'get' on the domain while it is programmed into the HW.
> 
> Because the hw will still be touching memory that could be freed by an
> iommu_domain_free(). By "hidden" reference I mean the HW walkers are
> touching memory that would be freed - i.e. KASAN won't see it.

As far as I'm concerned we're dealing purely with the case where 
release_device races with attaching back to the default domain *and* the 
driver has some reason for release_device to poke at what it thinks the 
currently-attached domain is. Anyone who frees a domain while it's still 
actually live deserves whatever they get; it would be thoroughly 
impractical to attempt to mitigate that kind of silliness.

>>> Do you know a reason not to hold the group mutex across
>>> release_device? I think that is the most straightforward and
>>> future proof.
>>
>> Yes, the ones documented in the code and already discussed here. The current
>> functional ones aren't particularly *good* reasons, but unless and until
>> they can all be cleaned up they are what they are.
> 
> Uh, I feel like I missed part of the conversation - I don't know what
> this list is..

s390 (remember how we got here?) calls iommu_get_domain_for_dev(). 
ipmmu-vmsa calls arm_iommu_detach_device() (mtk_v1 doesn't, but perhaps 
technically should), to undo the corresponding attach from 
probe_finalize - apologies for misremembering which way round the 
comments were.

>>>> @@ -1022,6 +1030,14 @@ void iommu_group_remove_device(struct device *dev)
>>>>    	dev_info(dev, "Removing from iommu group %d\n", group->id);
>>>>    	mutex_lock(&group->mutex);
>>>> +	if (WARN_ON(group->domain != group->default_domain &&
>>>> +		    group->domain != group->blocking_domain)) {
>>>
>>> This will falsely trigger: if there are two VFIO devices then the group
>>> will remain owned when we unplug just one of them, but the group's domain
>>> will be a VFIO owned domain.
>>
>> As opposed to currently, where most drivers' release_device will blindly
>> detach/disable the RID in some fashion so the other device would suddenly
>> blow up anyway?
> 
> Er, I think it is OK today, in the non-shared case. If the RID isn't
> shared then each device in the group is independent, so most drivers,
> most of the time, should only affect the RID release_device() is
> called on, while this warning will always trigger for any multi-device
> group.

Oh, apparently I managed to misinterpret this as the two *aliasing* 
devices case, sorry. Indeed it is overly conservative for that. I think 
the robust way to detect bad usage is actually not via the group at all, 
but for iommu_device_{un}use_default_domain() to also maintain a 
per-device ownership flag, then we warn if a device is released with 
that still set.

>> (It *will* actually work on SMMUv2 because SMMUv2 comprehensively handles
>> StreamID-level aliasing beyond what pci_device_group() covers, which I
>> remain rather proud of)
> 
> This is why I preferred not to explicitly change the domain, because at
> least if someone did write a non-buggy driver it doesn't get wrecked -
> and making a non-buggy driver is at least allowed by the API.

Detaching back to the default domain still seems like it's *always* the 
right thing to do at this point, even when it should not be warned about 
as above. As I say it *does* work on non-buggy drivers, and making this 
whole domain use-after-free race a fundamental non-issue is attractive.

>>> And it misses the logic to WARN_ON if a domain is set and an external
>>> entity wrongly uses iommu_group_remove_device()..
>>
>> Huh? An external fake group couldn't have a default domain or blocking
>> domain, thus any non-NULL domain will not compare equal to either, so if
>> that could happen it will warn, and then most likely crash. I did think
>> briefly about trying to make it not crash, but then I remembered that fake
>> groups from external callers also aren't backed by IOMMU API drivers so have
>> no way to allocate or attach domains either, so in fact it cannot happen at
>> all under any circumstances that are worth reasoning about.
> 
> I mean specifically things like what FSL is doing, where a real driver
> calls this API and the test of 'group->domain == NULL' is the more
> robust precondition.

Having looked a bit closer, I think I get what PAMU is doing - kind of 
impedance-matching between pci_device_group() and its own non-ACS form 
of isolation, and possibly also a rather roundabout way of propagating 
DT data from the PCI controller node up into the PCI hierarchy. Both 
could quite likely be done in a more straightforward manner these days 
(and TBH I'm not convinced it works at all since it doesn't appear to 
match the actual DT binding), but either way I'm fairly confident we 
needn't worry about it.

Thanks,
Robin.

Thread overview: 24+ messages
2022-09-08 18:44 [PATCH 0/4] Fix splats related to using the iommu_group after destroying devices Jason Gunthorpe
2022-09-08 18:44 ` [PATCH 1/4] vfio: Simplify vfio_create_group() Jason Gunthorpe
2022-09-20 19:45   ` Matthew Rosato
2022-09-08 18:44 ` [PATCH 2/4] vfio: Move the sanity check of the group to vfio_create_group() Jason Gunthorpe
2022-09-22 19:10   ` Alex Williamson
2022-09-22 19:36     ` Jason Gunthorpe
2022-09-22 21:23       ` Alex Williamson
2022-09-22 23:12         ` Jason Gunthorpe
2022-09-08 18:45 ` [PATCH 3/4] vfio: Follow a strict lifetime for struct iommu_group * Jason Gunthorpe
2022-09-20 19:32   ` Matthew Rosato
2022-09-08 18:45 ` [PATCH 4/4] iommu: Fix ordering of iommu_release_device() Jason Gunthorpe
2022-09-08 21:05   ` Robin Murphy
2022-09-08 21:27     ` Robin Murphy
2022-09-08 21:43       ` Jason Gunthorpe
2022-09-09  9:05         ` Robin Murphy
2022-09-09 13:25           ` Jason Gunthorpe
2022-09-09 17:57             ` Robin Murphy
2022-09-09 18:30               ` Jason Gunthorpe
2022-09-09 19:55                 ` Robin Murphy
2022-09-09 23:45                   ` Jason Gunthorpe
2022-09-12 11:13                     ` Robin Murphy [this message]
2022-09-22 16:56                       ` Jason Gunthorpe
2022-09-09 12:49 ` [PATCH 0/4] Fix splats related to using the iommu_group after destroying devices Matthew Rosato
2022-09-09 16:24   ` Jason Gunthorpe