From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>, Nicolin Chen <nicolinc@nvidia.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>
Subject: Re: RMRR device on non-Intel platform
Date: Fri, 21 Apr 2023 14:58:01 -0300	[thread overview]
Message-ID: <ZELOqZliiwbG6l5K@nvidia.com> (raw)
In-Reply-To: <fe7e20e5-9729-248d-ee03-c8b444a1b7c0@arm.com>

On Fri, Apr 21, 2023 at 06:22:37PM +0100, Robin Murphy wrote:

> I think a slightly more considered and slightly less wrong version of that
> idea is to mark it as IOMMU_RESV_MSI, and special-case direct-mapping those
> on Arm (I believe it would technically be benign to do on x86 too, but might
> annoy people with its pointlessness). However...

I'd rather have an IOMMU_RESV_MSI_DIRECT and put the ARM special case
in ARM code..

> > On baremetal we have no idea what the platform put under that
> > hardcoded address?
> > 
> > On VM we don't use the iommu_get_msi_cookie() flow because the GIC in
> > the VM pretends it doesn't have an ITS page?  (did I get that right?)
> 
> No, that can't be right - PCIe devices have to support MSI or MSI-X, and
> many of them won't support INTx at all, so if the guest wants to use
> interrupts in general it must surely need to believe it has some kind of MSI
> controller, 

Yes..

> which for practical purposes in this context means an ITS. 

I haven't delved into it in great detail, but.. my impression was..

The ITS page only becomes relevant to the IOMMU layer if the actual
IRQ driver calls iommu_dma_prepare_msi()

And we have only these drivers that do so:

drivers/irqchip/irq-gic-v2m.c:  err = iommu_dma_prepare_msi(info->desc,
drivers/irqchip/irq-gic-v3-its.c:       err = iommu_dma_prepare_msi(info->desc, its->get_msi_base(its_dev));
drivers/irqchip/irq-gic-v3-mbi.c:       err = iommu_dma_prepare_msi(info->desc,
drivers/irqchip/irq-ls-scfg-msi.c:      err = iommu_dma_prepare_msi(info->desc, msi_data->msiir_addr);

Meanwhile, I *thought* that the vGIC in ARM uses

drivers/irqchip/irq-gic-v4.c

Which doesn't obviously call iommu_dma_prepare_msi()?

So while the SMMU driver will stick in an IOMMU_RESV_SW_MSI, and
iommufd will call iommu_get_msi_cookie(), there is no matching call
to iommu_dma_prepare_msi() - so it all effectively does nothing.

Instead, again from what I understood, the IOMMU layer is expected
to install the ITS page, not knowing it is an ITS page, because the
ACPI creates an IOMMU_RESV_DIRECT.

When the VM writes its totally-a-lie MSI address to the PCI MSI-X
registers the hypervisor traps it and substitutes what it valiantly
hopes is the right address for the ITS in the VM's S1 IOMMU table,
based on the ACPI where it nicely asked the guest to keep this
specific IOVA mapped.

I'm not sure how the data bit works on ARM..

> was the next thing I started wondering after the above - if the aim is to
> direct-map the host's SW_MSI region to effectively pass through the S2 MSI
> cookie, but you have the same Linux SMMU driver in the guest, isn't that
> guest driver still going to add a conflicting SW_MSI region for the same
> IOVA and confuse things?

Oh, probably yes. At least from the iommufd perspective, it can
resolve overlapping regions just fine though.

> Ideally for nesting, the VMM would just tell us the IPA of where it's going
> to claim the given device's associated MSI doorbell is, we map that to the
> real underlying address at S2, then the guest can use its S1 cookie as
> normal if it wants to, and the host doesn't have to rewrite addresses either
> way. 

Goodness yes, I'd love that.

> that the nesting usage model inherently constrains the VMM's options for
> emulating the IOMMU, would it be unreasonable to make our lives a lot easier
> with some similar constraints around interrupts, and just not attempt to
> support the full gamut of "emulate any kind of IRQ with any other kind of
> IRQ" irqfd hilarity?

Isn't that what GICv4 is?

Frankly, I think something went wrong with the GICv4 design. A purely
virtualization focused GIC should not have continued to rely on the
hypervisor trapping the MSI-X writes. The guest should have had a
real data value and a real physical ITS page.

I can understand why we got here, because fixing *all* of that would
be a big task and this is a small hack, but still... Yuk.

But that is a whole other journey. There is work afoot to standardize
some things that would make MSI-X trapping impossible and more solidly
force this issue, so I'm just hoping to keep the current mess going
as-is right now..

> > > MSI regions already represent "safe" direct mappings, either as an inherent
> > > property of the hardware, or with an actual mapping maintained by software.
> > > Also RELAXABLE is meant to imply that it is only needed until a driver takes
> > > over the device, which at face value doesn't make much sense for interrupts.
> > 
> > I used "relaxable" to suggest it is safe for userspace.
> 
> I know, but the subtlety is the reason *why* it's safe for userspace. Namely
> that a VFIO driver has bound and reset (or at least taken control of) the
> device, and thus it is assumed to no longer be doing whatever the boot
> firmware left it doing, therefore the reserved region is assumed to no
> longer be relevant, and from then on the requirement to preserve it can be
> relaxed.

IOMMU_RESV_MSI_DIRECT is probably the better name

> >          unsigned long pg_size;
> >          int ret = 0;
> > -       if (!iommu_is_dma_domain(domain))
> > -               return 0;
> > -
> >          BUG_ON(!domain->pgsize_bitmap);
> >          pg_size = 1UL << __ffs(domain->pgsize_bitmap);
> 
> But then you realise that you also need to juggle this around since identity
> domains aren't required to have a valid pgsize_bitmap either, give up on the
> idea and go straight to writing a dedicated loop as the clearer and tidier
> option because hey this is hardly a fast path anyway. At least, you do if
> you're me :)

domain->pgsize_bitmap is always valid memory, and __ffs() always
returns [0:31], so this calculation will be fine but garbage.

> > @@ -1052,13 +1049,18 @@ static int iommu_create_device_direct_mappings(struct i>
> >                  dma_addr_t start, end, addr;
> >                  size_t map_size = 0;
> > -               start = ALIGN(entry->start, pg_size);
> > -               end   = ALIGN(entry->start + entry->length, pg_size);
> > -
> >                  if (entry->type != IOMMU_RESV_DIRECT &&
> >                      entry->type != IOMMU_RESV_DIRECT_RELAXABLE)
> >                          continue;
> > +               if (entry->type == IOMMU_RESV_DIRECT)
> > +                       dev->iommu->requires_direct = 1;
> > +
> > +               if (!iommu_is_dma_domain(domain))
> > +                       continue;
> > +
> > +               start = ALIGN(entry->start, pg_size);
> > +               end   = ALIGN(entry->start + entry->length, pg_size);

Which is why I moved the only reader of pg_size to after the check
that it is valid..

Thanks,
Jason
