On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.
>
> So this needs some thought.

The problem here is that Intel implemented interrupt remapping in a way
which is anathema to structured, ordered IRQ domains.

When a guest writes an MSI message (addr/data) to the MSI table of a PCI
device which has been assigned to that guest, it *doesn't* properly
inherit the MSI composition from a parent irqdomain which knows about
the (host-side) IOMMU.

What actually happens is the hypervisor *traps* the writes to the
device's MSI table, and translates them *then*. In *precisely* the
fashion which we're trying to avoid for IMS.

Now, you can imagine a world where it wasn't like this, where Remappable
Format MSI messages don't exist, and where we let guests write native
MSI messages to the device without trapping, and where the IOMMU then
sees the incoming interrupt and has to map the APIC ID to a *virtual*
CPU for that guest, based on the PCI source-id of the device.

In that world, IMS would work naturally. But that isn't how Intel
designed interrupt remapping. They *designed* it to have to trap and
translate as the message is written to the device.

So it does look like we're going to need a hypercall interface to
compose an MSI message on behalf of the guest, for IMS to use. In fact
PCI devices assigned to a guest could use that too, and then we'd only
need to trap-and-remap any attempt to write a Compatibility Format MSI
to the device's MSI table, while letting Remappable Format messages get
written directly.

We'd also need a way for an OS running on bare metal to *know* that it's
on bare metal and can just compose MSI messages for itself.
Since we do expect bare metal to have an IOMMU, perhaps that is just a
feature flag on the IOMMU?

That or Intel needs to fix the IOMMU to do proper virtualisation and
actually translate "Compatibility Format" MSIs for a guest too.
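
To make the proposed split concrete, here is a minimal sketch in C of
the two decisions involved. Every name in it (msi_msg_sketch,
hv_classify_msi_write, guest_pick_compose_path and the enums) is
hypothetical and made up for illustration; none of them is an existing
kernel or hypervisor interface, and the "bare metal" flag is only the
idea floated above. The one hard fact it relies on is that VT-d uses
bit 4 of the MSI address as the Interrupt Format bit (0 for
Compatibility Format, 1 for Remappable Format).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* VT-d: bit 4 of the MSI address is the Interrupt Format bit. */
#define MSI_ADDR_FORMAT_REMAPPABLE (1u << 4)

static_assert(MSI_ADDR_FORMAT_REMAPPABLE == 0x10u,
	      "Interrupt Format is address bit 4");

/* Hypothetical stand-in for an addr/data pair written to an MSI table. */
struct msi_msg_sketch {
	uint32_t address_lo;
	uint32_t data;
};

/* Hypothetical outcomes for a guest write to an assigned device's table. */
enum msi_write_action {
	MSI_WRITE_PASSTHROUGH,		/* let the write reach the device */
	MSI_WRITE_TRAP_AND_REMAP,	/* hypervisor must translate it */
};

/* Hypothetical ways for an OS to obtain an MSI message. */
enum msi_compose_path {
	MSI_COMPOSE_NATIVE,		/* compose the message itself */
	MSI_COMPOSE_HYPERCALL,		/* ask the hypervisor to compose it */
};

static inline bool msi_is_remappable(const struct msi_msg_sketch *msg)
{
	return (msg->address_lo & MSI_ADDR_FORMAT_REMAPPABLE) != 0;
}

/*
 * Hypervisor side. Assumption: a Remappable Format message was composed
 * via the proposed hypercall and so is already valid for the host IOMMU.
 * Only Compatibility Format writes still need trap-and-remap.
 */
static inline enum msi_write_action
hv_classify_msi_write(const struct msi_msg_sketch *msg)
{
	return msi_is_remappable(msg) ? MSI_WRITE_PASSTHROUGH
				      : MSI_WRITE_TRAP_AND_REMAP;
}

/*
 * OS side. Assumption: the IOMMU exposes some flag saying "this is bare
 * metal"; the parameter stands in for reading that hypothetical flag.
 */
static inline enum msi_compose_path
guest_pick_compose_path(bool iommu_reports_bare_metal)
{
	return iommu_reports_bare_metal ? MSI_COMPOSE_NATIVE
					: MSI_COMPOSE_HYPERCALL;
}
```

With this shape, a write of address 0xfee00000 (bit 4 clear) would be
classified as trap-and-remap, while the same address with bit 4 set
would be passed through untouched.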