Virtualizing MSI-X on IMS via VFIO

From: "Tian, Kevin" <kevin.tian@intel.com>
To: "Alex Williamson (alex.williamson@redhat.com)" 
	<alex.williamson@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Jason Gunthorpe <jgg@nvidia.com>,
	"Dey, Megha" <megha.dey@intel.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"Pan, Jacob jun" <jacob.jun.pan@intel.com>,
	"Jiang, Dave" <dave.jiang@intel.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"Lu, Baolu" <baolu.lu@intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"Luck, Tony" <tony.luck@intel.com>,
	"Kumar, Sanjay K" <sanjay.k.kumar@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	Kirti Wankhede <kwankhede@nvidia.com>
Subject: Virtualizing MSI-X on IMS via VFIO
Date: Tue, 22 Jun 2021 10:16:15 +0000	[thread overview]
Message-ID: <MWHPR11MB188603D0D809C1079F5817DC8C099@MWHPR11MB1886.namprd11.prod.outlook.com> (raw)

Hi, Alex,

Need your help to understand the current MSI-X virtualization flow in 
VFIO. Some background info first.

Recently we are discussing how to virtualize MSI-X with Interrupt 
Message Storage (IMS) on mdev:
	https://lore.kernel.org/kvm/87im2lyiv6.ffs@nanos.tec.linutronix.de/ 

IMS is a device specific interrupt storage, allowing an optimized and 
scalable manner for generating interrupts. idxd mdev exposes virtual 
MSI-X capability to guest but uses IMS entries physically for generating 
interrupts. 

Thomas has helped implement a generic ims irqchip driver:
	https://lore.kernel.org/linux-hyperv/20200826112335.202234502@linutronix.de/

idxd device allows software to specify an IMS entry (for triggering 
completion interrupt) when submitting a descriptor. To prevent one 
mdev triggering malicious interrupt into another mdev (by specifying 
an arbitrary entry), idxd ims entry includes a PASID field for validation - 
only a matching PASID in the executed descriptor can trigger interrupt 
via this entry. idxd driver is expected to program ims entries with 
PASIDs that are allocated to the mdev which owns those entries.

Other devices may have different ID and format to isolate ims entries. 
But we need abstract a generic means for programming vendor-specific 
ID into vendor-specific ims entry, without violating the layering model. 

Thomas suggested vendor driver to first register ID information (possibly 
plus the location where to write ID to) in msi_desc when allocating irqs 
(extend existing alloc function or via new helper function) and then have 
the generic ims irqchip driver to update ID to the ims entry when it's 
started up by request_irq().

Then there are two questions to be answered:

    1) How does vendor driver decide the ID to be registered to msi_desc?
    2) How is Thomas's model mapped to the MSI-X virtualization flow in VFIO?

For the 1st open, there are two types of PASIDs on idxd mdev:

    1) default PASID: one per mdev and allocated when mdev is created;
    2) sva PASIDs: multiple per mdev and allocated on-demand (via vIOMMU);

If vIOMMU is not exposed, all ims entries of this mdev should be 
programmed with default PASID which is always available in mdev's 
lifespan.

If vIOMMU is exposed and guest sva is enabled, entries used for sva 
should be tagged with sva PASIDs, leaving others tagged with default 
PASID. To help achieve intra-guest interrupt isolation, guest idxd driver 
needs program guest sva PASIDs into virtual MSIX_PERM register (one 
per MSI-X entry) for validation. Access to MSIX_PERM is trap-and-emulated 
by host idxd driver which then figure out which PASID to register to 
msi_desc (require PASID translation info via new /dev/iommu proposal).

The guest driver is expected to update MSIX_PERM before request_irq().

Now the 2nd open requires your help. Below is what I learned from 
current vfio/qemu code (for vfio-pci device):

    0) Qemu doesn't attempt to allocate all irqs as reported by msix->
        table_size. It is done in an dynamic and incremental way.

    1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for 
         allocating/enabling irqs given a set of vMSIX vectors [start, count]:

        a) if irqs not allocated, allocate irqs [start+count]. Enable irqs for 
            specified vectors [start, count] via request_irq();
        b) if irqs already allocated, enable irqs for specified vectors;
        c) if irq already enabled, disable and re-enable irqs for specified
             vectors because user may specify a different eventfd;

    2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
        DEVICE_SET_IRQS to enable vector#0, even though it's currently 
        masked by the guest. Interrupts are received by Qemu but blocked
        from guest via mask/pending bit emulation. The main intention is 
        to enable physical MSI-X;

    3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
        DEVICE_SET_IRQS to enable vector#0 again, with a eventfd different
        from the one provided in 2);

    4) When guest unmasks vector#1, Qemu finds it's outside of allocated
        vectors (only vector#0 now):

        a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free 
            irq for vector#0;

        b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
            irqs for both vector#0 and vector#1;

     5) When guest unmasks vector#2, same flow in 4) continues.

     ....

If above understanding is correct, how is lost interrupt avoided between 
4.a) and 4.b) given that irq has been torn down for vector#0 in the middle
while from guest p.o.v this vector is actually unmasked? There must be
a mechanism in place, but I just didn't figure it out...

Given above flow is robust, mapping Thomas's model to this flow is
straightforward. Assume idxd mdev has two vectors: vector#0 for
misc/error interrupt and vector#1 as completion interrupt for guest
sva. VFIO_DEVICE_SET_IRQS is handled by idxd mdev driver:

    2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
        DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not
        used for sva, MSIX_PERM#0 has PASID disabled. Host idxd driver 
        knows to register default PASID to msi_desc#0 when allocating irqs. 
        Then .startup() callback of ims irqchip is called to program default 
        PASID saved in msi_desc#0 to the target ims entry when request_irq().

    3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
        DEVICE_SET_IRQS to enable vector#0 again. Following same logic
        as vfio-pci, idxd driver first disable irq#0 via free_irq() and then
        re-enable irq#0 via request_irq(). It's still default PASID being used
        according to msi_desc#0.

    4) When guest unmasks vector#1, Qemu finds it's outside of allocated
        vectors (only vector#0 now):

        a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free 
            irq for vector#0. msi_desc#0 is also freed.

        b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
            irqs for both vector#0 and vector#1. At this point, MSIX_PERM#0
           has PASID disabled while MSIX_PERM#1 has a valid guest PASID1
           for sva. idxd driver registers default PASID to msix_desc#0 and 
           host PASID2 (translated from guest PASID1) to msix_desc#1 when
           allocating irqs. Later when both irqs are enabled via request_irq(),
           ims irqchip driver updates the target ims entries according to 
           msix_desc#0 and misx_desc#1 respectively.

But this is specific to how Qemu virtualizes MSI-X today. What about it
may change (or another device model) to allocate all table_size irqs 
when guest enables MSI-X capability? At that point we don't have valid
MSIX_PERM content to register PASID info to msix_desc. Possibly what 
we really require is a separate helper function allowing driver to update 
msix_desc after irq allocation, e.g. when guest unmasks a vector...

and do you see any other facets which are overlooked here?

Thanks
Kevin