* Virtualizing MSI-X on IMS via VFIO
@ 2021-06-22 10:16 Tian, Kevin
  2021-06-22 15:50 ` Dave Jiang
  2021-06-22 19:12 ` Alex Williamson

From: Tian, Kevin @ 2021-06-22 10:16 UTC
To: Alex Williamson (alex.williamson@redhat.com)
Cc: Thomas Gleixner, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, kvm, Kirti Wankhede

Hi, Alex,

I need your help to understand the current MSI-X virtualization flow in VFIO. Some background info first.

We have recently been discussing how to virtualize MSI-X with Interrupt Message Storage (IMS) on mdev:
https://lore.kernel.org/kvm/87im2lyiv6.ffs@nanos.tec.linutronix.de/

IMS is a device-specific interrupt storage, allowing interrupts to be generated in an optimized and scalable manner. An idxd mdev exposes a virtual MSI-X capability to the guest but physically uses IMS entries to generate interrupts.

Thomas has helped implement a generic ims irqchip driver:
https://lore.kernel.org/linux-hyperv/20200826112335.202234502@linutronix.de/

The idxd device allows software to specify an IMS entry (for triggering the completion interrupt) when submitting a descriptor. To prevent one mdev from triggering a malicious interrupt into another mdev (by specifying an arbitrary entry), an idxd IMS entry includes a PASID field for validation - only a matching PASID in the executed descriptor can trigger an interrupt via that entry. The idxd driver is expected to program IMS entries with PASIDs that are allocated to the mdev which owns those entries.

Other devices may use a different ID and format to isolate their IMS entries, so we need to abstract a generic means for programming a vendor-specific ID into a vendor-specific IMS entry without violating the layering model.
Thomas suggested that the vendor driver first register the ID information (possibly plus the location to write it to) in the msi_desc when allocating irqs (by extending the existing alloc function or via a new helper), and that the generic ims irqchip driver then write that ID into the IMS entry when the interrupt is started up by request_irq().

That leaves two questions to be answered:

1) How does the vendor driver decide which ID to register in the msi_desc?
2) How does Thomas's model map to the MSI-X virtualization flow in VFIO?

For the first open, there are two types of PASIDs on an idxd mdev:

1) the default PASID: one per mdev, allocated when the mdev is created;
2) sva PASIDs: multiple per mdev, allocated on demand (via the vIOMMU).

If no vIOMMU is exposed, all IMS entries of the mdev should be programmed with the default PASID, which is always available over the mdev's lifespan.

If a vIOMMU is exposed and guest sva is enabled, entries used for sva should be tagged with sva PASIDs, leaving the others tagged with the default PASID. To achieve intra-guest interrupt isolation, the guest idxd driver needs to program guest sva PASIDs into the virtual MSIX_PERM register (one per MSI-X entry) for validation. Access to MSIX_PERM is trap-and-emulated by the host idxd driver, which then figures out which PASID to register in the msi_desc (this requires PASID translation info via the new /dev/iommu proposal). The guest driver is expected to update MSIX_PERM before request_irq().

Now the second open requires your help. Below is what I learned from the current vfio/qemu code (for a vfio-pci device):

0) Qemu doesn't attempt to allocate all irqs as reported by msix->table_size. It is done in a dynamic and incremental way.

1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for allocating/enabling irqs given a set of vMSIX vectors [start, count]:

   a) if irqs are not allocated, allocate irqs [start, count].
Enable irqs for the specified vectors [start, count] via request_irq();
   b) if irqs are already allocated, enable irqs for the specified vectors;
   c) if irqs are already enabled, disable and re-enable them for the specified vectors, because the user may have specified a different eventfd;

2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0, even though it's currently masked by the guest. Interrupts are received by Qemu but blocked from the guest via mask/pending-bit emulation. The main intention is to enable physical MSI-X;

3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0 again, with an eventfd different from the one provided in 2);

4) When the guest unmasks vector#1, Qemu finds it's outside the allocated vectors (only vector#0 so far):

   a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the irq for vector#0;

   b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable irqs for both vector#0 and vector#1;

5) When the guest unmasks vector#2, the same flow as in 4) continues.
....

If the above understanding is correct, how is a lost interrupt avoided between 4.a) and 4.b), given that the irq has been torn down for vector#0 in the middle while from the guest's p.o.v. this vector is actually unmasked? There must be a mechanism in place, but I just didn't figure it out...

Assuming the above flow is robust, mapping Thomas's model onto it is straightforward. Assume the idxd mdev has two vectors: vector#0 for the misc/error interrupt and vector#1 as the completion interrupt for guest sva. VFIO_DEVICE_SET_IRQS is handled by the idxd mdev driver:

2) When the guest enables the virtual MSI-X capability, Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0. Because vector#0 is not used for sva, MSIX_PERM#0 has PASID disabled. The host idxd driver knows to register the default PASID in msi_desc#0 when allocating irqs. The .startup() callback of the ims irqchip is then called at request_irq() time to program the default PASID saved in msi_desc#0 into the target IMS entry.
3) When the guest unmasks vector#0 via request_irq(), Qemu calls VFIO_DEVICE_SET_IRQS to enable vector#0 again. Following the same logic as vfio-pci, the idxd driver first disables irq#0 via free_irq() and then re-enables it via request_irq(). It's still the default PASID being used, according to msi_desc#0.

4) When the guest unmasks vector#1, Qemu finds it's outside the allocated vectors (only vector#0 so far):

   a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the irq for vector#0. msi_desc#0 is also freed.

   b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable irqs for both vector#0 and vector#1. At this point MSIX_PERM#0 has PASID disabled while MSIX_PERM#1 holds a valid guest PASID1 for sva. The idxd driver registers the default PASID in msix_desc#0 and the host PASID2 (translated from guest PASID1) in msix_desc#1 when allocating irqs. Later, when both irqs are enabled via request_irq(), the ims irqchip driver updates the target IMS entries according to msix_desc#0 and msix_desc#1 respectively.

But this is specific to how Qemu virtualizes MSI-X today. What if it changes (or another device model chooses) to allocate all table_size irqs when the guest enables the MSI-X capability? At that point we don't have valid MSIX_PERM content from which to register PASID info in the msix_desc. Possibly what we really need is a separate helper function allowing the driver to update the msix_desc after irq allocation, e.g. when the guest unmasks a vector...

And do you see any other facets which are overlooked here?

Thanks
Kevin

^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO
@ 2021-06-22 15:50 ` Dave Jiang

From: Dave Jiang @ 2021-06-22 15:50 UTC
To: Tian, Kevin, Alex Williamson (alex.williamson@redhat.com)
Cc: Thomas Gleixner, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, kvm, Kirti Wankhede

On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> [...]
> 3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0 again. Following same logic
> as vfio-pci, idxd driver first disable irq#0 via free_irq() and then
> re-enable irq#0 via request_irq(). It's still default PASID being used
> according to msi_desc#0.

Hi Kevin, slight correction here. Because vector#0 is emulated for the idxd vdev, it has no IMS backing, so there is no msi_desc#0 for that vector. msi_desc#0 actually starts at vector#1, where IMS is allocated to back it. vector#0 does not go through request_irq(); it only has the eventfd part. Everything you say is correct, but it starts at vector#1.
* RE: Virtualizing MSI-X on IMS via VFIO
@ 2021-06-23 6:16 ` Tian, Kevin

From: Tian, Kevin @ 2021-06-23 6:16 UTC
To: Jiang, Dave, Alex Williamson (alex.williamson@redhat.com)
Cc: Thomas Gleixner, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, kvm, Kirti Wankhede

> From: Jiang, Dave <dave.jiang@intel.com>
> Sent: Tuesday, June 22, 2021 11:51 PM
>
> On 6/22/2021 3:16 AM, Tian, Kevin wrote:
> > [...]
> Hi Kevin, slight correction here. Because vector#0 is emulated for idxd
> vdev, it has no IMS backing. So there is no msi_desc#0 for that vector.
> msi_desc#0 actually starts at vector#1 where IMS is allocated to back
> it. vector#0 does not go through request_irq(). It only has eventfd
> part. Everything you say is correct but starts at vector#1.

You are right. But for simplicity of illustration, let's still assume both vector#0 and vector#1 are backed by IMS in the following discussion, since a purely emulated vector is outside of this context anyway. 😊

Thanks
Kevin
* Re: Virtualizing MSI-X on IMS via VFIO
@ 2021-06-22 19:12 ` Alex Williamson
  2021-06-22 23:59 ` Thomas Gleixner

From: Alex Williamson @ 2021-06-22 19:12 UTC
To: Tian, Kevin
Cc: Thomas Gleixner, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, kvm, Kirti Wankhede

On Tue, 22 Jun 2021 10:16:15 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
> [...]
> 0) Qemu doesn't attempt to allocate all irqs as reported by
> msix->table_size. It is done in an dynamic and incremental way.

Not by table_size; our expectation is that the device interrupt behavior can be implicitly affected by the enabled vectors, and the table size may support far more vectors than the driver might actually use.
It's also easier if we never need to get into the scenario of pci_alloc_irq_vectors() returning a smaller than requested number of vectors and needing to fall back to a vector negotiation that doesn't exist for MSI-X.

FWIW, more recent QEMU will scan the vector table for the highest non-masked vector to initially enable that number of vectors in the host, both to improve restore behavior after migration and to avoid overhead for guests that write the vector table before setting the MSI-X capability enable bit (Windows?).

> 2) When guest enables virtual MSI-X capability, Qemu calls VFIO_
> DEVICE_SET_IRQS to enable vector#0, even though it's currently
> masked by the guest. Interrupts are received by Qemu but blocked
> from guest via mask/pending bit emulation. The main intention is
> to enable physical MSI-X;

Yes, this is a bit awkward, since the interrupt API is via SET_IRQS and we don't allow writes to the MSI-X enable bit via config space.

> 4) When guest unmasks vector#1, Qemu finds it's outside of allocated
> vectors (only vector#0 now):
>
> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free
> irq for vector#0;
>
> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
> irqs for both vector#0 and vector#1;
>
> 5) When guest unmasks vector#2, same flow in 4) continues.
> ....
> If above understanding is correct, how is lost interrupt avoided between
> 4.a) and 4.b) given that irq has been torn down for vector#0 in the middle
> while from guest p.o.v this vector is actually unmasked? There must be
> a mechanism in place, but I just didn't figure it out...

In practice, unmasking new vectors is rare and done only at initialization, so the risk from lost interrupts at this time is low. When masking and unmasking vectors that are already in use, we're only changing the signaling eventfd between KVM and QEMU, such that QEMU can set emulated pending bits in response to interrupts (given our lack of interfaces to handle the mask/unmask at the host). I believe that locking in the vfio-pci driver prevents an interrupt from being lost during the eventfd switch.

> [...]
> But this is specific to how Qemu virtualizes MSI-X today. What about it
> may change (or another device model) to allocate all table_size irqs
> when guest enables MSI-X capability? At that point we don't have valid
> MSIX_PERM content to register PASID info to msix_desc. Possibly what
> we really require is a separate helper function allowing driver to update
> msix_desc after irq allocation, e.g. when guest unmasks a vector...

I think you're basically asking how you can guarantee that you always have a mechanism to update your device-specific MSIX_PERM table before or after the vector tables. You're trapping and emulating the MSIX_PERM table, so every write traps to your driver. Therefore, shouldn't the driver be able to set up any vector using the default PASID if MSIX_PERM is not configured, and update it with the correct PASID translation based on the trapped write to the register? Logically I think you'd want your guest driver to mask and unmask the affected vector around modifying the MSIX_PERM entry as well, so another option would be to reevaluate MSIX_PERM on unmask, which triggers a SET_IRQS ioctl into the idxd host driver. Either entry point could trigger a descriptor update to the host irqchip driver.

> and do you see any other facets which are overlooked here?
AIUI, the default with no sva PASID should always work: the host driver initializes the device with a virtual MSIX_PERM table with all the entries disabled, and no special guest/host driver coordination is required. In the case of setting a PASID for a vector, you have entry points into the driver either via the virtualization of the MSIX_PERM table or likely also at the unmasking of the vector, so it seems fully contained. Thanks,

Alex
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-22 19:12 ` Alex Williamson @ 2021-06-22 23:59 ` Thomas Gleixner 2021-06-23 6:12 ` Tian, Kevin 2021-06-23 15:19 ` Alex Williamson 0 siblings, 2 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-22 23:59 UTC (permalink / raw) To: Alex Williamson, Kevin Tian Cc: Jason Gunthorpe, Megha Dey, Ashok Raj, Jacob Pan, Dave Jiang, Yi Liu, Baolu Lu, Dan Williams, Tony Luck, Sanjay Kumar, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas

Alex,

[ Resend due to a delivery issue here. Sorry if you got this twice. ]

On Tue, Jun 22 2021 at 13:12, Alex Williamson wrote:
> On Tue, 22 Jun 2021 10:16:15 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
>> Now the 2nd open requires your help. Below is what I learned from
>> current vfio/qemu code (for vfio-pci device):
>>
>> 0) Qemu doesn't attempt to allocate all irqs as reported by
>> msix->table_size. It is done in a dynamic and incremental way.
>
> Not by table_size, our expectation is that the device interrupt
> behavior can be implicitly affected by the enabled vectors and the
> table size may support far more vectors than the driver might actually
> use. It's also easier if we never need to get into the scenario of
> pci_alloc_irq_vectors() returning a smaller than requested number of
> vectors and needing to fall back to a vector negotiation that doesn't
> exist via MSI-X.
>
> FWIW, more recent QEMU will scan the vector table for the highest
> non-masked vector to initially enable that number of vectors in the
> host, both to improve restore behavior after migration and avoid
> overhead for guests that write the vector table before setting the
> MSI-X capability enable bit (Windows?).
>
>> 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for
>> allocating/enabling irqs given a set of vMSIX vectors [start, count]:
>> a) if irqs not allocated, allocate irqs [start+count].
Enable irqs for
>> specified vectors [start, count] via request_irq();
>> b) if irqs already allocated, enable irqs for specified vectors;
>> c) if irqs already enabled, disable and re-enable irqs for specified
>> vectors because the user may specify a different eventfd;
>>
>> 2) When the guest enables the virtual MSI-X capability, Qemu calls
>> VFIO_DEVICE_SET_IRQS to enable vector#0, even though it's currently
>> masked by the guest. Interrupts are received by Qemu but blocked
>> from the guest via mask/pending bit emulation. The main intention is
>> to enable physical MSI-X;
>
> Yes, this is a bit awkward since the interrupt API is via SET_IRQS and
> we don't allow writes to the MSI-X enable bit via config space.
>
>> 3) When the guest unmasks vector#0 via request_irq(), Qemu calls
>> VFIO_DEVICE_SET_IRQS to enable vector#0 again, with an eventfd
>> different from the one provided in 2);
>>
>> 4) When the guest unmasks vector#1, Qemu finds it's outside of the
>> allocated vectors (only vector#0 now):
>>
>> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free the
>> irq for vector#0;
>>
>> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable
>> irqs for both vector#0 and vector#1;
>>
>> 5) When the guest unmasks vector#2, the same flow as in 4) continues.

That's dangerous and makes weird assumptions about interrupts being requested early in the driver init() function. But that's wishful thinking, really. There are enough drivers, especially networking, which request interrupts on device open() and not on device init(). Some special functions only request the irq way later, i.e. only when someone uses that special function, and at this point the other irqs of that very same device are already in use.
>
> In practice unmasking new vectors is rare and done only at
> initialization. Risk from lost interrupts at this time is low.

See above. Wishful thinking.

OMG, I really don't want to be the one to debug _WHY_ a device lost interrupts just because it did a late request_irq() when the device is operational already and has other interrupts active.

> When masking and unmasking vectors that are already in use, we're only
> changing the signaling eventfd between KVM and QEMU such that QEMU can
> set emulated pending bits in response to interrupts (and our lack of
> interfaces to handle the mask/unmask at the host). I believe that
> locking in the vfio-pci driver prevents an interrupt from being lost
> during the eventfd switch.

Let's look at this from a driver perspective:

1) Driver checks how many entries are possible in the MSI-X table

2) Driver invokes pci_msix_enable[_range]() which returns the number of vectors the system is willing to hand out to the driver.

3) Driver assigns the vectors to the different resources in the hardware

4) Later on these interrupts are requested, but not necessarily during device init.

Yes, request_irq() can fail, and today it can also fail due to CPU vector exhaustion. That's perfectly fine as the driver can handle the failure and act accordingly.

All of this is consistent and well defined.

Now let's look at the guest. This VFIO mechanism introduces two brand new failure modes because of this:

guest::unmask()
    trapped_by_host()
        free_irqs();
        pci_free_irq_vectors();
            pci_disable_msix();
        pci_alloc_irq_vectors();
            pci_enable_msix();
        request_irqs();

#1 What happens if the host side allocation or the host side request_irq() fails?

   a) Guest is killed?
   b) Failure is silently ignored and the guest just does not receive interrupts?
   c) Any other mechanism?
Whatever it is, it simply _cannot_ make request_irq() in the guest fail, because the guest has already passed all failure points and is in the low level function of unmasking the interrupt for the first time, after which it will return 0, aka success. So if you answer #a, fine with me. It's well defined.

#2 What happens to already active interrupts on that device which might fire during that time?

They get lost or are redirected to the legacy PCI interrupt and there is absolutely nothing you can do about that. Simply because to prevent that, the host side would have to disable the interrupt source at the device level, i.e. fiddle with actual device registers to shut up interrupt delivery and reenable it afterwards again, thereby racing against some other VCPU of the same guest which fiddles with those very same registers.

IOW, undefined behaviour, which the "VFIO design" shrugged off on completely unjustified assumptions. No matter how much you argue about this being unlikely, this is just broken. Unlikely simply does not exist at cloud scale.

Aside from that, how did you come to the conclusion that #2 does not matter? By analyzing _every_ open and closed source driver for their usage patterns and documenting that all drivers which want to work in VFIO-PCI space have to follow specific rules vs. interrupt setup and usage? I'm pretty sure that you have a mechanism in place which monitors closely whether a driver violates those well documented rules. Yes, I know that I'm dreaming and the reality is that this is based on interesting assumptions and just works by chance.

I have no idea _why_ this has been done that way. The changelogs of the relevant commits are void of useful content and lack links to the possibly existing discussions about this. I can only assume that back then the main problem was vector exhaustion on the host and to avoid allocating memory for interrupt descriptors etc, right?
The host vector exhaustion problem was that each MSIX vector consumed a real CPU vector, which is a limited resource on X86. This is no longer the case today:

1) pci_msix_enable[range]() consumes exactly zero CPU vectors from the allocatable range, independent of the number of MSIX vectors it allocates, unless it is in multi-queue managed mode where it will actually reserve a vector (maybe two) per CPU.

   But devices which are not using that mode just opportunistically "reserve" vectors. All entries are initialized with a special system vector which, when raised, will emit a nastigram in dmesg.

2) request_irq() actually allocates a CPU vector from the allocatable vector space, which can obviously still fail, which is perfectly fine.

So the only downside today of allocating more MSI-X vectors than necessary is memory consumption for the irq descriptors.

Though for virtualization there is still another problem:

Even if all possible MSI-X vectors for a passthrough PCI device were allocated upfront, independent of the actual usage in the guest, there is still the issue of request_irq() failing on the host once the guest decides to use any of those interrupts.

It's a halfway reasonable argumentation, by some definition of reasonable, that this case would be a host system configuration problem and the admin who overcommitted is responsible for the consequence. Where the only reasonable consequence is to kill the guest right there, because there is no mechanism to actually tell it that the host ran out of resources.

Not at all a pretty solution, but it is, contrary to the status quo, well defined. The most important aspect is that it is well defined for the case of success:

If it succeeds then there is no way that already requested interrupts can be lost or end up being redirected to the legacy PCI irq due to clearing the MSIX enable bit, which is a gazillion times better than the "let's hope it works" based tinkerware we have now.
So, aside from the existing VFIO/PCI/MSIX thing being just half thought out, even thinking about proliferating this with IMS is bonkers.

IMS is meant to avoid the problem of MSI-X, which must be disabled in order to expand the number of vectors. The use cases are going to be even more dynamic than the usual device drivers, so the lost interrupt issue will be much more likely to trigger.

So no, we are not going to proliferate this complete ignorance of how MSI-X actually works and just cram another "feature" into code which is known to be incorrect.

Thanks, tglx

^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-22 23:59 ` Thomas Gleixner @ 2021-06-23 6:12 ` Tian, Kevin 2021-06-23 16:31 ` Thomas Gleixner 2021-06-23 15:19 ` Alex Williamson 1 sibling, 1 reply; 29+ messages in thread From: Tian, Kevin @ 2021-06-23 6:12 UTC (permalink / raw) To: Thomas Gleixner, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas > From: Thomas Gleixner <tglx@linutronix.de> > Sent: Wednesday, June 23, 2021 7:59 AM > [...] > I only can assume that back then the main problem was vector exhaustion > on the host and to avoid allocating memory for interrupt descriptors > etc, right? > > The host vector exhaustion problem was that each MSIX vector consumed a > real CPU vector which is a limited resource on X86. This is not longer > the case today: > > 1) pci_msix_enable[range]() consumes exactly zero CPU vectors from > the allocatable range independent of the number of MSIX vectors > it allocates, unless it is in multi-queue managed mode where it > will actually reserve a vector (maybe two) per CPU. > > But for devices which are not using that mode, they just > opportunistically "reserve" vectors. > > All entries are initialized with a special system vector which > when raised will emit a nastigram in dmesg. > > 2) request_irq() actually allocates a CPU vector from the > allocatable vector space which can obviously still fail, which is > perfectly fine. > > So the only downside today of allocating more MSI-X vectors than > necessary is memory consumption for the irq descriptors. Curious about irte entry when IRQ remapping is enabled. Is it also allocated at request_irq()? 
> > Though for virtualization there is still another problem: > > Even if all possible MSI-X vectors for a passthrough PCI device would > be allocated upfront independent of the actual usage in the guest, > then there is still the issue of request_irq() failing on the host > once the guest decides to use any of those interrupts. > > It's a halfways reasonable argumentation by some definition of > reasonable, that this case would be a host system configuration problem > and the admin who overcommitted is responsible for the consequence. > > Where the only reasonable consequence is to kill the guest right there > because there is no mechanism to actually tell it that the host ran out > of resources. > > Not at all a pretty solution, but it is contrary to the status quo well > defined. The most important aspect is that it is well defined for the > case of success: > > If it succeeds then there is no way that already requested interrupts > can be lost or end up being redirected to the legacy PCI irq due to > clearing the MSIX enable bit, which is a gazillion times better than > the "let's hope it works" based tinkerware we have now. fair enough. > > So, aside of the existing VFIO/PCI/MSIX thing being just half thought > out, even thinking about proliferating this with IMS is bonkers. > > IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X > in order to expand the number of vectors. The use cases are going to be > even more dynamic than the usual device drivers, so the lost interrupt > issue will be much more likely to trigger. > > So no, we are not going to proliferate this complete ignorance of how > MSI-X actually works and just cram another "feature" into code which is > known to be incorrect. > So the correct flow is like below: guest::enable_msix() trapped_by_host() pci_alloc_irq_vectors(); // for all possible vMSI-X entries pci_enable_msix(); guest::unmask() trapped_by_host() request_irqs(); the first trap calls a new VFIO ioctl e.g. 
VFIO_DEVICE_ALLOC_IRQS. The 2nd trap can reuse the existing VFIO_DEVICE_SET_IRQS, which just does request_irq() if the specified irqs have been allocated.

Then map ims to this flow:

guest::enable_msix()
    trapped_by_host()
        msi_domain_alloc_irqs();  // for all possible vMSI-X entries
        for_all_allocated_irqs(i)
            pci_update_msi_desc_id(i, default_pasid);  // a new helper func

guest::unmask(entry#0)
    trapped_by_host()
        request_irqs();
        ims_array_irq_startup();  // write msi_desc.id (default_pasid) to ims entry

guest::set_msix_perm(entry#1, guest_sva_pasid)
    trapped_by_host()
        pci_update_msi_desc_id(1, host_sva_pasid);

guest::unmask(entry#1)
    trapped_by_host()
        request_irqs();
        ims_array_irq_startup();  // write msi_desc.id (host_sva_pasid) to ims entry

Does the above match your thoughts?

Thanks
Kevin

^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-23 6:12 ` Tian, Kevin @ 2021-06-23 16:31 ` Thomas Gleixner 2021-06-23 16:41 ` Jason Gunthorpe 2021-06-23 23:37 ` Tian, Kevin 0 siblings, 2 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-23 16:31 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Wed, Jun 23 2021 at 06:12, Kevin Tian wrote: >> From: Thomas Gleixner <tglx@linutronix.de> >> So the only downside today of allocating more MSI-X vectors than >> necessary is memory consumption for the irq descriptors. > > Curious about irte entry when IRQ remapping is enabled. Is it also > allocated at request_irq()? Good question. No, it has to be allocated right away. We stick the shutdown vector into the IRTE and then request_irq() will update it with the real one. > So the correct flow is like below: > > guest::enable_msix() > trapped_by_host() > pci_alloc_irq_vectors(); // for all possible vMSI-X entries > pci_enable_msix(); > > guest::unmask() > trapped_by_host() > request_irqs(); > > the first trap calls a new VFIO ioctl e.g. VFIO_DEVICE_ALLOC_IRQS. > > the 2nd trap can reuse existing VFIO_DEVICE_SET_IRQS which just > does request_irq() if specified irqs have been allocated. 
> > Then map ims to this flow: > > guest::enable_msix() > trapped_by_host() > msi_domain_alloc_irqs(); // for all possible vMSI-X entries > for_all_allocated_irqs(i) > pci_update_msi_desc_id(i, default_pasid); // a new helper func > > guest::unmask(entry#0) > trapped_by_host() > request_irqs(); > ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims entry > > guest::set_msix_perm(entry#1, guest_sva_pasid) > trapped_by_host() > pci_update_msi_desc_id(1, host_sva_pasid); > > guest::unmask(entry#1) > trapped_by_host() > request_irqs(); > ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims entry That's one way to do that, but that still has the same problem that the request_irq() in the guest succeeds even if the host side fails. As this is really new stuff there is no real good reason to force that into the existing VFIO/MSIX stuff with all it's known downsides and limitations. The point is, that IMS can just add another interrupt to a device on the fly without doing any of the PCI/MSIX nasties. So why not take advantage of that? I can see the point of using PCI to expose the device to the guest because it's trivial to enumerate, but contrary to VF devices there is no legacy and the mechanism how to setup the device interrupts can be completely different from PCI/MSIX. Exposing some trappable "IMS" storage in a separate PCI bar won't cut it because this still has the same problem that the allocation or request_irq() on the host can fail w/o feedback. So IMO creating a proper paravirt interface is the right approach. It avoids _all_ of the trouble and will be necessary anyway once you want to support devices which store the message/pasid in system memory and not in on-device memory. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-23 16:31 ` Thomas Gleixner @ 2021-06-23 16:41 ` Jason Gunthorpe 2021-06-23 23:41 ` Tian, Kevin 2021-06-23 23:37 ` Tian, Kevin 1 sibling, 1 reply; 29+ messages in thread From: Jason Gunthorpe @ 2021-06-23 16:41 UTC (permalink / raw) To: Thomas Gleixner Cc: Tian, Kevin, Alex Williamson, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Wed, Jun 23, 2021 at 06:31:34PM +0200, Thomas Gleixner wrote: > So IMO creating a proper paravirt interface is the right approach. It > avoids _all_ of the trouble and will be necessary anyway once you want > to support devices which store the message/pasid in system memory and > not in on-device memory. I think this is basically where we got to in the other earlier discussion with using IMS natively in VMs - it can't be done generically without a new paravirt interface. The guest needs a paravirt interface to program the IOMMU to route MSI vectors to the guest's vAPIC and then the guest itself can deliver an addr/data pair directly to the HW. In this mode qemu would not emulate MSI at all so will avoid all the problems you identified. How to build that and provide backwards compat is an open question. Instead that thread went into blocking IMS on VM situations.. Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-23 16:41 ` Jason Gunthorpe @ 2021-06-23 23:41 ` Tian, Kevin 0 siblings, 0 replies; 29+ messages in thread From: Tian, Kevin @ 2021-06-23 23:41 UTC (permalink / raw) To: Jason Gunthorpe, Thomas Gleixner Cc: Alex Williamson, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Thursday, June 24, 2021 12:41 AM > > On Wed, Jun 23, 2021 at 06:31:34PM +0200, Thomas Gleixner wrote: > > > So IMO creating a proper paravirt interface is the right approach. It > > avoids _all_ of the trouble and will be necessary anyway once you want > > to support devices which store the message/pasid in system memory and > > not in on-device memory. > > I think this is basically where we got to in the other earlier > discussion with using IMS natively in VMs - it can't be done > generically without a new paravirt interface. > > The guest needs a paravirt interface to program the IOMMU to route MSI > vectors to the guest's vAPIC and then the guest itself can deliver an > addr/data pair directly to the HW. > > In this mode qemu would not emulate MSI at all so will avoid all the > problems you identified. No emulation for PF/VF. But emulation might be required for mdev for two reasons: 1) the ims entries for mdevs are collapsed together; 2) there are other fields in ims entry which cannot allow guest to control, e.g. PASID; > > How to build that and provide backwards compat is an open > question. Instead that thread went into blocking IMS on VM situations.. > > Jason ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-23 16:31 ` Thomas Gleixner 2021-06-23 16:41 ` Jason Gunthorpe @ 2021-06-23 23:37 ` Tian, Kevin 2021-06-24 1:18 ` Thomas Gleixner 1 sibling, 1 reply; 29+ messages in thread From: Tian, Kevin @ 2021-06-23 23:37 UTC (permalink / raw) To: Thomas Gleixner, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas > From: Thomas Gleixner <tglx@linutronix.de> > Sent: Thursday, June 24, 2021 12:32 AM > > On Wed, Jun 23 2021 at 06:12, Kevin Tian wrote: > >> From: Thomas Gleixner <tglx@linutronix.de> > >> So the only downside today of allocating more MSI-X vectors than > >> necessary is memory consumption for the irq descriptors. > > > > Curious about irte entry when IRQ remapping is enabled. Is it also > > allocated at request_irq()? > > Good question. No, it has to be allocated right away. We stick the > shutdown vector into the IRTE and then request_irq() will update it with > the real one. There are max 64K irte entries per Intel VT-d. Do we consider it as a limited resource in this new model, though it's much more than CPU vectors? > > > So the correct flow is like below: > > > > guest::enable_msix() > > trapped_by_host() > > pci_alloc_irq_vectors(); // for all possible vMSI-X entries > > pci_enable_msix(); > > > > guest::unmask() > > trapped_by_host() > > request_irqs(); > > > > the first trap calls a new VFIO ioctl e.g. VFIO_DEVICE_ALLOC_IRQS. > > > > the 2nd trap can reuse existing VFIO_DEVICE_SET_IRQS which just > > does request_irq() if specified irqs have been allocated. 
> > > > Then map ims to this flow: > > > > guest::enable_msix() > > trapped_by_host() > > msi_domain_alloc_irqs(); // for all possible vMSI-X entries > > for_all_allocated_irqs(i) > > pci_update_msi_desc_id(i, default_pasid); // a new helper func > > > > guest::unmask(entry#0) > > trapped_by_host() > > request_irqs(); > > ims_array_irq_startup(); // write msi_desc.id (default_pasid) to ims > entry > > > > guest::set_msix_perm(entry#1, guest_sva_pasid) > > trapped_by_host() > > pci_update_msi_desc_id(1, host_sva_pasid); > > > > guest::unmask(entry#1) > > trapped_by_host() > > request_irqs(); > > ims_array_irq_startup(); // write msi_desc.id (host_sva_pasid) to ims > entry > > That's one way to do that, but that still has the same problem that the > request_irq() in the guest succeeds even if the host side fails. yes > > As this is really new stuff there is no real good reason to force that > into the existing VFIO/MSIX stuff with all it's known downsides and > limitations. > > The point is, that IMS can just add another interrupt to a device on the > fly without doing any of the PCI/MSIX nasties. So why not take advantage > of that? > > I can see the point of using PCI to expose the device to the guest > because it's trivial to enumerate, but contrary to VF devices there is also about compatibility since PCI is supported by almost all OSes. > no legacy and the mechanism how to setup the device interrupts can be > completely different from PCI/MSIX. > > Exposing some trappable "IMS" storage in a separate PCI bar won't cut it > because this still has the same problem that the allocation or > request_irq() on the host can fail w/o feedback. yes to fully fix the said nasty some feedback mechanism is required. > > So IMO creating a proper paravirt interface is the right approach. It > avoids _all_ of the trouble and will be necessary anyway once you want > to support devices which store the message/pasid in system memory and > not in on-device memory. 
>

While I agree a paravirt interface is definitely cleaner, I wonder whether this should be done orthogonally or tied to all new ims-capable devices. Back to the earlier discussion about guest ims support, you explained a layered model where the paravirt interface sits between the msi domain and the vector domain to get the addr/data pair from the host. In this way it could provide a feedback mechanism for both msi and ims devices, thus not specific to ims only. Then, considering the transition window where not all guest OSes may support a paravirt interface at the same time (or there are multiple paravirt interfaces which take time for the host to support), would the staging approach below still make sense?

1) Fix the lost interrupt issue in the existing MSI virtualization flow;

2) Virtualize MSI-X on IMS, bearing the same request_irq() problem;

3) Develop a paravirt interface to solve the request_irq() problem for both msi and ims devices;

Thanks
Kevin

^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-23 23:37 ` Tian, Kevin @ 2021-06-24 1:18 ` Thomas Gleixner 2021-06-24 2:41 ` Tian, Kevin 0 siblings, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2021-06-24 1:18 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas Kevin! On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote: >> From: Thomas Gleixner <tglx@linutronix.de> >> > Curious about irte entry when IRQ remapping is enabled. Is it also >> > allocated at request_irq()? >> >> Good question. No, it has to be allocated right away. We stick the >> shutdown vector into the IRTE and then request_irq() will update it with >> the real one. > > There are max 64K irte entries per Intel VT-d. Do we consider it as > a limited resource in this new model, though it's much more than > CPU vectors? It's surely a limited resource. For me 64k entries seems to be plenty, but what do I know. I'm not a virtualization wizard. > Back to earlier discussion about guest ims support, you explained a layered > model where the paravirt interface sits between msi domain and vector > domain to get addr/data pair from the host. In this way it could provide > a feedback mechanism for both msi and ims devices, thus not specific > to ims only. Then considering the transition window where not all guest > OSes may support paravirt interface at the same time (or there are > multiple paravirt interfaces which takes time for host to support all), > would below staging approach still makes sense? > > 1) Fix the lost interrupt issue in existing MSI virtualization flow; That _cannot_ be fixed without a hypercall. See my reply to Alex. > 2) Virtualize MSI-X on IMS, bearing the same request_irq() problem; That solves what? 
Maybe your perceived roadmap problem, but certainly not any technical problem in the right way. Again: see my reply to Alex.

> 3) Develop a paravirt interface to solve request_irq() problem for
> both msi and ims devices;

First of all it's not a request_irq() problem: it's a plain resource management problem which requires proper interaction between host and guest.

And yes, it _is_ the correct answer to the problem, and as I outlined in my reply to Alex already, it is _not_ rocket science and it won't make a significant difference on your timeline, because it's straightforward and solves the problem properly, with the added benefit of solving existing problems which should and could have been solved long ago.

I don't care at all about the time you are wasting with half-baked thoughts about avoiding the right thing, but I very much care about my time wasted debunking them.

Thanks, tglx

^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-24 1:18 ` Thomas Gleixner @ 2021-06-24 2:41 ` Tian, Kevin 2021-06-24 15:14 ` Thomas Gleixner 2021-06-24 17:03 ` Jacob Pan 0 siblings, 2 replies; 29+ messages in thread From: Tian, Kevin @ 2021-06-24 2:41 UTC (permalink / raw) To: Thomas Gleixner, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas > From: Thomas Gleixner <tglx@linutronix.de> > Sent: Thursday, June 24, 2021 9:19 AM > > Kevin! > > On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote: > >> From: Thomas Gleixner <tglx@linutronix.de> > >> > Curious about irte entry when IRQ remapping is enabled. Is it also > >> > allocated at request_irq()? > >> > >> Good question. No, it has to be allocated right away. We stick the > >> shutdown vector into the IRTE and then request_irq() will update it with > >> the real one. > > > > There are max 64K irte entries per Intel VT-d. Do we consider it as > > a limited resource in this new model, though it's much more than > > CPU vectors? > > It's surely a limited resource. For me 64k entries seems to be plenty, > but what do I know. I'm not a virtualization wizard. > > > Back to earlier discussion about guest ims support, you explained a layered > > model where the paravirt interface sits between msi domain and vector > > domain to get addr/data pair from the host. In this way it could provide > > a feedback mechanism for both msi and ims devices, thus not specific > > to ims only. Then considering the transition window where not all guest > > OSes may support paravirt interface at the same time (or there are > > multiple paravirt interfaces which takes time for host to support all), > > would below staging approach still makes sense? 
> > > > 1) Fix the lost interrupt issue in existing MSI virtualization flow; > > That _cannot_ be fixed without a hypercall. See my reply to Alex. The lost interrupt issue was caused due to resizing based on stale impression of vector exhaustion. With your explanation this issue can be partially fixed by having Qemu allocate all possible irqs when guest enables msi-x and never resizes it before guest disables msi-x. The remaining problem is no feedback to block guest request_irq() in case of vector shortage. This has to be solved via paravirt interface but fixing lost interrupt alone is still a step forward for guest which doesn't implement the paravirt interface. > > > 2) Virtualize MSI-X on IMS, bearing the same request_irq() problem; > > That solves what? Maybe your perceived roadmap problem, but certainly > not any technical problem in the right way. Again: See my reply to Alex. Not about roadmap. See explanation below. > > > 3) Develop a paravirt interface to solve request_irq() problem for > > both msi and ims devices; > > First of all it's not a request_irq() problem: It's a plain resource > management problem which requires proper interaction between host and > guest. sure. > > And yes, it _is_ the correct answer to the problem and as I outlined in > my reply to Alex already it is _not_ rocket science and it won't make a > significant difference on your timeline because it's straight forward > and solves the problem properly with the added benefit to solve existing > problems which should and could have been solved long ago. > > I don't care at all about the time you are wasting with half baken > thoughts about avoiding to do the right thing, but I very much care > about my time wasted to debunk them. > I'm really not thinking from any angle of roadmap thing, and I actually very much appreciate all of your comments on the right direction. All my comments are purely based on possible use scenarios. 
I will give more explanation below and hope you can consider it a thought exercise to compose the full picture based on your guidance, rather than a half-baked idea wasting your time. 😊 At any time guest OSes can be categorized into three classes: a) doesn't implement any paravirt interface for vector allocation; b) implements a paravirt interface that is already supported by KVM; c) implements a paravirt interface which is not yet supported by KVM; The transition phase from c) to b) is undefined, but it does exist more or less. For example a Windows guest will never implement the interface defined between Linux guest and Linux host. It will have its own Hyper-V variation which likely takes time for KVM to emulate and claim support. Transition from a) to b) or a) to c) is a guest-side choice. It's not controlled by the host world. Here I didn't further differentiate whether a guest OS supports IMS, since once a supported paravirt interface is in place both MSI and IMS can get the necessary feedback info from the host side. Then let's look at the host side: 1) kernel versions before we conduct any discussed change: This is a known broken world as you explained. irq resizing could lead to lost interrupts in all three guest classes. The only mitigation is to document this limitation somewhere. We'll not enable IMS on this broken framework. 2) kernel versions after we make a clean refactoring: a) For a guest OS which doesn't implement a paravirt interface: c) For a guest OS which implements a paravirt interface not supported by KVM: You confirmed that recent kernels (4.15+) all use reservation mode to avoid vector exhaustion. So VFIO can define a new protocol asking its userspace to disable resizing by allocating all possible irqs when guest MSI-X is enabled. This is one step forward, fixing the lost interrupt issue, and is what step 1) in my proposal tries to achieve. 
But there remains a limitation, as no feedback is provided to the guest to block it when host vectors are in short supply. But that's the reality that we have to bear for such guests. VFIO returns the error info to userspace and lets it decide how to react. It's not elegant, but an improvement over the status quo. And we do see value in enabling an IMS-capable device/subdevice within such a guest, though the guest will just fall back to MSI-X. This is what step 2) in my proposal is about; b) For a guest OS which implements a paravirt interface supported by KVM: This is the right framework that you just described. With such an interface in place, the guest needs to proactively claim resources from the host side before it can actually enable a specific MSI/IMS entry. Everything is well set via cooperation between host and guest. If you agree with the above, then my proposal is not about making half-baked stuff. It's really about splitting the tasks: doing the conservative stuff first, which works for most guests, and then optimizing things for new guests. And strictly speaking we don't want to leave the paravirt work until very late, since conceptually it's the much cleaner approach. We do plan to find resources to initiate a separate design discussion in parallel with fixing the lost interrupt issue for a) and c). Does this rationale sound good to you? Thanks Kevin ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-24 2:41 ` Tian, Kevin @ 2021-06-24 15:14 ` Thomas Gleixner 2021-06-24 21:44 ` Alex Williamson 2021-06-24 17:03 ` Jacob Pan 1 sibling, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2021-06-24 15:14 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas Kevin! On Thu, Jun 24 2021 at 02:41, Kevin Tian wrote: >> From: Thomas Gleixner <tglx@linutronix.de> >> On Wed, Jun 23 2021 at 23:37, Kevin Tian wrote: >> > 1) Fix the lost interrupt issue in existing MSI virtualization flow; >> >> That _cannot_ be fixed without a hypercall. See my reply to Alex. > > The lost interrupt issue was caused due to resizing based on stale > impression of vector exhaustion. > > With your explanation this issue can be partially fixed by having Qemu > allocate all possible irqs when guest enables msi-x and never resizes > it before guest disables msi-x. Yes, that works with all the downsides attached to it. > The remaining problem is no feedback to block guest request_irq() > in case of vector shortage. This has to be solved via paravirt interface > but fixing lost interrupt alone is still a step forward for guest which > doesn't implement the paravirt interface. Fair enough. > At any time guest OSes can be categorized into three classes: > > a) doesn't implement any paravirt interface for vector allocation; > > b) implement one paravirt interface that has been supported by KVM; > > c) implement one paravirt interface which has not been supported by KVM; > > The transition phase from c) to b) is undefined, but it does exist more > or less. For example a windows guest will never implement the interface > defined between Linux guest and Linux host. 
It will have its own hyperv > variation which likely takes time for KVM to emulate and claim support. > > Transition from a) to b) or a) to c) is a guest-side choice. It's not > controlled by the host world. That's correct. > Here I didn't further differentiate whether a guest OS support ims, since > once a supported paravirt interface is in place both msi and ims can get > necessary feedback info from the host side. > > Then let's look at the host side: > > 1) kernel versions before we conduct any discussed change: > > This is a known broken world as you explained. irq resizing could > lead to lost interrupts in all three guest classes. The only mitigation > is to document this limitation somewhere. > > We'll not enable ims based on this broken framework. > > 2) kernel versions after we make a clean refactoring: > > a) For guest OS which doesn't implement paravirt interface: > c) For guest OS which implement a paravirt interface not > supported by KVM: > > You confirmed that recent kernels (since 4.15+) all uses > reservation mode to avoid vector exhaustion. So VFIO can > define a new protocol asking its userspace to disable resizing > by allocating all possible irqs when guest msix is enabled. This > is one step forward by fixing the lost interrupt issue and is what > the step-1) in my proposal tries to achieve. After studying the MSI-X specification again, I think there is another option to solve this for MSI-X, i.e. the dynamic sizing part: MSI requires to disable MSI in order to update the number of enabled vectors in the control word. MSI-X does not have that requirement as there is no 'number of used vectors' control field. MSI-X provides a fixed sized vector table and enabling MSI-X "activates" the full table. System software has to set proper messages in the table and eventually associate the table entries to device (sub)functions if that's not hardwired in the device and controlled by queue enablement etc. 
According to the specification there is no requirement for masked table entries to contain a valid message: "Mask Bit: ... When this bit is set, the function is prohibited from sending a message using this MSI-X Table entry." which means that the function must re-read the table entry when the mask bit in the vector control word is cleared. So instead of tearing down the whole set and then bringing it up again, which is wrong, the kernel could allocate an interrupt descriptor and append an MSI entry, write the table entry and unmask it afterwards. There are a couple of things to get there: 1) MSI descriptor list handling. The msi descriptor list of a device is assumed to be immutable after pci_enable_msix() has successfully enabled all of it, which means that there is no serialization in place. IIRC, the original attempt to glue IMS into the existing MSI management had a patch to add the required protections. That part could be dusted off. 2) Provide the required functionality in the MSI irq domain infrastructure. 3) Expose this functionality through a proper interface. That's append-only, of course. It's not clear to me whether this is worth the effort, but at least it is a viable solution if memory consumption or IRTE consumption is an actual concern. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-24 15:14 ` Thomas Gleixner @ 2021-06-24 21:44 ` Alex Williamson 2021-06-25 5:21 ` Tian, Kevin 2021-06-25 8:29 ` Thomas Gleixner 0 siblings, 2 replies; 29+ messages in thread From: Alex Williamson @ 2021-06-24 21:44 UTC (permalink / raw) To: Thomas Gleixner Cc: Tian, Kevin, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Thu, 24 Jun 2021 17:14:39 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > After studying the MSI-X specification again, I think there is another > option to solve this for MSI-X, i.e. the dynamic sizing part: > > MSI requires to disable MSI in order to update the number of enabled > vectors in the control word. Exactly what part of the spec requires this? This is generally the convention I expect too, and there are complications around contiguous vectors and data field alignment, but I'm not actually able to find a requirement in the spec that MSI Enable must be 0 when modifying other writable fields or that writable fields are latched when MSI Enable is set. > MSI-X does not have that requirement as there is no 'number of used > vectors' control field. MSI-X provides a fixed sized vector table and > enabling MSI-X "activates" the full table. > > System software has to set proper messages in the table and eventually > associate the table entries to device (sub)functions if that's not > hardwired in the device and controlled by queue enablement etc. > > According to the specification there is no requirement for masked table > entries to contain a valid message: > > "Mask Bit: ... When this bit is set, the function is prohibited from > sending a message using this MSI-X Table entry." > > which means that the function must reread the table entry when the mask > bit in the vector control word is cleared. 
What is a "valid" message as far as the device is concerned? "Valid" is meaningful to system software and hardware, the device doesn't care. Like MSI above, I think the real question is when is the data latched by the hardware. For MSI-X this seems to be addressed in (PCIe 5.0 spec) 6.1.4.2 MSI-X Configuration: Software must not modify the Address, Data, or Steering Tag fields of an entry while it is unmasked. Followed by 6.1.4.5 Per-vector Masking and Function Masking: For MSI-X, a Function is permitted to cache Address and Data values from unmasked MSI-X Table entries. However, anytime software unmasks a currently masked MSI-X Table entry either by Clearing its Mask bit or by Clearing the Function Mask bit, the Function must update any Address or Data values that it cached from that entry. If software changes the Address or Data value of an entry while the entry is unmasked, the result is undefined. So caching/latching occurs on unmask for MSI-X, but I can't find similar statements for MSI. If you have, please note them. It's possible MSI is per interrupt. Anyway, at least MSI-X if not also MSI could have a !NORESIZE implementation, which is why this flag exists in vfio. Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-24 21:44 ` Alex Williamson @ 2021-06-25 5:21 ` Tian, Kevin 2021-06-25 8:43 ` Thomas Gleixner 2021-06-25 8:29 ` Thomas Gleixner 1 sibling, 1 reply; 29+ messages in thread From: Tian, Kevin @ 2021-06-25 5:21 UTC (permalink / raw) To: Alex Williamson, Thomas Gleixner Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Friday, June 25, 2021 5:45 AM > > On Thu, 24 Jun 2021 17:14:39 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > > > After studying the MSI-X specification again, I think there is another > > option to solve this for MSI-X, i.e. the dynamic sizing part: > > > > MSI requires to disable MSI in order to update the number of enabled > > vectors in the control word. > > Exactly what part of the spec requires this? This is generally the > convention I expect too, and there are complications around contiguous > vectors and data field alignment, but I'm not actually able to find a > requirement in the spec that MSI Enable must be 0 when modifying other > writable fields or that writable fields are latched when MSI Enable is > set. > > > MSI-X does not have that requirement as there is no 'number of used > > vectors' control field. MSI-X provides a fixed sized vector table and > > enabling MSI-X "activates" the full table. > > > > System software has to set proper messages in the table and eventually > > associate the table entries to device (sub)functions if that's not > > hardwired in the device and controlled by queue enablement etc. > > > > According to the specification there is no requirement for masked table > > entries to contain a valid message: > > > > "Mask Bit: ... When this bit is set, the function is prohibited from > > sending a message using this MSI-X Table entry." 
> > > > which means that the function must reread the table entry when the mask > > bit in the vector control word is cleared. > > What is a "valid" message as far as the device is concerned? "Valid" > is meaningful to system software and hardware, the device doesn't care. > > Like MSI above, I think the real question is when is the data latched > by the hardware. For MSI-X this seems to be addressed in (PCIe 5.0 > spec) 6.1.4.2 MSI-X Configuration: > > Software must not modify the Address, Data, or Steering Tag fields of > an entry while it is unmasked. > > Followed by 6.1.4.5 Per-vector Masking and Function Masking: > > For MSI-X, a Function is permitted to cache Address and Data values > from unmasked MSI-X Table entries. However, anytime software unmasks > a currently masked MSI-X Table entry either by Clearing its Mask bit > or by Clearing the Function Mask bit, the Function must update any > Address or Data values that it cached from that entry. If software > changes the Address or Data value of an entry while the entry is > unmasked, the result is undefined. > > So caching/latching occurs on unmask for MSI-X, but I can't find > similar statements for MSI. If you have, please note them. It's > possible MSI is per interrupt. I checked PCI Local Bus Specification rev3.0. At that time MSI and MSI-X were described/compared together in almost every paragraph in 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph that you cited is the last one in that section. It's a pity that MSI is not clarified in this paragraph but it gives me the impression that MSI function is not permitted to cache address and data values. Later after MSI and MSI-X descriptions were split into separate sections in PCIe spec, this impression is definitely weakened a lot. If true, this even implies that software is free to change data/addr when MSI is unmasked, which is sort of counter-intuitive to most people. 
Then I further found the below thread: https://lore.kernel.org/lkml/1468426713-31431-1-git-send-email-marc.zyngier@arm.com/ It identified an MSI-capable device which does latch the message content, forcing the kernel to start up the irq early before enabling the MSI capability. So, no answer and let's see whether Thomas can help identify a better proof. > > Anyway, at least MSI-X if not also MSI could have a !NORESIZE > implementation, which is why this flag exists in vfio. Thanks, > For MSI we can still mitigate the lost interrupt issue by having Qemu allocate all possible MSI vectors at the start when the guest enables the MSI capability and never freeing them before disable. Anyway there are just up to 32 vectors per device, and the total vectors of all MSI devices in a platform should be limited. This won't be a big problem after CPU vector exhaustion is relaxed. p.s. one question to Thomas. As Alex cited above, software must not modify the Address, Data, or Steering Tag fields of an MSI-X entry while it is unmasked. However this rule might be violated today in the below flow:

request_irq()
  __setup_irq()
    irq_startup()
      __irq_startup()
        irq_enable()
          unmask_irq() <<<<<<<<<<<<<
      irq_setup_affinity()
        irq_do_set_affinity()
          msi_set_affinity() // when IR is disabled
            irq_msi_update_msg()
              pci_msi_domain_write_msg() <<<<<<<<<<<<<<

Doesn't the above have the MSI-X entry updated after it's unmasked? I may have overlooked something though, since there are many branches in the middle which may make the above flow invalid... Thanks Kevin ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-25 5:21 ` Tian, Kevin @ 2021-06-25 8:43 ` Thomas Gleixner 2021-06-25 12:42 ` Thomas Gleixner 2021-06-25 21:19 ` Thomas Gleixner 0 siblings, 2 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-25 8:43 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Fri, Jun 25 2021 at 05:21, Kevin Tian wrote: >> From: Alex Williamson <alex.williamson@redhat.com> >> So caching/latching occurs on unmask for MSI-X, but I can't find >> similar statements for MSI. If you have, please note them. It's >> possible MSI is per interrupt. > > I checked PCI Local Bus Specification rev3.0. At that time MSI and > MSI-X were described/compared together in almost every paragraph > in 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph > that you cited is the last one in that section. It's a pity that MSI is > not clarified in this paragraph but it gives me the impression that > MSI function is not permitted to cache address and data values. > Later after MSI and MSI-X descriptions were split into separate > sections in PCIe spec, this impression is definitely weakened a lot. > > If true, this even implies that software is free to change data/addr > when MSI is unmasked, which is sort of counter-intuitive to most > people. Yes, software is free to do that and it has to deal with the consequences. See arch/x86/kernel/apic/msi.c::msi_set_affinity(). > Then I further found below thread: > > https://lore.kernel.org/lkml/1468426713-31431-1-git-send-email-marc.zyngier@arm.com/ > > It identified a device which does latch the message content in a > MSI-capable device, forcing the kernel to startup irq early before > enabling MSI capability. 
> > So, no answer and let's see whether Thomas can help identify > a better proof. As I said to Alex: The MSI specification is and always was blurry and the behaviour in detail is implementation defined. IOW, what might work on device A is not guaranteed to work on device B. > p.s. one question to Thomas. As Alex cited above, software must > not modify the Address, Data, or Steering Tag fields of an MSI-X > entry while it is unmasked. However this rule might be violated > today in below flow: > > request_irq() > __setup_irq() > irq_startup() > __irq_startup() > irq_enable() > unmask_irq() <<<<<<<<<<<<< > irq_setup_affinity() > irq_do_set_affinity() > msi_set_affinity() // when IR is disabled > irq_msi_update_msg() > pci_msi_domain_write_msg() <<<<<<<<<<<<<< > > Isn't above have msi-x entry updated after it's unmasked? Dammit, I could swear that we had masking at the core or PCI level at some point. Let me dig into this. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-25 8:43 ` Thomas Gleixner @ 2021-06-25 12:42 ` Thomas Gleixner 2021-06-25 21:19 ` Thomas Gleixner 1 sibling, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-25 12:42 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Fri, Jun 25 2021 at 10:43, Thomas Gleixner wrote: > On Fri, Jun 25 2021 at 05:21, Kevin Tian wrote: >> p.s. one question to Thomas. As Alex cited above, software must >> not modify the Address, Data, or Steering Tag fields of an MSI-X >> entry while it is unmasked. However this rule might be violated >> today in below flow: >> >> request_irq() >> __setup_irq() >> irq_startup() >> __irq_startup() >> irq_enable() >> unmask_irq() <<<<<<<<<<<<< >> irq_setup_affinity() >> irq_do_set_affinity() >> msi_set_affinity() // when IR is disabled >> irq_msi_update_msg() >> pci_msi_domain_write_msg() <<<<<<<<<<<<<< >> >> Isn't above have msi-x entry updated after it's unmasked? > > Dammit, I could swear that we had masking at the core or PCI level at > some point. Let me dig into this. Indeed, that code path does not check irq_can_move_pcntxt(). It doesn't blow up in our face by chance because of this: __setup_irq() irq_activate() unmask() irq_setup_affinity() irq_activate() assigns a vector based on the affinity mask so irq_setup_affinity() ends up writing the same data again pointlessly. For some stupid reason the ordering of startup/setup_affinity is the way it is for historical reasons. I tried to reorder it at some point but that caused failure on !x86 so I went back to the status quo. All other affinity settings happen with the interrupt masked because we do that from actual interrupt context via irq_move_masked_irq() which does the right thing. 
Let me fix that proper for the startup case. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-25 8:43 ` Thomas Gleixner 2021-06-25 12:42 ` Thomas Gleixner @ 2021-06-25 21:19 ` Thomas Gleixner 1 sibling, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-25 21:19 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Fri, Jun 25 2021 at 10:43, Thomas Gleixner wrote: > On Fri, Jun 25 2021 at 05:21, Kevin Tian wrote: >>> From: Alex Williamson <alex.williamson@redhat.com> >>> So caching/latching occurs on unmask for MSI-X, but I can't find >>> similar statements for MSI. If you have, please note them. It's >>> possible MSI is per interrupt. >> >> I checked PCI Local Bus Specification rev3.0. At that time MSI and >> MSI-X were described/compared together in almost every paragraph >> in 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph >> that you cited is the last one in that section. It's a pity that MSI is >> not clarified in this paragraph but it gives me the impression that >> MSI function is not permitted to cache address and data values. >> Later after MSI and MSI-X descriptions were split into separate >> sections in PCIe spec, this impression is definitely weakened a lot. >> >> If true, this even implies that software is free to change data/addr >> when MSI is unmasked, which is sort of counter-intuitive to most >> people. > > Yes, software is free to do that and it has to deal with the > consequences. See arch/x86/kernel/apic/msi.c::msi_set_affinity(). Well, it's actually forced to do so when the MSI implementation does not provide masking. If masking is available then it should be used also for MSI as it prevents the nasty update problem where the hardware can observe inconsistent state... 
Which x86 does correctly except for that startup issue you spotted, but I'm not at all sure about the rest. The startup issue is halfways trivial to fix I think, but there are other issues with the PCI/MSI-X handling which I discovered while staring at that code for a while. Still working on it. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-24 21:44 ` Alex Williamson 2021-06-25 5:21 ` Tian, Kevin @ 2021-06-25 8:29 ` Thomas Gleixner 1 sibling, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-25 8:29 UTC (permalink / raw) To: Alex Williamson Cc: Tian, Kevin, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas Alex! On Thu, Jun 24 2021 at 15:44, Alex Williamson wrote: > On Thu, 24 Jun 2021 17:14:39 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: > >> After studying the MSI-X specification again, I think there is another >> option to solve this for MSI-X, i.e. the dynamic sizing part: >> >> MSI requires to disable MSI in order to update the number of enabled >> vectors in the control word. > > Exactly what part of the spec requires this? This is generally the > convention I expect too, and there are complications around contiguous > vectors and data field alignment, but I'm not actually able to find a > requirement in the spec that MSI Enable must be 0 when modifying other > writable fields or that writable fields are latched when MSI Enable is > set. There is nothing in the spec which mandates that, but based on experience I know that devices latch the number of vectors field when the enable bit goes from 0 to 1, which makes sense. Devices derive their internal interrupt routing from that. >> which means that the function must reread the table entry when the mask >> bit in the vector control word is cleared. > > What is a "valid" message as far as the device is concerned? "Valid" > is meaningful to system software and hardware, the device doesn't > care. That's correct, it uses whatever is there. > So caching/latching occurs on unmask for MSI-X, but I can't find > similar statements for MSI. If you have, please note them. It's > possible MSI is per interrupt. 
MSI is mostly implementation defined due to the blurry specification. Also the fact that MSI masking is optional does not make it any better. Most devices (even new ones) do not have MSI masking. > Anyway, at least MSI-X if not also MSI could have a !NORESIZE > implementation, which is why this flag exists in vfio. MSI-X yes, with a pretty large surgery. MSI, no way. Contrary to MSI-X you cannot just update the $Nth entry in the table because there is no table. MSI has a base message and derives the $Nth vector message from it by modifying the lower bits of the data word. So without masking, updating the base message for multi-MSI is close to impossible. Look at the dance we have to do in msi_set_affinity(). But even with masking there is still the issue of the 'number of vectors' field and you can't set that to maximum at init time either, because some devices derive from that how interrupts are routed and you surely don't want to change that behaviour while devices are active. Even if that'd be possible, we'd need to allocate the full IRTE space, which would be just another corner case and require extra handling. MSI is a badly specified trainwreck and we already have enough horrible code dealing with it. No need to add more of that which is going to cause more problems than it solves. The sad thing is that despite the fact that the problems of MSI are known for more than a decade, MSI is still widely used in new silicon, most of the time even without masking. > Anyway, at least MSI-X if not also MSI could have a !NORESIZE > implementation, which is why this flag exists in vfio. Right, it's there to be ignored for MSI-X in the current implementation of QEMU and VFIO/PCI. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-24 2:41 ` Tian, Kevin 2021-06-24 15:14 ` Thomas Gleixner @ 2021-06-24 17:03 ` Jacob Pan 1 sibling, 0 replies; 29+ messages in thread From: Jacob Pan @ 2021-06-24 17:03 UTC (permalink / raw) To: Tian, Kevin Cc: Thomas Gleixner, Alex Williamson, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas, jacob.jun.pan Hi Kevin, On Wed, 23 Jun 2021 19:41:24 -0700, "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > 1) Fix the lost interrupt issue in existing MSI virtualization flow; > > > > > > > That _cannot_ be fixed without a hypercall. See my reply to Alex. > > The lost interrupt issue was caused due to resizing based on stale > impression of vector exhaustion. Is it possible to mitigate the lost interrupt by always injecting an IRQ after unmask? Either in VFIO layer, or let QEMU do that after the second VFIO_DEVICE_SET_IRQS in step 4.b of your original email. After all, spurious interrupts should be tolerated and unmasking MSI-x should be rare. I am not suggesting this as an alternative to the real fix, just a stop gap. Thanks, Jacob ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-22 23:59 ` Thomas Gleixner 2021-06-23 6:12 ` Tian, Kevin @ 2021-06-23 15:19 ` Alex Williamson 2021-06-24 0:00 ` Tian, Kevin 2021-06-24 0:43 ` Thomas Gleixner 1 sibling, 2 replies; 29+ messages in thread From: Alex Williamson @ 2021-06-23 15:19 UTC (permalink / raw) To: Thomas Gleixner Cc: Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Rai, Jacob Pan, Dave Jiang, Yi Liu, Baolu Lu, Dan Williams, Tony Luck, Sanjay Kumar, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Wed, 23 Jun 2021 01:59:24 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > Alex, > > [ Resend due to a delivery issue here. Sorry if you got this twice. ] > > On Tue, Jun 22 2021 at 13:12, Alex Williamson wrote: > > On Tue, 22 Jun 2021 10:16:15 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > > > Now the 2nd open requires your help. Below is what I learned from > >> current vfio/qemu code (for vfio-pci device): > >> > >> 0) Qemu doesn't attempt to allocate all irqs as reported by msix-> > >> table_size. It is done in an dynamic and incremental way. > > > > Not by table_size, our expectation is that the device interrupt > > behavior can be implicitly affected by the enabled vectors and the > > table size may support far more vectors than the driver might actually > > use. It's also easier if we never need to get into the scenario of > > pci_alloc_irq_vectors() returning a smaller than requested number of > > vectors and needing to fallback to a vector negotiation that doesn't > > exist via MSI-X. > > > > FWIW, more recent QEMU will scan the vector table for the highest > > non-masked vector to initially enable that number of vectors in the > > host, both to improve restore behavior after migration and avoid > > overhead for guests that write the vector table before setting the > > MSI-X capability enable bit (Windows?). 
> > > >> 1) VFIO provides just one command (VFIO_DEVICE_SET_IRQS) for > >> allocating/enabling irqs given a set of vMSIX vectors [start, count]: > >> a) if irqs not allocated, allocate irqs [start+count]. Enable irqs for > >> specified vectors [start, count] via request_irq(); > >> b) if irqs already allocated, enable irqs for specified vectors; > >> c) if irq already enabled, disable and re-enable irqs for specified > >> vectors because user may specify a different eventfd; > >> > >> 2) When guest enables virtual MSI-X capability, Qemu calls VFIO_ > >> DEVICE_SET_IRQS to enable vector#0, even though it's currently > >> masked by the guest. Interrupts are received by Qemu but blocked > >> from guest via mask/pending bit emulation. The main intention is > >> to enable physical MSI-X; > > > > Yes, this is a bit awkward since the interrupt API is via SET_IRQS and > > we don't allow writes to the MSI-X enable bit via config space. > > > >> 3) When guest unmasks vector#0 via request_irq(), Qemu calls VFIO_ > >> DEVICE_SET_IRQS to enable vector#0 again, with a eventfd different > >> from the one provided in 2); > >> > >> 4) When guest unmasks vector#1, Qemu finds it's outside of allocated > >> vectors (only vector#0 now): > >> > >> a) Qemu first calls VFIO_DEVICE_SET_IRQS to disable and free > >> irq for vector#0; > >> > >> b) Qemu then calls VFIO_DEVICE_SET_IRQS to allocate and enable > >> irqs for both vector#0 and vector#1; > >> > >> 5) When guest unmasks vector#2, same flow in 4) continues. > > That's dangerous and makes weird assumptions about interrupts being > requested early in the driver init() function. But that's wishful > thinking, really. There are enough drivers, especially networking which > request interrupts on device open() and not on device init(). Some > special functions only request the irq way later, i.e. only when someone > uses that special function and at this point the other irqs of that very > same device are already in use. 
I'd consider allocating vectors on open() to be setup time, device initialization. Chances of in-flight interrupts are low. If you have examples of drivers where the irq request comes way later, please let me know, maybe there's some way we can identify that use case. > >> If above understanding is correct, how is lost interrupt avoided between > >> 4.a) and 4.b) given that irq has been torn down for vector#0 in the middle > >> while from guest p.o.v this vector is actually unmasked? There must be > >> a mechanism in place, but I just didn't figure it out... > > > > In practice unmasking new vectors is rare and done only at > > initialization. Risk from lost interrupts at this time is low. > > See above. Wishful thinking. > > OMG, I really don't want to be the one to debug _WHY_ a device lost > interrupts just because it did a late request_irq() when the device is > operational already and has other interrupts active. > > > When masking and unmasking vectors that are already in use, we're only > > changing the signaling eventfd between KVM and QEMU such that QEMU can > > set emulated pending bits in response to interrupts (and our lack of > > interfaces to handle the mask/unmask at the host). I believe that > > locking in the vfio-pci driver prevents an interrupt from being lost > > during the eventfd switch. > > Let's look at this from a driver perspective: > > 1) Driver checks how many entries are possible in the MSI-X table > > 2) Driver invokes pci_msix_enable[_range]() which returns the number > of vectors the system is willing to hand out to the driver. > > 3) Driver assigns the vectors to the different resources in the > hardware > > 4) Later on these interrupts are requested, but not necessarily during > device init. > > Yes, request_irq() can fail and today it can fail also due to CPU > vector exhaustion. That's perfectly fine as the driver can handle > the fail and act accordingly. > > All of this is consistent and well defined. 
We can't know in the hypervisor how many vectors the guest driver asked for, we only know that it's somewhere between one and the number supported by the vector table. We'd contribute to vector exhaustion if we always assume the max. > Now let's look at the guest. This VFIO mechanism introduces two brand new > failure modes because of this: > > guest::unmask() > trapped_by_host() > free_irqs(); > pci_free_irq_vectors(); > pci_disable_msix(); > pci_alloc_irq_vectors(); > pci_enable_msix(); > request_irqs(); > > #1 What happens if the host side allocation or the host side request_irq() > fails? > > a) Guest is killed? > b) Failure is silently ignored and guest just does not receive > interrupts? > c) Any other mechanism? QEMU generates a log on failure. At vfio-pci we're emulating the MSI-X capability and vector table, there is no mechanism in the PCI-spec-defined protocol by which manipulating a vector table entry can fail, but clearly it can fail on the host. That's just the world we live in, either we overestimate vector usage on the host or we do the best we can to infer how the device is actually being used. Within the vfio API we have a mechanism to describe that the above behavior is required, i.e. that we can't dynamically resize the number of MSI-X vectors. If we had kernel interfaces to dynamically change our vector allocation after pci_alloc_irq_vectors() we could remove that flag and QEMU wouldn't need to go through this process. > > Whatever it is, it simply _cannot_ make request_irq() in the > guest fail because the guest has already passed all failure > points and is in the low level function of unmasking the > interrupt for the first time after which it will return 0, aka > success. > > So if you answer #a, fine with me. It's well defined. As far as the guest is concerned, #b. As you say, we're at the point in the guest where the guest interrupt code is interacting with MSI-X directly and there is no way it can fail on hardware. 
It's not entirely silent though, the hypervisor logs an error. Whether "the device I hot-added to my VM doesn't work" is a worse failure mode than "I hot-added a device to my VM and it crashed" is a matter of perspective. > #2 What happens to already active interrupts on that device which might > fire during that time? > > They get lost or are redirected to the legacy PCI interrupt and > there is absolutely nothing you can do about that. > > Simply because to prevent that the host side would have to > disable the interrupt source at the device level, i.e. fiddle > with actual device registers to shut up interrupt delivery and > reenable it afterwards again and thereby racing against some > other VCPU of the same guest which fiddles with those very same > registers. > > IOW, undefined behaviour, which the "VFIO design" shrugged off on > completely unjustified assumptions. The vfio design signals to userspace that this behavior is required via a flag on the IRQ info structure. Should host interfaces become available that allow us to re-size the number of vectors for a device, we can remove the flag and userspace would no longer need this disable/re-enable process. > No matter how much you argue about this being unlikely, this is just > broken. Unlikely simply does not exist at cloud scale. > > Aside of that. How did you come to the conclusion that #2 does not > matter? By analyzing _every_ open and closed source driver for their > usage patterns and documenting that all drivers which want to work in > VFIO-PCI space have to follow specific rules vs. interrupt setup and > usage? I'm pretty sure that you have a mechanism in place which > monitors closely whether a driver violates those well documented rules. > > Yes, I know that I'm dreaming and the reality is that this is based on > interesting assumptions and just works by chance. > > I have no idea _why_ this has been done that way. 
The changelogs of the > relevant commits are void of useful content and lack links to the > possibly existing discussions about this. > > I only can assume that back then the main problem was vector exhaustion > on the host and to avoid allocating memory for interrupt descriptors > etc, right? Essentially yes, and it's largely userspace policy, i.e. QEMU. QEMU owns the device, it has a right to allocate an entire vector table worth of interrupts if it wants. > The host vector exhaustion problem was that each MSIX vector consumed a > real CPU vector which is a limited resource on X86. This is no longer > the case today: > > 1) pci_msix_enable[range]() consumes exactly zero CPU vectors from > the allocatable range independent of the number of MSIX vectors > it allocates, unless it is in multi-queue managed mode where it > will actually reserve a vector (maybe two) per CPU. > > But for devices which are not using that mode, they just > opportunistically "reserve" vectors. > > All entries are initialized with a special system vector which > when raised will emit a nastigram in dmesg. > > 2) request_irq() actually allocates a CPU vector from the > allocatable vector space which can obviously still fail, which is > perfectly fine. request_irq() is where the vector table entry actually gets unmasked though, that act of unmasking the vector table entry in hardware cannot fail, but virtualization of that act is backed by a request_irq() in the host, which can fail. > So the only downside today of allocating more MSI-X vectors than > necessary is memory consumption for the irq descriptors. As above, this is a QEMU policy of essentially trying to be a good citizen and allocate only what we can infer the guest is using. What's a good way for QEMU, or any userspace, to know it's running on a host where vector exhaustion is not an issue? 
> Though for virtualization there is still another problem: > > Even if all possible MSI-X vectors for a passthrough PCI device would > be allocated upfront independent of the actual usage in the guest, > then there is still the issue of request_irq() failing on the host > once the guest decides to use any of those interrupts. Yup. > It's a halfways reasonable argumentation by some definition of > reasonable, that this case would be a host system configuration problem > and the admin who overcommitted is responsible for the consequence. > > Where the only reasonable consequence is to kill the guest right there > because there is no mechanism to actually tell it that the host ran out > of resources. People tend not to like their VM getting killed and potentially losing data, so kill-the-guest vs device-doesn't-work is still debatable imo. > Not at all a pretty solution, but it is contrary to the status quo well > defined. The most important aspect is that it is well defined for the > case of success: > > If it succeeds then there is no way that already requested interrupts > can be lost or end up being redirected to the legacy PCI irq due to > clearing the MSIX enable bit, which is a gazillion times better than > the "let's hope it works" based tinkerware we have now. > > So, aside of the existing VFIO/PCI/MSIX thing being just half thought > out, even thinking about proliferating this with IMS is bonkers. > > IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X > in order to expand the number of vectors. The use cases are going to be > even more dynamic than the usual device drivers, so the lost interrupt > issue will be much more likely to trigger. > > So no, we are not going to proliferate this complete ignorance of how > MSI-X actually works and just cram another "feature" into code which is > known to be incorrect. 
Some of the issues of virtualizing MSI-X are unsolvable without creating a new paravirtual interface, but obviously we want to work with existing drivers and unmodified guests, so that's not an option. To work with what we've got, the vfio API describes the limitation of the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then makes a choice in an attempt to better reflect what we can infer of the guest programming of the device to incrementally enable vectors. We could a) work to provide host kernel interfaces that allow us to remove that noresize flag and b) decide whether QEMU's usage policy can be improved on kernels where vector exhaustion is no longer an issue. Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-23 15:19 ` Alex Williamson @ 2021-06-24 0:00 ` Tian, Kevin 2021-06-24 1:36 ` Thomas Gleixner ` (2 more replies) 2021-06-24 0:43 ` Thomas Gleixner 1 sibling, 3 replies; 29+ messages in thread From: Tian, Kevin @ 2021-06-24 0:00 UTC (permalink / raw) To: Alex Williamson, Thomas Gleixner Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas > From: Alex Williamson <alex.williamson@redhat.com> > Sent: Wednesday, June 23, 2021 11:20 PM > [...] > > So the only downside today of allocating more MSI-X vectors than > > necessary is memory consumption for the irq descriptors. > > As above, this is a QEMU policy of essentially trying to be a good > citizen and allocate only what we can infer the guest is using. What's > a good way for QEMU, or any userspace, to know it's running on a host > where vector exhaustion is not an issue? In my proposal a new command (VFIO_DEVICE_ALLOC_IRQS) is introduced to separate allocation from enabling. The availability of this command could be the indicator whether vector exhaustion is not an issue now? > > > > So no, we are not going to proliferate this complete ignorance of how > > MSI-X actually works and just cram another "feature" into code which is > > known to be incorrect. > > Some of the issues of virtualizing MSI-X are unsolvable without > creating a new paravirtual interface, but obviously we want to work > with existing drivers and unmodified guests, so that's not an option. > > To work with what we've got, the vfio API describes the limitation of > the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then > makes a choice in an attempt to better reflect what we can infer of the > guest programming of the device to incrementally enable vectors. 
We It's a surprise to me that, after searching its code, QEMU doesn't even look at this flag today... > could a) work to provide host kernel interfaces that allow us to remove > that noresize flag and b) decide whether QEMU's usage policy can be > improved on kernels where vector exhaustion is no longer an issue. Thomas can help confirm, but it looks like the noresize limitation is still there. b) makes more sense since Thomas thinks vector exhaustion is not an issue now (except one minor open about the IRTE). Thanks Kevin ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-24 0:00 ` Tian, Kevin @ 2021-06-24 1:36 ` Thomas Gleixner 2021-06-24 2:20 ` Thomas Gleixner 2021-06-24 17:52 ` Virtualizing MSI-X on IMS via VFIO Alex Williamson 2 siblings, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-24 1:36 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas Kevin, On Thu, Jun 24 2021 at 00:00, Kevin Tian wrote: >> From: Alex Williamson <alex.williamson@redhat.com> >> Sent: Wednesday, June 23, 2021 11:20 PM >> > [...] > > > So the only downside today of allocating more MSI-X vectors than >> > necessary is memory consumption for the irq descriptors. >> >> As above, this is a QEMU policy of essentially trying to be a good >> citizen and allocate only what we can infer the guest is using. What's >> a good way for QEMU, or any userspace, to know it's running on a host >> where vector exhaustion is not an issue? > > In my proposal a new command (VFIO_DEVICE_ALLOC_IRQS) is > introduced to separate allocation from enabling. The availability > of this command could be the indicator whether vector > exhaustion is not an issue now? Your proposal still does not address the fundamental issue of a missing feedback to the guest and you can invent a gazillion more IOCTL commands and none of them will solve that issue. A hypercall/paravirt interface is the only reasonable solution. The time you are wasting to come up with non-solutions would have surely been better spent implementing the already known and obvious proper solution. You might be halfways done already with that. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: Virtualizing MSI-X on IMS via VFIO 2021-06-24 0:00 ` Tian, Kevin 2021-06-24 1:36 ` Thomas Gleixner @ 2021-06-24 2:20 ` Thomas Gleixner 2021-06-24 2:48 ` Alex Williamson 2021-06-24 17:52 ` Virtualizing MSI-X on IMS via VFIO Alex Williamson 2 siblings, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2021-06-24 2:20 UTC (permalink / raw) To: Tian, Kevin, Alex Williamson Cc: Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas Kevin, thank you very much for digging into this! You made my day! On Thu, Jun 24 2021 at 00:00, Kevin Tian wrote: >> From: Alex Williamson <alex.williamson@redhat.com> >> To work with what we've got, the vfio API describes the limitation of >> the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then >> makes a choice in an attempt to better reflect what we can infer of the >> guest programming of the device to incrementally enable vectors. We > > It's a surprise to me that QEMU doesn't even look at this flag today, > after searching its code... Indeed. git clone https://github.com/qemu/qemu.git cd qemu git log -p | grep NORESIZE + * The NORESIZE flag indicates that the interrupt lines within the index +#define VFIO_IRQ_INFO_NORESIZE (1 << 3) According to the git history of QEMU this was never used at all and I don't care about the magic muck which might be in some RHT repository which might make use of that. Find below the proper fix for this nonsense which just wasted everyone's time. I'll post it officially with a proper changelog tomorrow unless Kevin beats me to it who actually unearthed this and surely earns the credit. Alex, I seriously have to ask what you were trying to tell us about this flag and its great value and the design related to this. 
And once you are back from lala land, can you please explain how VFIO/PCI/MSIX is supposed to work in reality? Thanks, tglx --- --- a/drivers/gpu/drm/i915/gvt/kvmgt.c +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c @@ -1644,8 +1644,6 @@ static long intel_vgpu_ioctl(struct mdev if (info.index == VFIO_PCI_INTX_IRQ_INDEX) info.flags |= (VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED); - else - info.flags |= VFIO_IRQ_INFO_NORESIZE; return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0; --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -1018,8 +1018,6 @@ static long vfio_pci_ioctl(struct vfio_d if (info.index == VFIO_PCI_INTX_IRQ_INDEX) info.flags |= (VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED); - else - info.flags |= VFIO_IRQ_INFO_NORESIZE; return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0; --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -693,16 +693,6 @@ struct vfio_region_info_cap_nvlink2_lnks * automatically masked by VFIO and the user needs to unmask the line * to receive new interrupts. This is primarily intended to distinguish * level triggered interrupts. - * - * The NORESIZE flag indicates that the interrupt lines within the index - * are setup as a set and new subindexes cannot be enabled without first - * disabling the entire index. This is used for interrupts like PCI MSI - * and MSI-X where the driver may only use a subset of the available - * indexes, but VFIO needs to enable a specific number of vectors - * upfront. In the case of MSI-X, where the user can enable MSI-X and - * then add and unmask vectors, it's up to userspace to make the decision - * whether to allocate the maximum supported number of vectors or tear - * down setup and incrementally increase the vectors as each is enabled. 
*/ struct vfio_irq_info { __u32 argsz; @@ -710,7 +700,6 @@ struct vfio_irq_info { #define VFIO_IRQ_INFO_EVENTFD (1 << 0) #define VFIO_IRQ_INFO_MASKABLE (1 << 1) #define VFIO_IRQ_INFO_AUTOMASKED (1 << 2) -#define VFIO_IRQ_INFO_NORESIZE (1 << 3) __u32 index; /* IRQ index */ __u32 count; /* Number of IRQs within this index */ }; --- a/samples/vfio-mdev/mtty.c +++ b/samples/vfio-mdev/mtty.c @@ -1092,9 +1092,6 @@ static int mtty_get_irq_info(struct mdev if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX) irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE | VFIO_IRQ_INFO_AUTOMASKED); - else - irq_info->flags |= VFIO_IRQ_INFO_NORESIZE; - return 0; } ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-24 2:20 ` Thomas Gleixner @ 2021-06-24 2:48 ` Alex Williamson 2021-06-24 12:06 ` [PATCH] vfio/pci: Document the MSI[X] resize side effects properly Thomas Gleixner 0 siblings, 1 reply; 29+ messages in thread From: Alex Williamson @ 2021-06-24 2:48 UTC (permalink / raw) To: Thomas Gleixner Cc: Tian, Kevin, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Thu, 24 Jun 2021 04:20:31 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > Kevin, > > thank you very much for digging into this! You made my day! > > On Thu, Jun 24 2021 at 00:00, Kevin Tian wrote: > >> From: Alex Williamson <alex.williamson@redhat.com> > >> To work with what we've got, the vfio API describes the limitation of > >> the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then > >> makes a choice in an attempt to better reflect what we can infer of the > >> guest programming of the device to incrementally enable vectors. We > > > > It's a surprise to me that QEMU doesn't even look at this flag today, > > after searching its code... > > Indeed. > > git clone https://github.com/qemu/qemu.git > cd qemu > git log -p | grep NORESIZE > + * The NORESIZE flag indicates that the interrupt lines within the index > +#define VFIO_IRQ_INFO_NORESIZE (1 << 3) > > According to the git history of QEMU this was never used at all and I > don't care about the magic muck which might be in some RHT repository > which might make use of that. > > Find below the proper fix for this nonsense which just wasted everyone's > time. I'll post it officially with a proper changelog tomorrow unless > Kevin beats me to it who actually unearthed this and surely earns the > credit. > > Alex, I seriously have to ask what you were trying to tell us about this > flag and its great value and the design related to this. 
> > I'm sure you can submit the corresponding fix to qemu yourself. > > And once you are back from lala land, can you please explain how > VFIO/PCI/MSIX is supposed to work in reality? It's part of the spec, there's never been a case of !NORESIZE, assuming NORESIZE is the safe behavior. Sorry, there's no smoking gun here, NAK > --- > --- a/drivers/gpu/drm/i915/gvt/kvmgt.c > +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c > @@ -1644,8 +1644,6 @@ static long intel_vgpu_ioctl(struct mdev > if (info.index == VFIO_PCI_INTX_IRQ_INDEX) > info.flags |= (VFIO_IRQ_INFO_MASKABLE | > VFIO_IRQ_INFO_AUTOMASKED); > - else > - info.flags |= VFIO_IRQ_INFO_NORESIZE; > > return copy_to_user((void __user *)arg, &info, minsz) ? > -EFAULT : 0; > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -1018,8 +1018,6 @@ static long vfio_pci_ioctl(struct vfio_d > if (info.index == VFIO_PCI_INTX_IRQ_INDEX) > info.flags |= (VFIO_IRQ_INFO_MASKABLE | > VFIO_IRQ_INFO_AUTOMASKED); > - else > - info.flags |= VFIO_IRQ_INFO_NORESIZE; > > return copy_to_user((void __user *)arg, &info, minsz) ? > -EFAULT : 0; > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -693,16 +693,6 @@ struct vfio_region_info_cap_nvlink2_lnks > * automatically masked by VFIO and the user needs to unmask the line > * to receive new interrupts. This is primarily intended to distinguish > * level triggered interrupts. > - * > - * The NORESIZE flag indicates that the interrupt lines within the index > - * are setup as a set and new subindexes cannot be enabled without first > - * disabling the entire index. This is used for interrupts like PCI MSI > - * and MSI-X where the driver may only use a subset of the available > - * indexes, but VFIO needs to enable a specific number of vectors > - * upfront. 
In the case of MSI-X, where the user can enable MSI-X and > - * then add and unmask vectors, it's up to userspace to make the decision > - * whether to allocate the maximum supported number of vectors or tear > - * down setup and incrementally increase the vectors as each is enabled. > */ > struct vfio_irq_info { > __u32 argsz; > @@ -710,7 +700,6 @@ struct vfio_irq_info { > #define VFIO_IRQ_INFO_EVENTFD (1 << 0) > #define VFIO_IRQ_INFO_MASKABLE (1 << 1) > #define VFIO_IRQ_INFO_AUTOMASKED (1 << 2) > -#define VFIO_IRQ_INFO_NORESIZE (1 << 3) > __u32 index; /* IRQ index */ > __u32 count; /* Number of IRQs within this index */ > }; > --- a/samples/vfio-mdev/mtty.c > +++ b/samples/vfio-mdev/mtty.c > @@ -1092,9 +1092,6 @@ static int mtty_get_irq_info(struct mdev > if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX) > irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE | > VFIO_IRQ_INFO_AUTOMASKED); > - else > - irq_info->flags |= VFIO_IRQ_INFO_NORESIZE; > - > return 0; > } > > ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH] vfio/pci: Document the MSI[X] resize side effects properly 2021-06-24 2:48 ` Alex Williamson @ 2021-06-24 12:06 ` Thomas Gleixner 2021-06-24 22:22 ` Alex Williamson 0 siblings, 1 reply; 29+ messages in thread From: Thomas Gleixner @ 2021-06-24 12:06 UTC (permalink / raw) To: Alex Williamson Cc: Tian, Kevin, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas The documentation of VFIO_IRQ_INFO_NORESIZE is inaccurate as it suggests that it is safe to dynamically add new MSI-X vectors even when previously allocated vectors are already in use and enabled. Enabling additional vectors is possible according to the MSI-X specification, but the kernel does not have any mechanisms today to do that safely. The only available mechanism is to teardown the already active vectors and to setup the full vector set afterwards. This requires temporarily disabling MSI-X, which redirects any interrupt raised by the device during this time to the legacy PCI/INTX which is not handled and the interrupt is therefore lost. Update the documentation of VFIO_IRQ_INFO_NORESIZE accordingly. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- include/uapi/linux/vfio.h | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -699,10 +699,19 @@ struct vfio_region_info_cap_nvlink2_lnks * disabling the entire index. This is used for interrupts like PCI MSI * and MSI-X where the driver may only use a subset of the available * indexes, but VFIO needs to enable a specific number of vectors - * upfront. 
In the case of MSI-X, where the user can enable MSI-X and - * then add and unmask vectors, it's up to userspace to make the decision - * whether to allocate the maximum supported number of vectors or tear - * down setup and incrementally increase the vectors as each is enabled. + * upfront. + * + * MSI cannot be resized safely when interrupts are in use already because + * resizing requires temporary disablement of MSI for updating the relevant + * PCI config space entries. Disabling MSI redirects an interrupt raised by + * the device during this time to the unhandled legacy PCI/INTX, which + * means the interrupt is lost. + * + * Enabling additional vectors for MSI-X is possible at least from the + * perspective of the MSI-X specification, but not supported by the + * existing PCI/MSI-X mechanisms in the kernel. The kernel provides + * currently only a full teardown/setup cycle which requires disabling + * MSI-X temporarily with the same side effects as for MSI. */ struct vfio_irq_info { __u32 argsz; ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] vfio/pci: Document the MSI[X] resize side effects properly 2021-06-24 12:06 ` [PATCH] vfio/pci: Document the MSI[X] resize side effects properly Thomas Gleixner @ 2021-06-24 22:22 ` Alex Williamson 0 siblings, 0 replies; 29+ messages in thread From: Alex Williamson @ 2021-06-24 22:22 UTC (permalink / raw) To: Thomas Gleixner Cc: Tian, Kevin, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Thu, 24 Jun 2021 14:06:09 +0200 Thomas Gleixner <tglx@linutronix.de> wrote: > The documentation of VFIO_IRQ_INFO_NORESIZE is inaccurate as it suggests > that it is safe to dynamically add new MSI-X vectors even when > previously allocated vectors are already in use and enabled. > > Enabling additional vectors is possible according to the MSI-X specification, > but the kernel does not have any mechanisms today to do that safely. > > The only available mechanism is to teardown the already active vectors > and to setup the full vector set afterwards. > > This requires temporarily disabling MSI-X, which redirects any interrupt > raised by the device during this time to the legacy PCI/INTX which is > not handled and the interrupt is therefore lost. > > Update the documentation of VFIO_IRQ_INFO_NORESIZE accordingly. > > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > --- > include/uapi/linux/vfio.h | 17 +++++++++++++---- > 1 file changed, 13 insertions(+), 4 deletions(-) > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -699,10 +699,19 @@ struct vfio_region_info_cap_nvlink2_lnks > * disabling the entire index. This is used for interrupts like PCI MSI > * and MSI-X where the driver may only use a subset of the available > * indexes, but VFIO needs to enable a specific number of vectors > - * upfront. 
In the case of MSI-X, where the user can enable MSI-X and > - * then add and unmask vectors, it's up to userspace to make the decision > - * whether to allocate the maximum supported number of vectors or tear > - * down setup and incrementally increase the vectors as each is enabled. > + * upfront. > + * > + * MSI cannot be resized safely when interrupts are in use already because > + * resizing requires temporary disablement of MSI for updating the relevant > + * PCI config space entries. Disabling MSI redirects an interrupt raised by > + * the device during this time to the unhandled legacy PCI/INTX, which > + * means the interrupt is lost. > + * > + * Enabling additional vectors for MSI-X is possible at least from the > + * perspective of the MSI-X specification, but not supported by the > + * existing PCI/MSI-X mechanisms in the kernel. The kernel provides > + * currently only a full teardown/setup cycle which requires disabling > + * MSI-X temporarily with the same side effects as for MSI. > */ > struct vfio_irq_info { > __u32 argsz; > There's good information here, but as per my other reply I think NORESIZE might be only a host implementation issue for both MSI and MSI/X. I'd also rather not focus on that existing implementation in this header, which is essentially the uAPI spec, because that implementation can change and we're unlikely to remember to update the description here. We might even be describing a device that emulates MSI/X in some way that it's not bound by this limitation. For example maybe Intel's emulation of MSI-X backed by IMS wouldn't need this flag and we could update QEMU to finally have a branch that avoids the teardown/setup. We have a flag to indicate this behavior, consequences should be relative to the presence of that flag. Finally a nit, I don't really see a strong case that the existing text is actually inaccurate or implying some safety against lost interrupts. 
It's actually making note of the issue here already, though the more explicit description is welcome. Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-24 0:00 ` Tian, Kevin 2021-06-24 1:36 ` Thomas Gleixner 2021-06-24 2:20 ` Thomas Gleixner @ 2021-06-24 17:52 ` Alex Williamson 2 siblings, 0 replies; 29+ messages in thread From: Alex Williamson @ 2021-06-24 17:52 UTC (permalink / raw) To: Tian, Kevin Cc: Thomas Gleixner, Jason Gunthorpe, Dey, Megha, Raj, Ashok, Pan, Jacob jun, Jiang, Dave, Liu, Yi L, Lu, Baolu, Williams, Dan J, Luck, Tony, Kumar, Sanjay K, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas On Thu, 24 Jun 2021 00:00:37 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote: > > From: Alex Williamson <alex.williamson@redhat.com> > > Sent: Wednesday, June 23, 2021 11:20 PM > > > [...] > > > So the only downside today of allocating more MSI-X vectors than > > > necessary is memory consumption for the irq descriptors. > > > > As above, this is a QEMU policy of essentially trying to be a good > > citizen and allocate only what we can infer the guest is using. What's > > a good way for QEMU, or any userspace, to know it's running on a host > > where vector exhaustion is not an issue? > > In my proposal a new command (VFIO_DEVICE_ALLOC_IRQS) is > introduced to separate allocation from enabling. The availability > of this command could be the indicator whether vector > exhaustion is not an issue now? We have options with existing interfaces if we want to provide some programmatic means through vfio to hint to userspace about vector usage. Otherwise I don't see much justification for this new ioctl, it can largely be done with SET_IRQS, or certainly with extensions of flags. > > > So no, we are not going to proliferate this complete ignorance of how > > > MSI-X actually works and just cram another "feature" into code which is > > > known to be incorrect. 
> > > > Some of the issues of virtualizing MSI-X are unsolvable without > > creating a new paravirtual interface, but obviously we want to work > > with existing drivers and unmodified guests, so that's not an option. > > > > To work with what we've got, the vfio API describes the limitation of > > the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then > > makes a choice in an attempt to better reflect what we can infer of the > > guest programming of the device to incrementally enable vectors. We > > It's a surprise to me that QEMU doesn't even look at this flag today, after > searching its code... There are no examples of the alternative; it would be dead, untested code. The flag exists in the uAPI to indicate a limitation of the underlying implementation that has always existed. Should we remove that limitation, as Thomas now sees as possible, then QEMU wouldn't need to make a choice whether to fully allocate the vector table or incrementally tear-down and re-init. > > could a) work to provide host kernel interfaces that allow us to remove > > that noresize flag and b) decide whether QEMU's usage policy can be > > improved on kernels where vector exhaustion is no longer an issue. > > Thomas can help confirm, but it looks like the noresize limitation is still there. > b) makes more sense since Thomas thinks vector exhaustion is not > an issue now (except one minor open about irte). As noted elsewhere, a) is indeed a limitation of the host interfaces, not implicit to MSI-X. Obviously we can look at different QEMU policies, including generating hardware faults to the VM on exhaustion or unmask failures, interrupt injection or better inferring potential vector usage. Thanks, Alex ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Virtualizing MSI-X on IMS via VFIO 2021-06-23 15:19 ` Alex Williamson 2021-06-24 0:00 ` Tian, Kevin @ 2021-06-24 0:43 ` Thomas Gleixner 1 sibling, 0 replies; 29+ messages in thread From: Thomas Gleixner @ 2021-06-24 0:43 UTC (permalink / raw) To: Alex Williamson Cc: Kevin Tian, Jason Gunthorpe, Megha Dey, Ashok Raj, Jacob Pan, Dave Jiang, Yi Liu, Baolu Lu, Dan Williams, Tony Luck, Sanjay Kumar, LKML, KVM, Kirti Wankhede, Peter Zijlstra, Marc Zyngier, Bjorn Helgaas Alex! On Wed, Jun 23 2021 at 09:19, Alex Williamson wrote: > On Wed, 23 Jun 2021 01:59:24 +0200 > Thomas Gleixner <tglx@linutronix.de> wrote: >> That's dangerous and makes weird assumptions about interrupts being >> requested early in the driver init() function. But that's wishful >> thinking, really. There are enough drivers, especially networking, which >> request interrupts on device open() and not on device init(). Some >> special functions only request the irq way later, i.e. only when someone >> uses that special function and at this point the other irqs of that very >> same device are already in use. > > I'd consider allocating vectors on open() to be setup time, device > initialization. Chances of inflight interrupts are low. If you have It does not matter whether chances are low or not. Again: At datacenter scale 'unlikely' does not exist except in wishful thinking. It does not matter whether it's unlikely or not. What matters is that this problem exists. If only one of one million resize operations results in a lost interrupt and subsequently in a stale device, then it is exactly _one_ fail too much, especially for cases where it could have been avoided. If you would have argued that you can't do anything about it for unmodified, i.e. unaware guests, but otherwise acknowledged that this _is_ a problem which could and should have been fixed long ago, then we can just move on and have a conversation about technical ways to fix it, but handwaving it away is definitely _not_ an option. 
It's not even documented anywhere. It's just implemented that way to prove the theorem that virtualization creates more problems than it solves. > examples of drivers where the irq request comes way later please let > me know, maybe there's some way we can identify that use case. IOW, please tell me where I need to add more heuristics in order to "fix" the symptom instead of the root cause? Here we go nevertheless: 1) Multi-queue devices with managed interrupts. A queue and its interrupt(s) are not started up when the associated CPU(s) is/are not online or not present. But a later hotplug brings them up, and that's the point where such interrupts are unmasked for the first time. All the other queues are active already at that point. Now you just have to ensure that there is load on the already online queues, and the chances of getting a stale queue or running into a hopefully recoverable timeout are no longer low. That fail is pretty much guaranteed. I'm sure Joe Admin will appreciate all the helpful tools to debug that and the extensive documentation about this kind of failure. Surely you will tell me that these kinds of devices are already covered by weird heuristics in the hypervisor which allocate all vectors upfront based on PCI ID or whatever, but I still have to tell you that this is beyond wrong and could have been avoided long ago, at least for the not so uncommon case of an up-to-date hypervisor and up-to-date host and guest kernels. 2) Some drivers request interrupts with IRQF_NO_AUTOEN (or via the older mechanism of marking the interrupt as NOAUTOEN before request_irq()), which does not start it up. It's fully initialized, but it gets unmasked when later on the driver invokes enable_irq(). And for some drivers that can be very late. 3) Multi-function devices where a subfunction gets initialized only when needed. 
4) We have at least one case here with a custom FPGA driver where a certain power-hungry IP block is only initialized when the program loaded by the end-user uses a particular library which enables it. At that point the interrupt gets requested and unmasked, but there is already other functionality enabled in the FPGA with heavy interrupt load, so ripping MSI-X temporarily off for resizing would be a guarantee for losing interrupts in hard to debug ways. That stuff runs today only on bare metal fortunately, but there are plans to have it exposed to guests... Good thing we talked about this well documented VFIO 'feature' now, so we can spare us the head scratching later. There are certainly others, but that's from the top of my head. >> Let's look at this from a driver perspective: >> >> 1) Driver checks how many entries are possible in the MSI-X table >> >> 2) Driver invokes pci_enable_msix[_range]() which returns the number >> of vectors the system is willing to hand out to the driver. >> >> 3) Driver assigns the vectors to the different resources in the >> hardware >> >> 4) Later on these interrupts are requested, but not necessarily during >> device init. >> >> Yes, request_irq() can fail and today it can fail also due to CPU >> vector exhaustion. That's perfectly fine as the driver can handle >> the fail and act accordingly. >> >> All of this is consistent and well defined. > We can't know in the hypervisor how many vectors the guest driver asked > for; we only know that it's somewhere between one and the number > supported by the vector table. We'd contribute to vector exhaustion if > we always assume the max. I understand that, but that does not make it more consistent. >> Now let's look at the guest. 
This VFIO mechanism introduces two brand new >> failure modes because of this: >> >> guest::unmask() >> trapped_by_host() >> free_irqs(); >> pci_free_irq_vectors(); >> pci_disable_msix(); >> pci_alloc_irq_vectors(); >> pci_enable_msix(); >> request_irqs(); >> >> #1 What happens if the host side allocation or the host side request_irq() >> fails? >> >> a) Guest is killed? >> b) Failure is silently ignored and guest just does not receive >> interrupts? >> c) Any other mechanism? > QEMU generates a log on failure. At vfio-pci we're emulating the MSI-X > capability and vector table; there is no mechanism in the PCI spec > defined protocol by which manipulating a vector table entry can fail, but > clearly it can fail on the host. That's just the world we live in: > either we overestimate vector usage on the host or we do the best we > can to infer how the device is actually being used. > > Within the vfio API we have a mechanism to describe that the above > behavior is required, i.e. that we can't dynamically resize the number > of MSI-X vectors. If we had kernel interfaces to dynamically change > our vector allocation after pci_alloc_irq_vectors() we could remove > that flag and QEMU wouldn't need to go through this process. > >> >> Whatever it is, it simply _cannot_ make request_irq() in the >> guest fail because the guest has already passed all failure >> points and is in the low level function of unmasking the >> interrupt for the first time, after which it will return 0, aka >> success. >> >> So if you answer #a, fine with me. It's well defined. > As far as the guest is concerned, #b. As you say, we're at the point > in the guest where the guest interrupt code is interacting with MSI-X > directly and there is no way it can fail on hardware. It's not > entirely silent though, the hypervisor logs an error. 
Whether you > think "the device I hot-added to my VM doesn't work" is worse than "I > hot-added a device to my VM and it crashed" is a better failure mode is > a matter of perspective. Yes, it's a matter of perspective, but crashing the thing is at least well defined. Silently breaking the device is an interesting choice and completely undefined, whether you log it or not. Even if it does not break because the host side request_irq() succeeds, risking that it can lose interrupts is daft, and if that happens then it still malfunctions and this is _not_ going to be logged in any hypervisor log. It's going to be a nightmare to debug. >> #2 What happens to already active interrupts on that device which might >> fire during that time? >> >> They get lost or are redirected to the legacy PCI interrupt and >> there is absolutely nothing you can do about that. >> >> Simply because to prevent that the host side would have to >> disable the interrupt source at the device level, i.e. fiddle >> with actual device registers to shut up interrupt delivery and >> reenable it afterwards again, and thereby racing against some >> other VCPU of the same guest which fiddles with those very same >> registers. >> >> IOW, undefined behaviour, which the "VFIO design" shrugged off on >> completely unjustified assumptions. > The vfio design signals to userspace that this behavior is required via > a flag on the IRQ info structure. Should host interfaces become > available that allow us to re-size the number of vectors for a device, > we can remove the flag and userspace would no longer need this > disable/re-enable process. You cannot resize MSI-X without clearing the MSI-X enable bit. And that's not going to change anytime soon. It's how MSI-X is specified. So what are you trying to tell me? That your design is great because it allows removing a flag which cannot be removed for the particular class of devices utilizing true MSI-X which is affected by this problem? 
I'm amazed how you emphasize design all over the place. I'm truly impressed that this very design does not even document that the re-sizing business can cause losing already active interrupts on the same device. >> 2) request_irq() actually allocates a CPU vector from the >> allocatable vector space which can obviously still fail, which is >> perfectly fine. > request_irq() is where the vector table entry actually gets unmasked > though; that act of unmasking the vector table entry in hardware cannot > fail, but virtualization of that act is backed by a request_irq() in > the host, which can fail. Again: request_irq() does not necessarily unmask. Just because for a large part of the drivers the first unmask happens from request_irq() does not make it a universal law. The unmask can be completely detached from request_irq(). What this mechanism observes is the first unmask() operation on a particular MSI-X vector, nothing else. You _cannot_ make assumptions about when and where that unmask happens, especially not with closed source guests, but your assumptions do not even hold with open source guests. Can we pretty please use the proper and precise terminology and not some blurb based on random assumptions about how things might work or might have worked 20 years ago? >> So the only downside today of allocating more MSI-X vectors than >> necessary is memory consumption for the irq descriptors. > As above, this is a QEMU policy of essentially trying to be a good > citizen and allocate only what we can infer the guest is using. What's > a good way for QEMU, or any userspace, to know it's running on a host > where vector exhaustion is not an issue? If the kernel is halfway recent (4.15+) then it uses reservation mode on x86: - It does not consume CPU vectors for not requested interrupts (except for multi-queue devices with managed interrupts). - It consumes an IRTE entry per vector, which might or might not be an issue. 
There are 64k entries per IOMMU on Intel and a sizeable number of entries per PCI function on AMD. - It obviously consumes host memory. Of course this still does not solve the problem that the actual request_irq() can fail due to vector exhaustion (or other reasons), but it at least prevents losing interrupts for already activated vectors due to resizing. >> Not at all a pretty solution, but it is, contrary to the status quo, well >> defined. The most important aspect is that it is well defined for the >> case of success: >> >> If it succeeds then there is no way that already requested interrupts >> can be lost or end up being redirected to the legacy PCI irq due to >> clearing the MSIX enable bit, which is a gazillion times better than >> the "let's hope it works" based tinkerware we have now. >> >> So, aside from the existing VFIO/PCI/MSIX thing being just half thought >> out, even thinking about proliferating this with IMS is bonkers. >> >> IMS is meant to avoid the problem of MSI-X which needs to disable MSI-X >> in order to expand the number of vectors. The use cases are going to be >> even more dynamic than the usual device drivers, so the lost interrupt >> issue will be much more likely to trigger. >> >> So no, we are not going to proliferate this complete ignorance of how >> MSI-X actually works and just cram another "feature" into code which is >> known to be incorrect. > Some of the issues of virtualizing MSI-X are unsolvable without > creating a new paravirtual interface, but obviously we want to work > with existing drivers and unmodified guests, so that's not an option. I grant you that unmodified, i.e. unaware guests are an issue which is pretty much impossible to fix, but at least it could have been documented. So this has been known for almost _ten_ years now and the only answer you have is 'not an option' for everything and not just for unaware guests? Sorry, but that's hilarious and it's absolutely not rocket science to fix that. 
You do not even have to modify _one_ single PCI device driver if you put the hypercall into the obvious place, i.e. pci_enable_msix[_range](), which is invoked when the driver initializes and sizes and negotiates the resources. Existing and completely unmodified drivers handle the situation perfectly fine when this function tells them: No, you can't get 1000 vectors, I grant you only one. Those which don't are broken even w/o virt. So the only situation where you really can by some means justify the 'try and pray' approach is on guests which are unaware of that mechanism for whatever reason. That's it. And that wants proper documentation of the potential consequences. > To work with what we've got, the vfio API describes the limitation of > the host interfaces via the VFIO_IRQ_INFO_NORESIZE flag. QEMU then > makes a choice in an attempt to better reflect what we can infer of the > guest programming of the device to incrementally enable vectors. We > could a) work to provide host kernel interfaces that allow us to remove > that noresize flag and b) decide whether QEMU's usage policy can be > improved on kernels where vector exhaustion is no longer an issue. For IMS "work with what we've got" is just not a reasonable answer, and IMS needs hypercalls sooner rather than later anyway for: 1) Message store in guest system RAM which obviously cannot be trapped. That's the case where a single subdevice (not VF) provided by the host-driver-managed PCI device is wrapped and exposed as a "PCI" device to the guest because that makes it simple to enumerate and probe. Obviously this covers the device-memory storage of IDXD as well because that's just an implementation detail. 2) Passthrough of a full IMS capable device into the guest, which has different requirements. What we are looking at right now is #1. #2 is a different story and needs way more than #1. So the question is where such a hypercall should be located. 
A) IMS device specific which is going to end up with code duplication and slightly different bugs all over the place. Not really desirable. B) Part of PCI/MSI[X] is the right thing as it can be used to address the existing issues as well in a sane way. So #B would look like this: device_init(dev) pci_enable_msix_range(dev, msidescs, minvec, maxvec) .... ret = hypercall(MSIX_ENABLE, &alloc_info); We need to find the right spot for that call and the alloc_info structure needs some thought obviously, but none of this is rocket science. The hypervisor can then: 1) Grant the request and do the required allocation/setup for the vectors which the guest device driver asked for 2) Grant and setup only $N vectors 3) Refuse the request and return a proper error code and everything just falls in place including sane error handling on all ends. Now add the reverse operation for MSIX_DISABLE and suddenly all of this becomes a design. I'm really amazed that this does not exist today because it's so bloody obvious. Maybe it's just bloody obvious to me, but I very well remember that I explained on several occasions in great length why resizing MSI-X with already active interrupts is bloody wrong. Both in email and face to face, the last face to face was at a table during Plumbers 2019 in Lisbon where IMS was discussed for the first time in public. Thanks, tglx ^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2021-06-25 21:19 UTC | newest] Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-22 10:16 Virtualizing MSI-X on IMS via VFIO Tian, Kevin 2021-06-22 15:50 ` Dave Jiang 2021-06-23 6:16 ` Tian, Kevin 2021-06-22 19:12 ` Alex Williamson 2021-06-22 23:59 ` Thomas Gleixner 2021-06-23 6:12 ` Tian, Kevin 2021-06-23 16:31 ` Thomas Gleixner 2021-06-23 16:41 ` Jason Gunthorpe 2021-06-23 23:41 ` Tian, Kevin 2021-06-23 23:37 ` Tian, Kevin 2021-06-24 1:18 ` Thomas Gleixner 2021-06-24 2:41 ` Tian, Kevin 2021-06-24 15:14 ` Thomas Gleixner 2021-06-24 21:44 ` Alex Williamson 2021-06-25 5:21 ` Tian, Kevin 2021-06-25 8:43 ` Thomas Gleixner 2021-06-25 12:42 ` Thomas Gleixner 2021-06-25 21:19 ` Thomas Gleixner 2021-06-25 8:29 ` Thomas Gleixner 2021-06-24 17:03 ` Jacob Pan 2021-06-23 15:19 ` Alex Williamson 2021-06-24 0:00 ` Tian, Kevin 2021-06-24 1:36 ` Thomas Gleixner 2021-06-24 2:20 ` Thomas Gleixner 2021-06-24 2:48 ` Alex Williamson 2021-06-24 12:06 ` [PATCH] vfio/pci: Document the MSI[X] resize side effects properly Thomas Gleixner 2021-06-24 22:22 ` Alex Williamson 2021-06-24 17:52 ` Virtualizing MSI-X on IMS via VFIO Alex Williamson 2021-06-24 0:43 ` Thomas Gleixner