RE: Virtualizing MSI-X on IMS via VFIO

From: "Tian, Kevin" <kevin.tian@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	"Dey, Megha" <megha.dey@intel.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"Pan, Jacob jun" <jacob.jun.pan@intel.com>,
	"Jiang, Dave" <dave.jiang@intel.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"Lu, Baolu" <baolu.lu@intel.com>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"Luck, Tony" <tony.luck@intel.com>,
	"Kumar, Sanjay K" <sanjay.k.kumar@intel.com>,
	LKML <linux-kernel@vger.kernel.org>, KVM <kvm@vger.kernel.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	Marc Zyngier <maz@kernel.org>,
	"Bjorn Helgaas" <helgaas@kernel.org>
Subject: RE: Virtualizing MSI-X on IMS via VFIO
Date: Fri, 25 Jun 2021 05:21:11 +0000
Message-ID: <BN9PR11MB5433063F826F5CEC93BCE0E38C069@BN9PR11MB5433.namprd11.prod.outlook.com>
In-Reply-To: <20210624154434.11809b8f.alex.williamson@redhat.com>

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Friday, June 25, 2021 5:45 AM
> 
> On Thu, 24 Jun 2021 17:14:39 +0200
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > After studying the MSI-X specification again, I think there is another
> > option to solve this for MSI-X, i.e. the dynamic sizing part:
> >
> > MSI requires to disable MSI in order to update the number of enabled
> > vectors in the control word.
> 
> Exactly what part of the spec requires this?  This is generally the
> convention I expect too, and there are complications around contiguous
> vectors and data field alignment, but I'm not actually able to find a
> requirement in the spec that MSI Enable must be 0 when modifying other
> writable fields or that writable fields are latched when MSI Enable is
> set.
> 
> > MSI-X does not have that requirement as there is no 'number of used
> > vectors' control field. MSI-X provides a fixed sized vector table and
> > enabling MSI-X "activates" the full table.
> >
> > System software has to set proper messages in the table and eventually
> > associate the table entries to device (sub)functions if that's not
> > hardwired in the device and controlled by queue enablement etc.
> >
> > According to the specification there is no requirement for masked table
> > entries to contain a valid message:
> >
> >  "Mask Bit: ... When this bit is set, the function is prohibited from
> >                 sending a message using this MSI-X Table entry."
> >
> > which means that the function must reread the table entry when the mask
> > bit in the vector control word is cleared.
> 
> What is a "valid" message as far as the device is concerned?  "Valid"
> is meaningful to system software and hardware, the device doesn't care.
> 
> Like MSI above, I think the real question is when is the data latched
> by the hardware.  For MSI-X this seems to be addressed in (PCIe 5.0
> spec) 6.1.4.2 MSI-X Configuration:
> 
>   Software must not modify the Address, Data, or Steering Tag fields of
>   an entry while it is unmasked.
> 
> Followed by 6.1.4.5 Per-vector Masking and Function Masking:
> 
>   For MSI-X, a Function is permitted to cache Address and Data values
>   from unmasked MSI-X Table entries. However, anytime software unmasks
>   a currently masked MSI-X Table entry either by Clearing its Mask bit
>   or by Clearing the Function Mask bit, the Function must update any
>   Address or Data values that it cached from that entry. If software
>   changes the Address or Data value of an entry while the entry is
>   unmasked, the result is undefined.
> 
> So caching/latching occurs on unmask for MSI-X, but I can't find
> similar statements for MSI.  If you have, please note them.  It's
> possible MSI is per interrupt.

I checked the PCI Local Bus Specification rev 3.0. At that time MSI and
MSI-X were described and compared together in almost every paragraph
of 6.8.3.4 (Per-vector Masking and Function Masking). The paragraph
that you cited is the last one in that section. It's a pity that MSI is
not addressed in that paragraph, but it gives me the impression that
an MSI function is not permitted to cache address and data values.
After the MSI and MSI-X descriptions were split into separate sections
in the PCIe spec, that impression became much weaker.

If true, this even implies that software is free to change the
address/data while MSI is unmasked, which is counter-intuitive to
most people.

I then found the thread below:

https://lore.kernel.org/lkml/1468426713-31431-1-git-send-email-marc.zyngier@arm.com/

It identifies an MSI-capable device that does latch the message
content, forcing the kernel to start up the irq early, before
enabling the MSI capability.

So I have no definitive answer yet; let's see whether Thomas can help
identify better evidence.

> 
> Anyway, at least MSI-X if not also MSI could have a !NORESIZE
> implementation, which is why this flag exists in vfio.  Thanks,
> 

For MSI we can still mitigate the lost interrupt issue by having
Qemu allocate all possible MSI vectors up front when the guest
enables the MSI capability, and never free them until MSI is disabled.
There are at most 32 vectors per device anyway, and the total number
of vectors across all MSI devices in a platform should be limited. This
won't be a big problem once CPU vector exhaustion is relaxed.
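
To make the preallocation idea concrete, here is a minimal sketch of
how the vector count could be derived from the MSI Message Control
register. The bit positions (Multiple Message Capable in bits 3:1,
Multiple Message Enable in bits 6:4) come from the PCI spec; the macro
and function names are hypothetical and are not Qemu or kernel code.

```c
#include <stdint.h>

/* Field masks for the MSI Message Control register (names are local). */
#define MSI_CTRL_MMC_MASK  0x000eu  /* Multiple Message Capable, bits 3:1 */
#define MSI_CTRL_MME_MASK  0x0070u  /* Multiple Message Enable,  bits 6:4 */

/* Both fields encode log2(vector count); encodings 6 and 7 are reserved. */
static unsigned int msi_decode_count(unsigned int log2n)
{
    return log2n > 5 ? 0 : 1u << log2n;
}

/* Vectors the device advertises (1, 2, 4, 8, 16 or 32; 0 if reserved). */
static unsigned int msi_vectors_capable(uint16_t msg_ctrl)
{
    return msi_decode_count((msg_ctrl & MSI_CTRL_MMC_MASK) >> 1);
}

/* Vectors the guest has actually enabled. */
static unsigned int msi_vectors_enabled(uint16_t msg_ctrl)
{
    return msi_decode_count((msg_ctrl & MSI_CTRL_MME_MASK) >> 4);
}

/*
 * Hypothetical policy: when the guest enables MSI, back every vector the
 * device can ever use, not just the ones currently enabled, so a later
 * change of the enabled count never requires a free/re-allocate cycle.
 */
static unsigned int msi_vectors_to_preallocate(uint16_t msg_ctrl)
{
    return msi_vectors_capable(msg_ctrl);
}
```

With Multiple Message Capable = 5 and Multiple Message Enable = 2, the
guest uses 4 vectors but the sketch would reserve all 32.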

p.s. one question for Thomas. As Alex cited above, software must
not modify the Address, Data, or Steering Tag fields of an MSI-X
entry while it is unmasked. However, this rule might be violated
today in the flow below:

request_irq()
    __setup_irq()
        irq_startup()
            __irq_startup()
                irq_enable()
                    unmask_irq() <<<<<<<<<<<<<
        irq_setup_affinity()
            irq_do_set_affinity()
                msi_set_affinity() // when IR is disabled
                    irq_msi_update_msg()
                        pci_msi_domain_write_msg() <<<<<<<<<<<<<<

Doesn't the above flow update the MSI-X entry after it has been unmasked?

I may be overlooking something, though, since there are many branches
in the middle which may make the above flow invalid...
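
For reference, the ordering that the spec text seems to ask for can be
sketched as below. This is only a software model of one MSI-X table
entry, not the kernel's pci_msi_domain_write_msg(); all names are
hypothetical, and real code would use MMIO accessors (with a read back
to flush posted writes) instead of plain stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of one MSI-X table entry: address low/high, data, vector control. */
struct msix_entry_model {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t vector_ctrl;           /* bit 0 is the per-vector Mask bit */
};

/* Message to be programmed into an entry. */
struct msix_msg_model {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
};

#define MSIX_CTRL_MASKBIT 0x1u

static bool msix_entry_is_masked(const struct msix_entry_model *e)
{
    return (e->vector_ctrl & MSIX_CTRL_MASKBIT) != 0;
}

/*
 * Update the message while honoring "software must not modify the
 * Address, Data, or Steering Tag fields of an entry while it is
 * unmasked": mask first if needed, write, then restore the mask state.
 */
static void msix_entry_update_masked(struct msix_entry_model *e,
                                     const struct msix_msg_model *msg)
{
    assert(e && msg);

    bool was_masked = msix_entry_is_masked(e);

    if (!was_masked)
        e->vector_ctrl |= MSIX_CTRL_MASKBIT;

    e->addr_lo = msg->addr_lo;
    e->addr_hi = msg->addr_hi;
    e->data = msg->data;

    /* Unmasking is the point where the function must refresh its cache. */
    if (!was_masked)
        e->vector_ctrl &= ~MSIX_CTRL_MASKBIT;
}
```

In this model an affinity change on an already unmasked entry would go
through a mask/write/unmask cycle instead of writing the entry directly.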

Thanks
Kevin