Re: [PATCH V2 7/8] vfio/pci: Support dynamic MSI-x

From: Alex Williamson <alex.williamson@redhat.com>
To: Reinette Chatre <reinette.chatre@intel.com>
Cc: <jgg@nvidia.com>, <yishaih@nvidia.com>,
	<shameerali.kolothum.thodi@huawei.com>, <kevin.tian@intel.com>,
	<tglx@linutronix.de>, <darwi@linutronix.de>,
	<kvm@vger.kernel.org>, <dave.jiang@intel.com>,
	<jing2.liu@intel.com>, <ashok.raj@intel.com>,
	<fenghua.yu@intel.com>, <tom.zanussi@linux.intel.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [PATCH V2 7/8] vfio/pci: Support dynamic MSI-x
Date: Tue, 4 Apr 2023 12:24:44 -0600
Message-ID: <20230404122444.59e36a99.alex.williamson@redhat.com>
In-Reply-To: <5efa361d-012b-bdb6-b5e5-869887bde98d@intel.com>

On Tue, 4 Apr 2023 09:54:46 -0700
Reinette Chatre <reinette.chatre@intel.com> wrote:

> Hi Alex,
> 
> On 4/3/2023 8:18 PM, Alex Williamson wrote:
> > On Mon, 3 Apr 2023 15:50:54 -0700
> > Reinette Chatre <reinette.chatre@intel.com> wrote:  
> >> On 4/3/2023 1:22 PM, Alex Williamson wrote:  
> >>> On Mon, 3 Apr 2023 10:31:23 -0700
> >>> Reinette Chatre <reinette.chatre@intel.com> wrote:  
> >>>> On 3/31/2023 3:24 PM, Alex Williamson wrote:    
> >>>>> On Fri, 31 Mar 2023 10:49:16 -0700
> >>>>> Reinette Chatre <reinette.chatre@intel.com> wrote:      
> >>>>>> On 3/30/2023 3:42 PM, Alex Williamson wrote:      
> >>>>>>> On Thu, 30 Mar 2023 16:40:50 -0600
> >>>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>>>>         
> >>>>>>>> On Tue, 28 Mar 2023 14:53:34 -0700
> >>>>>>>> Reinette Chatre <reinette.chatre@intel.com> wrote:
> >>>>>>>>        
> 
> 
> ...
> 
> >>> If the goal is to allow the user to swap one eventfd for another, where
> >>> the result will always be the new eventfd on success or the old eventfd
> >>> on error, I don't see that this code does that, or that we've ever
> >>> attempted to make such a guarantee.  If the ioctl errors, I think the
> >>> eventfds are generally deconfigured.   We certainly have the unwind code
> >>> that we discussed earlier that deconfigures all the vectors previously
> >>> touched in the loop (which seems to be another path where we could
> >>> de-allocate from the set of initial ctxs).    
> >>
> >> Thank you for your patience in hearing and addressing my concerns. I plan
> >> to remove new_ctx in the next version.
> >>  
> >>>>> devices supporting vdev->has_dyn_msix only ever have active contexts
> >>>>> allocated?  Thanks,      
> >>>>
> >>>> What do you see as an "active context"? A policy that is currently enforced
> >>>> is that an allocated context always has an allocated interrupt associated
> >>>> with it. I do not see how this could be expanded to also require an
> >>>> enabled interrupt because interrupt enabling requires a trigger that
> >>>> may not be available.    
> >>>
> >>> A context is essentially meant to track a trigger, ie. an eventfd
> >>> provided by the user.  In the static case all the irqs are necessarily
> >>> pre-allocated, therefore we had no reason to consider a dynamic array
> >>> for the contexts.  However, a given context is really only "active" if
> >>> it has a trigger, otherwise it's just a placeholder.  When the
> >>> placeholder is filled by an eventfd, the pre-allocated irq is enabled.    
> >>
> >> I see.
> >>  
> >>>
> >>> This proposal seems to be a hybrid approach, pre-allocating some
> >>> initial set of irqs and contexts and expecting the differentiation to
> >>> occur only when new vectors are added, though we have some disagreement
> >>> about this per above.  Unfortunately I don't see an API to enable MSI-X
> >>> without some vectors, so some pre-allocation of irqs seems to be
> >>> required regardless.    
> >>
> >> Right. pci_alloc_irq_vectors() or equivalent continues to be needed to
> >> enable MSI-X. Even so, it does seem possible (within vfio_msi_enable())
> >> to just allocate one vector using pci_alloc_irq_vectors()
> >> and then immediately free it using pci_msix_free_irq(). What do you think?  
> > 
> > QEMU does something similar but I think it can really only be described
> > as a hack.  In this case I think we can work with them being allocated
> > since that's essentially the static path.  
> 
> ok. In this case I understand the hybrid approach to be required. Without
> something (a hack) like this I am not able to see how an "active context"
> policy can be enforced though. Interrupts allocated during MSI-X enabling may
> not have eventfd associated and thus cannot adhere to an "active context" policy. I
> understand from earlier comments that we do not want to track where contexts
> are allocated so I can only see a way to enforce a policy that a context has
> an allocated interrupt, but not an enabled interrupt.

We're talking about the contexts that we now allocate in the xarray to
store the eventfd linkage, right?  We need to pre-allocate some irqs
both to satisfy the API and to support non-dynamic MSI-X devices, but
we don't need to pre-allocate contexts.  The logic that I propose below
supports lookup of the pre-allocated irqs for all cases and falls back
to allocating a new irq only for cases that support it.  irqs and
contexts aren't exactly 1:1 for the dynamic case due to the artifacts
of the API, but the model supports allocating contexts only as they're
used, ie. when they're "active".
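
To make the lookup-then-fallback flow concrete, here is a minimal
userspace sketch.  It is only a model, not kernel code: struct model_dev,
model_msix_enable(), model_alloc_irq(), model_set_trigger(), MODEL_NVEC
and the starting irq number 32 are all hypothetical names and values made
up for illustration.  It assumes irqs are pre-allocated at enable time,
contexts are created only when a trigger is set, and new irqs are
allocated only for devices that support dynamic MSI-X.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_NVEC 8	/* hypothetical MSI-X table size */

/* Userspace stand-in for device state; not the kernel structures. */
struct model_dev {
	bool has_dyn_msix;		/* device supports dynamic MSI-X */
	int irq[MODEL_NVEC];		/* irq per vector, 0 = not allocated */
	bool ctx_active[MODEL_NVEC];	/* context exists only with a trigger */
	int next_irq;			/* next irq number the model hands out */
};

/* Model of MSI-X enable: pre-allocate nvec irqs, allocate no contexts. */
void model_msix_enable(struct model_dev *d, bool dyn, unsigned int nvec)
{
	unsigned int i;

	assert(nvec <= MODEL_NVEC);
	d->has_dyn_msix = dyn;
	d->next_irq = 32;	/* arbitrary starting number for the model */
	for (i = 0; i < MODEL_NVEC; i++) {
		d->irq[i] = i < nvec ? d->next_irq++ : 0;
		d->ctx_active[i] = false;
	}
}

/* Look up a pre-allocated irq first; allocate only if dynamic. */
int model_alloc_irq(struct model_dev *d, unsigned int vector)
{
	if (vector >= MODEL_NVEC)
		return -EINVAL;
	if (d->irq[vector] > 0)
		return d->irq[vector];
	if (!d->has_dyn_msix)
		return -EINVAL;

	d->irq[vector] = d->next_irq++;
	return d->irq[vector];
}

/* Setting a trigger is the only place a context becomes "active". */
int model_set_trigger(struct model_dev *d, unsigned int vector)
{
	int irq = model_alloc_irq(d, vector);

	if (irq < 0)
		return irq;

	d->ctx_active[vector] = true;
	return irq;
}
```

In this model, enabling with two vectors and then setting a trigger on
vector 5 succeeds only when has_dyn_msix is set, and vector 1 keeps its
pre-allocated irq without ever getting a context.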

> >> If I understand correctly this can be done without allocating any context
> >> and leave MSI-X enabled without any interrupts allocated. This could be a
> >> way to accomplish the "active context" policy for dynamic allocation.
> >> This is not a policy that can be applied broadly to interrupt contexts though
> >> because MSI and non-dynamic MSI-X could still have contexts with allocated
> >> interrupts without eventfd.  
> > 
> > I think we could come up with wrappers that handle all cases, for
> > example:
> > 
> > int vfio_pci_alloc_irq(struct vfio_pci_core_device *vdev,
> > 		       unsigned int vector, int irq_type)
> > {
> > 	struct pci_dev *pdev = vdev->pdev;
> > 	struct msi_map map;
> > 	int irq;
> > 
> > 	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX)
> > 		return pdev->irq ?: -EINVAL;
> > 
> > 	irq = pci_irq_vector(pdev, vector);
> > 	if (irq > 0 || irq_type == VFIO_PCI_MSI_IRQ_INDEX ||
> > 	    !vdev->has_dyn_msix)
> > 		return irq;
> > 
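> > 	map = pci_msix_alloc_irq_at(pdev, vector, NULL);
> > 
> > 	/* On failure map.index holds a negative errno, else return the irq */
> > 	return map.index < 0 ? map.index : map.virq;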
> > }
> > 
> > void vfio_pci_free_irq(struct vfio_pci_core_device *vdev,
> > 		       unsigned int vector, int irq_type)
> > {
> > 	struct pci_dev *pdev = vdev->pdev;
> > 	struct msi_map map;
> > 	int irq;
> > 
> > 	/* Only dynamically allocated MSI-X irqs are freed individually */
> > 	if (irq_type != VFIO_PCI_MSIX_IRQ_INDEX ||
> > 	    !vdev->has_dyn_msix)
> > 		return;
> > 
> > 	irq = pci_irq_vector(pdev, vector);
> > 	if (WARN_ON(irq < 0))
> > 		return;
> > 
> > 	map.index = vector;
> > 	map.virq = irq;
> > 
> > 	pci_msix_free_irq(pdev, map);
> > }  
> 
> Thank you very much for taking the time to write this out. I am not able to
> see where vfio_pci_alloc_irq()/vfio_pci_free_irq() would be called for
> an INTx interrupt. Is the INTx handling there for robustness or am I
> missing how it should be used for INTx interrupts?

Mostly just trying to illustrate that all interrupt types could be
supported; if it doesn't make sense for INTx, drop it.

> > At that point, maybe we'd check whether it makes sense to embed the irq
> > alloc/free within the ctx alloc/free.  
> 
> I think doing so would be the right thing to do since it helps
> to enforce the policy that interrupts and contexts are allocated together.
> I think this can be done when switching around the initialization within 
> vfio_msi_set_vector_signal(). I need to look into this more.

Interrupts and contexts allocated together would be ideal, but given the
API and the need to support non-dynamic devices, I think drawing from the
initial allocation where we can is a reasonable and simple compromise.
Actually, there could be a latency and reliability advantage to hanging
on to the irq when an eventfd is unset; maybe we should only free irqs on
MSI-X teardown and otherwise use the allocated irqs as a cache.  Maybe
worth thinking about.  Thanks,

Alex
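
A rough userspace sketch of the irq cache idea above, in case it helps
the discussion.  Everything here is hypothetical (struct cache_dev,
cache_set_trigger(), cache_unset_trigger(), cache_teardown(), CACHE_NVEC,
the starting irq number 32); it only models the policy of keeping an irq
allocated when its eventfd is unset and releasing irqs at teardown, and
says nothing about how the real allocation calls behave.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define CACHE_NVEC 4	/* hypothetical number of vectors */

/* Userspace model only; not the vfio or PCI core structures. */
struct cache_dev {
	int irq[CACHE_NVEC];		/* 0 = no irq allocated */
	bool has_trigger[CACHE_NVEC];	/* eventfd currently set */
	int next_irq;			/* next irq number the model hands out */
	int allocs;			/* number of real allocations performed */
};

void cache_init(struct cache_dev *d)
{
	unsigned int i;

	d->next_irq = 32;	/* arbitrary starting number for the model */
	d->allocs = 0;
	for (i = 0; i < CACHE_NVEC; i++) {
		d->irq[i] = 0;
		d->has_trigger[i] = false;
	}
}

/* Reuse a cached irq if one exists, otherwise allocate a new one. */
int cache_set_trigger(struct cache_dev *d, unsigned int v)
{
	if (v >= CACHE_NVEC)
		return -EINVAL;

	if (!d->irq[v]) {
		d->irq[v] = d->next_irq++;
		d->allocs++;
	}
	d->has_trigger[v] = true;
	return d->irq[v];
}

/* Unsetting the eventfd drops the trigger but keeps the irq cached. */
void cache_unset_trigger(struct cache_dev *d, unsigned int v)
{
	assert(v < CACHE_NVEC);
	d->has_trigger[v] = false;
}

/* Teardown is the only point where irqs are released. */
void cache_teardown(struct cache_dev *d)
{
	unsigned int i;

	for (i = 0; i < CACHE_NVEC; i++) {
		d->irq[i] = 0;
		d->has_trigger[i] = false;
	}
}
```

With this policy, setting, unsetting and re-setting a trigger on the same
vector performs a single allocation, so the second set cannot fail for
lack of an irq and avoids the allocation latency.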