Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

From: Jason Gunthorpe <jgg@nvidia.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Leon Romanovsky <leon@kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Jakub Kicinski <kuba@kernel.org>,
	linux-pci <linux-pci@vger.kernel.org>,
	<linux-rdma@vger.kernel.org>, Netdev <netdev@vger.kernel.org>,
	Don Dutile <ddutile@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
Date: Fri, 15 Jan 2021 09:51:01 -0400	[thread overview]
Message-ID: <20210115135101.GZ4147@nvidia.com> (raw)
In-Reply-To: <CAKgT0UcaRgY4XnM0jgWRvwBLj+ufiabFzKPyrf3jkLrF1Z8zEg@mail.gmail.com>

On Thu, Jan 14, 2021 at 01:43:57PM -0800, Alexander Duyck wrote:

> > > In addition Leon still hasn't answered my question on preventing the
> > > VF driver from altering entries beyond the ones listed in the table.
> >
> > Of course this is blocked, the FW completely revokes the HW resource
> > backing the vectors.
> 
> One of the troubles with this is that I am having to take your word
> for it.

This is a Linux patch review, not a security review of a HW
implementation. There are million ways to screw up a PCI device
implementation and in SRIOV the PCI device HW implementation forms
part of the trust base of the hypervisor.

If the HW API can be implemented securely and the Linux code is
appropriate is the only question here.

In this case mlx5 HW is implemented correctly and securely, if you
don't belive then you are free not to use it.

> What it defines is the aperture available in MMIO to define the
> possible addresses and values to be written to trigger the
> interrupts. The device itself plays a large role in defining the
> number of interrupts ultimately requested.

Again you are confused about what is going on here - this is about
reconfiguring the HW so that MSI vector entries exist or not - it has
absoultely nothing to do with the driver. We are not optimizing for
the case where the driver does not use MSI vectors the VF has
available.

> > > At a minimum I really think we need to go through and have a clear
> > > definition on when updating the MSI-X table size is okay and when it
> > > is not. I am not sure just saying to not update it when a driver isn't
> > > attached is enough to guarantee all that.
> >
> > If you know of a real issue then please state it, other don't fear
> > monger "maybe" issues that don't exist.
>
> Well I don't have visibility into your firmware so I am not sure what
> is going on in response to this command so forgive me when I do a bit
> of fear mongering when somebody tells me that all this patch set does
> is modify the VF configuration space.

You were not talking about the FW, "is okay and when it is not" is a
*Linux* question.

> > > What we are talking about is the MSI-X table size. Not the number of
> > > MSI-X vectors being requested by the device driver. Those are normally
> > > two seperate things.
> >
> > Yes, table size is what is critical. The number of entries in that BAR
> > memory is what needs to be controlled.
> 
> That is where we disagree. 

Huh? You are disagreeing this is how the mlx5 device works?

> Normally as a part of that the device itself will place some
> limit on how many causes and vectors you can associate before you even
> get to the MSI-X table.

For mlx5 this cause limit is huge. With IMS it can even be higher than
the 2K MSI-X limit. Remember on an x86 system you get 256 interrupt
vectors per CPU *and* per vCPU, so with interrupt remapping there can
be huge numbers of interrupts required.

Your "normally" is for simplistic fixed function HW devices not
intended for use at this scale.

> The MSI-X table size is usually a formality that defines the upper
> limit on the number of entries the device might request.

It is not a formality. PCI rules require *actual physical HW* to be
dedicated to the MSI vector entries.

Think of it like this - the device has a large global MSI-X table of
say 2K entires. This is the actual physical HW SRAM backing MSI
entires required by PCIe.

The HW will map the MSI-X table BAR space in every PF/VF to a slice of
that global table. If the PCI Cap says 8 entries then the MSI-X page has
only 8 entries, everything else is /dev/null.

Global MSI entries cannot be shared - the total of all PF/VFs cap
field must not be more than 2K.

One application requires 2K MSI-X on a single function because it uses
VDPA devices and VT-d interrupt remapping

Another application requires 16 MSI-X on 128 VFs because it is using
SRIOV with VMs having 16 vCPUs.

The HW is configured differently in both cases. It is not something
that can be faked with VFIO!

> > That is completely different, in the hypervisor there is no idea how
> > many vectors a guest OS will create. The FW is told to only permit 1
> > vector. How is the guest to know this if we don't update the config
> > space *as the standard requires* ?
> 
> Doesn't the guest driver talk to the firmware? Last I knew it had to
> request additional resources such as queues and those come from the
> firmware don't they?

That is not how things work. Because VFIO has to be involved in
setting up interrupt remapping through its MSI emulation we don't get
to use a dynamic FW only path as something like IMS might imagine.

That would be so much better, but lots of things are not ready for
that.

> > 1) The FW has to be told of the limit and everything has to be in sync
> >    If the FW is out of sync with the config space then everything
> >    breaks if the user makes even a small mistake - for instance
> >    forgetting to use the ioctl to override vfio. This is needlessly
> >    frail and complicated.
> 
> That is also the way I feel about the sysfs solution.

Huh? The sysfs is much safer. If the write() succeeds I can't think of
any way the system would be left broken? Why do you think it is frail?

> I'm just kind of surprised the firmware itself isn't providing some
> sort of messaging on the number of interrupt vectors that a given

It does, it is the PCI cap, just because you keep saying it isn't used
doesn't make that true :)

> device has since I assume that it is already providing you with
> information on the number of queues and such since that isn't
> provided by any other mechanism.

Queues are effectively unlimited.

Jason