From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Alexey Kardashevskiy <aik@ozlabs.ru>,
	cohuck@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, liranl@nvidia.com, oren@nvidia.com,
	tzahio@nvidia.com, leonro@nvidia.com, yarong@nvidia.com,
	aviadye@nvidia.com, shahafs@nvidia.com, artemp@nvidia.com,
	kwankhede@nvidia.com, ACurrid@nvidia.com, cjia@nvidia.com,
	yishaih@nvidia.com, mjrosato@linux.ibm.com
Subject: Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers
Date: Fri, 19 Mar 2021 19:59:43 -0300
Message-ID: <20210319225943.GH2356281@nvidia.com>
In-Reply-To: <20210319150809.31bcd292@omen.home.shazbot.org>

On Fri, Mar 19, 2021 at 03:08:09PM -0600, Alex Williamson wrote:
> On Fri, 19 Mar 2021 17:07:49 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Fri, Mar 19, 2021 at 11:36:42AM -0600, Alex Williamson wrote:
> > > On Fri, 19 Mar 2021 17:34:49 +0100
> > > Christoph Hellwig <hch@lst.de> wrote:
> > >   
> > > > On Fri, Mar 19, 2021 at 01:28:48PM -0300, Jason Gunthorpe wrote:  
> > > > > The wrinkle I don't yet have an easy answer to is how to load vfio_pci
> > > > > as a universal "default" within the driver core lazy bind scheme and
> > > > > still have working module autoloading... I'm hoping to get some
> > > > > research into this..    
> > > 
> > > What about using MODULE_SOFTDEP("pre: ...") in the vfio-pci base
> > > driver, which would load all the known variants in order to influence
> > > the match, and therefore probe ordering?  
> > 
> > The way the driver core works is to first match against the already
> > loaded driver list, then trigger an event for module loading and when
> > new drivers are registered they bind to unbound devices.
> 
> The former is based on id_tables, the latter on MODULE_DEVICE_TABLE; we
> don't have either of those.

Well, today we don't, but Max's series here adds id_tables to the
special devices, and a MODULE_DEVICE_TABLE would come too if we do the
flavours thing below.

My starting position is that everything should have these tables and
they should work properly.
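
To make that concrete, a variant driver's table could be as small as
the sketch below - the driver name and IDs are made up purely for
illustration:

  #include <linux/module.h>
  #include <linux/pci.h>

  /* IDs invented for illustration only */
  static const struct pci_device_id foo_vfio_pci_table[] = {
          { PCI_DEVICE_SUB(0x1234, 0x5678, 0x1234, 0x0001) },
          {}
  };
  MODULE_DEVICE_TABLE(pci, foo_vfio_pci_table);

  static int foo_vfio_pci_probe(struct pci_dev *pdev,
                                const struct pci_device_id *id)
  {
          /* create and register the vfio_device for pdev here */
          return 0;
  }

  static struct pci_driver foo_vfio_pci_driver = {
          .name     = "foo-vfio-pci",
          .id_table = foo_vfio_pci_table,
          .probe    = foo_vfio_pci_probe,
  };
  module_pci_driver(foo_vfio_pci_driver);
  MODULE_LICENSE("GPL");

The MODULE_DEVICE_TABLE() is what generates the modalias aliases, so
kmod can autoload exactly the right variant module for a given device.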

> As noted to Christoph, the cases where we want a vfio driver to
> bind to anything automatically is the exception.

I agree vfio should not automatically claim devices, but once vfio is
told to claim a device everything after that should be automatic.

> > One answer is to have userspace udev have the "hook" here and when a
> > vfio flavour mod alias is requested on a PCI device it swaps in
> > vfio_pci if it can't find an alternative.
> > 
> > The dream would be a system with no vfio modules loaded could do some
> > 
> >  echo "vfio" > /sys/bus/pci/xxx/driver_flavour
> > 
> > And a module would be loaded and a struct vfio_device is created for
> > that device. Very easy for the user.
> 
> This is like switching a device to a parallel universe where we do
> want vfio drivers to bind automatically to devices.

Yes.

If we do this I'd probably suggest that driver_override be demoted to
a userspace-compat path, and 'echo vfio > driver_override' would just
set the flavour.
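
Roughly, the compat could be no more than something like this in the
override store path - only a sketch of the idea, set_driver_flavour()
and exact_override_store() are hypothetical helpers, not existing
kernel APIs:

  static ssize_t driver_override_store(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count)
  {
          /* compat: interpret the old "vfio" override as a flavour
           * request instead of an exact driver-name match */
          if (sysfs_streq(buf, "vfio")) {
                  int ret = set_driver_flavour(dev, "vfio");

                  return ret ? ret : count;
          }

          /* anything else keeps today's exact-name override */
          return exact_override_store(dev, buf, count);
  }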

As-is driver_override seems dangerous, as overriding the matching
table could surely allow root userspace to crash the machine. In
situations with trusted boot/signed modules this shouldn't be
possible.

> > > If we coupled that with wildcard support in driver_override, ex.
> > > "vfio_pci*", and used consistent module naming, I think we'd only need
> > > to teach userspace about this wildcard and binding to a specific module
> > > would come for free.  
> > 
> > What would the wildcard do?
> 
> It allows a driver_override to match more than one driver, not too
> dissimilar to your driver_flavour above.  In this case it would match
> all driver names starting with "vfio_pci".  For example if we had:
> 
> softdep vfio-pci pre: vfio-pci-foo vfio-pci-bar
>
> Then we'd pre-seed the condition that drivers foo and bar precede the
> base vfio-pci driver, each will match the device to the driver and have
> an opportunity in their probe function to either claim or skip the
> device.  Userspace could also set an exact driver_override, for
> example if they want to force using the base vfio-pci driver or go
> directly to a specific variant.

Okay, I see. The problem is that this makes 'vfio-pci' monolithic: in
normal situations it will load *everything*.

While that might not seem too bad with these simple drivers, at least
the mlx5 migration driver will have a large dependency tree and pull
in lots of other modules. Even Max's sample from v1 pulls in mlx5_core.ko
and a bunch of other stuff in its orbit.

This is why I want to try for fine-grained autoloading first. It
really is the elegant solution if we can work it out.

> > Open coding a match table in probe() and returning failure feels hacky
> > to me.
> 
> How's it any different than Max's get_foo_vfio_pci_driver() that calls
> pci_match_id() with an internal match table?  

Well, I think that is hacky too - but it is hacky only to service
userspace compatibility, so let's put that aside.
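
For reference, the pattern we're both calling hacky looks roughly like
this - a sketch with made-up IDs:

  static const struct pci_device_id foo_internal_table[] = {
          { PCI_DEVICE(0x1234, 0x5678) },
          {}
  };

  static int foo_vfio_pci_probe(struct pci_dev *pdev,
                                const struct pci_device_id *id)
  {
          /* The driver core already matched pdev against a wide
           * id_table; re-checking it against a private table and
           * bailing out is the hand-rolled matching in question. */
          if (!pci_match_id(foo_internal_table, pdev))
                  return -ENODEV;

          return 0;        /* continue with normal vfio setup */
  }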

> It seems a better fit for the existing use cases; for example, the
> IGD variant can use a single-line table to exclude all except Intel
> VGA class devices in its probe callback, then test availability of
> the extra regions we'd expose, otherwise return -ENODEV.

I don't think we should over-focus on these two firmware-triggered
examples. I looked at the Intel GPU driver and it already only reads
the firmware thing for certain PCI IDs, so we can absolutely generate
a narrow match table for it. The same is true for the NVIDIA GPU.

The fact that this is hard or whatever is beside the point - future
drivers in this scheme should have exact match tables.

The mlx5 sample is a good example, as it matches a very narrow NVMe
device that is properly labeled with a subvendor ID. It does not match
every NVMe device and then run code to figure it out. I think this is
the right thing to do as it is the only thing that would give us fine
grained module loading.
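
In id_table terms that is the difference between a narrow entry and a
class wildcard - a sketch, the device/subdevice numbers are invented
and the table name is illustrative; only the class code and Mellanox
vendor ID are real:

  static const struct pci_device_id sample_vfio_pci_table[] = {
          /* narrow: only the subvendor/subdevice-tagged device, so
           * only this module's modalias matches and gets autoloaded */
          { PCI_DEVICE_SUB(PCI_VENDOR_ID_MELLANOX, 0x1234,
                           PCI_VENDOR_ID_MELLANOX, 0x0001) },
          {}
  };

  /* By contrast, a broad class match like the below would drag the
   * module in for every NVMe device and push the real decision into
   * probe():
   *   { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
   */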

Even so, I'm not *so* worried about "over-matching" - if IGD or the
NVIDIA driver loads on a wide set of devices then it can just not
enable its extended functionality. It wastes some kernel memory, but
it is OK.

And if some driver *really* gets stuck here the true answer is to
improve the driver core match capability.

> devices in the deny-list and non-endpoint devices.  Many drivers
> clearly place implicit trust in their id_table; others don't.  In the
> case of meta drivers, I think it's fair to make use of the latter
> approach.

Well, AFAIK, the driver core doesn't have a 'try probe, if it fails
then try another driver' approach. One device, one driver. Am I
missing something?

I would prefer not to propose to Greg such a radical change to how
driver loading works.

I also think the softdep/implicit loading/ordering will not be
welcomed; it feels weird to me.

> > We should NOT be avoiding the standard infrastructure for matching
> > drivers to devices by re-implementing it poorly.
> 
> I take some blame for the request_module() behavior of vfio-platform,
> but I think we're on the same page that we don't want to turn
> vfio-pci

Okay, that's good. We can explore the driver core side and see what
could work, to decide whether we can do fine-grained loading or not.

The question now is how to stage all this work. It is too big to do
all in one shot - can we reform Max's series into something mergeable
without the driver core part? For instance, removing the
pci_device_driver and dedicated modules and only doing the compat
path? It would look the same from userspace, but the internals would
be split into a library mode.

The patch to add in the device driver would then be small enough to
go along with future driver core changes, and if that fails we still
have several fallback plans that use the librarized version.

> into a nexus for loading variant drivers.  Whatever solution we use for
> vfio-pci might translate to replacing that vfio-platform behavior.  

Yes, I'd want to see this too. It shows it is a general enough
idea. Greg has been big on asking for lots of users for driver core
changes, so it is good we have more use cases.

Thanks,
Jason

Thread overview: 53+ messages
2021-03-09  8:33 [PATCH v3 0/9] Introduce vfio-pci-core subsystem Max Gurtovoy
2021-03-09  8:33 ` [PATCH 1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c Max Gurtovoy
2021-03-09  8:33 ` [PATCH 2/9] vfio-pci: rename vfio_pci_private.h to vfio_pci_core.h Max Gurtovoy
2021-03-09  8:33 ` [PATCH 3/9] vfio-pci: rename vfio_pci_device to vfio_pci_core_device Max Gurtovoy
2021-03-09  8:33 ` [PATCH 4/9] vfio-pci: introduce vfio_pci_core subsystem driver Max Gurtovoy
2021-03-09  8:33 ` [PATCH 5/9] vfio/pci: introduce vfio_pci_device structure Max Gurtovoy
2021-03-09  8:33 ` [PATCH 6/9] vfio-pci-core: export vfio_pci_register_dev_region function Max Gurtovoy
2021-03-09  8:33 ` [PATCH 7/9] vfio/pci_core: split nvlink2 to nvlink2gpu and npu2 Max Gurtovoy
2021-03-10  8:08   ` Christoph Hellwig
2021-03-09  8:33 ` [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers Max Gurtovoy
2021-03-10  6:39   ` Alexey Kardashevskiy
2021-03-10 12:57     ` Max Gurtovoy
2021-03-10 13:02       ` Jason Gunthorpe
2021-03-10 14:24         ` Alexey Kardashevskiy
2021-03-10 19:40           ` Jason Gunthorpe
2021-03-11  1:20             ` Alexey Kardashevskiy
2021-03-11  1:34               ` Jason Gunthorpe
2021-03-11  1:42                 ` Alexey Kardashevskiy
2021-03-11  2:00                   ` Jason Gunthorpe
2021-03-11  7:54                     ` Alexey Kardashevskiy
2021-03-11  9:44                       ` Max Gurtovoy
2021-03-11 16:51                         ` Jason Gunthorpe
2021-03-11 17:01                       ` Jason Gunthorpe
2021-03-10 14:19       ` Alexey Kardashevskiy
2021-03-11  1:10         ` Max Gurtovoy
2021-03-19 15:23       ` Alex Williamson
2021-03-19 16:17         ` Jason Gunthorpe
2021-03-19 16:20           ` Christoph Hellwig
2021-03-19 16:28             ` Jason Gunthorpe
2021-03-19 16:34               ` Christoph Hellwig
2021-03-19 17:36                 ` Alex Williamson
2021-03-19 20:07                   ` Jason Gunthorpe
2021-03-19 21:08                     ` Alex Williamson
2021-03-19 22:59                       ` Jason Gunthorpe [this message]
2021-03-20  4:40                         ` Alex Williamson
2021-03-21 12:58                           ` Jason Gunthorpe
2021-03-22 16:40                             ` Alex Williamson
2021-03-23 19:32                               ` Jason Gunthorpe
2021-03-24  2:39                                 ` Alexey Kardashevskiy
2021-03-29 23:10                                 ` Alex Williamson
2021-04-01 13:04                                   ` Cornelia Huck
2021-04-01 13:12                                   ` Jason Gunthorpe
2021-04-01 21:49                                     ` Alex Williamson
2021-03-22 15:11                     ` Christoph Hellwig
2021-03-22 16:44                       ` Jason Gunthorpe
2021-03-23 13:17                         ` Christoph Hellwig
2021-03-23 13:42                           ` Jason Gunthorpe
2021-03-09  8:33 ` [PATCH 9/9] vfio/pci: export igd support into vendor vfio_pci driver Max Gurtovoy
2021-03-10  8:15   ` Christoph Hellwig
2021-03-10 12:31     ` Jason Gunthorpe
2021-03-11 11:37       ` Christoph Hellwig
2021-03-11 12:09         ` Max Gurtovoy
2021-03-11 15:43         ` Jason Gunthorpe
