Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem

From: Jason Gunthorpe <jgg@nvidia.com>
To: Cornelia Huck <cohuck@redhat.com>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>, <alex.williamson@redhat.com>,
	<kvm@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<liranl@nvidia.com>, <oren@nvidia.com>, <tzahio@nvidia.com>,
	<leonro@nvidia.com>, <yarong@nvidia.com>, <aviadye@nvidia.com>,
	<shahafs@nvidia.com>, <artemp@nvidia.com>, <kwankhede@nvidia.com>,
	<ACurrid@nvidia.com>, <gmataev@nvidia.com>, <cjia@nvidia.com>
Subject: Re: [PATCH RFC v1 0/3] Introduce vfio-pci-core subsystem
Date: Tue, 19 Jan 2021 15:42:26 -0400	[thread overview]
Message-ID: <20210119194226.GA916035@nvidia.com> (raw)
In-Reply-To: <20210119195610.18da1e78.cohuck@redhat.com>

On Tue, Jan 19, 2021 at 07:56:10PM +0100, Cornelia Huck wrote:
> On Mon, 18 Jan 2021 14:16:26 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Mon, Jan 18, 2021 at 05:00:09PM +0100, Cornelia Huck wrote:
> > 
> > > > You can say that all the HW specific things are in the mlx5_vfio_pci
> > > > driver. It is an unusual driver because it must bind to both the PCI
> > > > VF with a pci_driver and to the mlx5_core PF using an
> > > > auxiliary_driver. This is needed for the object lifetimes to be
> > > > correct.  
> > > 
> > > Hm... I might be confused about the usage of the term 'driver' here.
> > > IIUC, there are two drivers, one on the pci bus and one on the
> > > auxiliary bus. Is the 'driver' you're talking about here more the
> > > module you load (and not a driver in the driver core sense?)  
> > 
> > Here "driver" would be the common term meaning the code that realizes
> > a subsytem for HW - so mlx5_vfio_pci is a VFIO driver because it
> > ultimately creates a /dev/vfio* through the vfio subsystem.
> > 
> > The same way we usually call something like mlx5_en an "ethernet
> > driver" not just a "pci driver"
> > 
> > > Yes, sure. But it also shows that mlx5_vfio_pci aka the device-specific
> > > code is rather small in comparison to the common vfio-pci code.
> > > Therefore my question whether it will gain more specific changes (that
> > > cannot be covered via the auxiliary driver.)  
> > 
> > I'm not sure what you mean "via the auxiliary driver" - there is only
> > one mlx5_vfio_pci, and the non-RFC version with all the migration code
> > is fairly big.
> > 
> > The pci_driver contributes a 'struct pci_device *' and the
> > auxiliary_driver contributes a 'struct mlx5_core_dev *'. mlx5_vfio_pci
> > fuses them together into a VFIO device. Depending on the VFIO
> > callback, it may use an API from the pci_device or from the
> > mlx5_core_dev device, or both.
> 
> Let's rephrase my question a bit:
> 
> This proposal splits the existing vfio-pci driver into a "core"
> component and code actually implementing the "driver" part. For mlx5,
> an alternative "driver" is introduced that reuses the "core" component
> and also hooks into mlx5-specific code parts via the auxiliary device
> framework.

Yes, I think you understand it well

> (IIUC, the plan is to make existing special cases for devices follow
> mlx5's lead later.)

Well, it is a direction to go. I think adding 'if pci matches XX then
squeeze in driver Y' to vfio-pci was a hacky thing to do, this is a
way out.

We could just add 'if pci matches mlx5 then squeeze in driver mlx5'
too - but that is really too horific to seriously consider.

> I've been thinking of an alternative split: Keep vfio-pci as it is now,
> but add an auxiliary device. 

vfio-pci cannot use auxiliary device. It is only for connecting parts
of the same driver together. vfio-pci has no other parts to connect.

Further, that is not how the driver core in Linux is designed to
work. We don't have subsytems provide a pci_driver and then go look
around for a 2nd driver to somehow mix in. How would it know what
driver to pick? How would drivers be autoloaded? How can the user know
things worked out right? What if they didn't want that? So many
questions.

The standard driver core flow is always 
   pci_driver -> subsystem -> userspace

I don't think VFIO's needs are special, so why deviate?

> I guess my question is: into which callbacks will the additional
> functionality hook? If there's no good way to do what they need to do
> without manipulating the vfio-pci calls, my proposal will not work, and
> this proposal looks like the better way. But it's hard to tell without
> seeing the code, which is why I'm asking :)

Well, could we have done the existing special devices we have today
with that approach? What about the Intel thing I saw RFC'd a while
ago? Or the next set of mlx5 devices beyond storage? Or an future SIOV
device?

If you have doubts the idea is flexible enough, then I think you
already answered the question :)

LWN had a good article on these design patterns. This RFC is following
what LWN called the "tool box" pattern. We keep the driver and
subsystem close together and provide useful tools to share common
code. vfio_pci_core's stuff is a tool. It is a proven and flexable
pattern.

I think you are suggesting what LWN called a "midlayer mistake"
pattern. While that can be workable, it doesn't seem right
here. vfio-pci is not really a logical midlayer, and something acting
as a midlayer should never have a device_driver..

To quote lwn:
   The core thesis of the "midlayer mistake" is that midlayers are bad
   and should not exist. That common functionality which it is so
   tempting to put in a midlayer should instead be provided as library
   routines which can used, augmented, or ignored by each bottom level
   driver independently. Thus every subsystem that supports multiple
   implementations (or drivers) should provide a very thin top layer
   which calls directly into the bottom layer drivers, and a rich library
   of support code that eases the implementation of those drivers. This
   library is available to, but not forced upon, those drivers.

Given the exciting future of VFIO I belive strongly the above is the
right general design direction.

Jason