Re: [libvirt] [PATCH 0/3] sample: vfio mdev display devices.

From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Neo Jia <cjia@nvidia.com>,
	kvm@vger.kernel.org, Erik Skultety <eskultet@redhat.com>,
	libvirt <libvir-list@redhat.com>,
	Tina Zhang <tina.zhang@intel.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	intel-gvt-dev@lists.freedesktop.org
Subject: Re: [libvirt] [PATCH 0/3] sample: vfio mdev display devices.
Date: Tue, 24 Apr 2018 16:59:18 -0600	[thread overview]
Message-ID: <20180424165918.5c2ef037@w520.home> (raw)
In-Reply-To: <a5f4ec49-d5aa-e853-03ad-7ca9d7e38206@nvidia.com>

On Wed, 25 Apr 2018 01:20:08 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 4/24/2018 3:10 AM, Alex Williamson wrote:
> > On Wed, 18 Apr 2018 12:31:53 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> On Mon,  9 Apr 2018 12:35:10 +0200
> >> Gerd Hoffmann <kraxel@redhat.com> wrote:
> >>  
> >>> This little series adds three drivers, for demo-ing and testing vfio
> >>> display interface code.  There is one mdev device for each interface
> >>> type (mdpy.ko for region and mbochs.ko for dmabuf).    
> >>
> >> Erik Skultety brought up a good question today regarding how libvirt is
> >> meant to handle these different flavors of display interfaces and
> >> knowing whether a given mdev device has display support at all.  It
> >> seems that we cannot simply use the default display=auto because
> >> libvirt needs to specifically configure gl support for a dmabuf type
> >> interface versus not having such a requirement for a region interface,
> >> perhaps even removing the emulated graphics in some cases (though I
> >> don't think we have boot graphics through either solution yet).
> >> Additionally, GVT-g seems to need the x-igd-opregion support
> >> enabled(?), which is a non-starter for libvirt as it's an experimental
> >> option!
> >>
> >> Currently the only way to determine display support is through the
> >> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
> >> their own they'd need to get to the point where they could open the
> >> vfio device and perform the ioctl.  That means opening a vfio
> >> container, adding the group, setting the iommu type, and getting the
> >> device.  I was initially a bit appalled at asking libvirt to do that,
> >> but the alternative is to put this information in sysfs, but doing that
> >> we risk that we need to describe every nuance of the mdev device
> >> through sysfs and it becomes a dumping ground for every possible
> >> feature an mdev device might have.
> >>  
> 
> One or two sysfs file for each feature shouldn't be that much of over
> head? In kernel, other subsystem modules expose capability through
> sysfs, like PCI subsystem adds 'boot_vga' file for VGA device which
> returns 0/1 depending on if its boot VGA device. Similarly
> 'd3cold_allowed', 'msi_bus'...

Obviously we could add sysfs files, but unlike properties that the PCI
core exposes about struct pci_dev fields, the idea of a vfio_device is
much more abstract.  Each bus driver creates its own device
representation, so we have a top level vfio_device referencing through
an opaque pointer a vfio_pci_device, vfio_platform_device, or
mdev_device, and each mdev vendor driver creates its own private data
structure below the mdev_device.  So it's not quite a simple as one new
attribute "show" function to handle all devices of that bus_type.  We
need a consistent implementation in each bus driver and vendor driver
or we need to figure out how to percolate the information up to the
vfio core.  Your idea below seems to take the percolate approach.

> >> So I was ready to return and suggest that maybe libvirt should probe
> >> the device to know about these ancillary configuration details, but
> >> then I remembered that both mdev vGPU vendors had external dependencies
> >> to even allow probing the device.  KVMGT will fail to open the device
> >> if it's not associated with an instance of KVM and NVIDIA vGPU, I
> >> believe, will fail if the vGPU manager process cannot find the QEMU
> >> instance to extract the VM UUID.  (Both of these were bad ideas)  
> > 
> > Here's another proposal that's really growing on me:
> > 
> >  * Fix the vendor drivers!  Allow devices to be opened and probed
> >    without these external dependencies.
> >  * Libvirt uses the existing vfio API to open the device and probe the
> >    necessary ioctls, if it can't probe the device, the feature is
> >    unavailable, ie. display=off, no migration.
> >   
> 
> I'm trying to think simpler mechanism using sysfs that could work for
> any feature and knowing source-destination migration compatibility check
> by libvirt before initiating migration.
> 
> I have another proposal:
> * Add a ioctl VFIO_DEVICE_PROBE_FEATURES
> struct vfio_device_features {
>     __u32 argsz;
>     __u32 features;
> }
> 
> Define bit for each feature:
> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
> #define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
> 
> * Vendor driver returns bitmask of supported features during
> initialization phase.
> 
> * In vfio core module, trap this ioctl for each device  in
> vfio_device_fops_unl_ioctl(),

Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
blocking point with mdev drivers, we can't get a device fd, so we can't
call an ioctl on the device fd.

> check features bitmask returned by vendor
> driver and add a sysfs file if feature is supported that device. This
> sysfs file would return 0/1.

I don't understand why we have an ioctl interface, if the user can get
to the device fd then we have existing interfaces to probe these
things, it seems like you're just wanting to pass a features bitmap
through to vfio_add_group_dev() that vfio-core would expose through
sysfs, but a list of feature bits doesn't convey enough info except for
the most basic uses.

> For migration this bit will only indicate if host driver supports
> migration feature.
> 
> For source and destination compatibility check libvirt would need more
> data/variables to check like,
> * if same type of 'mdev_type' device create-able at destination,
>    i.e. if ('mdev_type'->available_instances > 0)
> 
> * if host_driver_version at source and destination are compatible.
> Host driver from same release branch should be mostly compatible, but if
> there are major changes in structures or APIs, host drivers from
> different branches might not be compatible, for example, if source and
> destination are from different branches and one of the structure had
> changed, then data collected at source might not be compatible with
> structures at destination and typecasting it to changed structures would
> mess up migrated data during restoration.

Of course now you're asking that libvirt understand the release
versioning scheme of every vendor driver and that it remain
programatically consistent.  We can't even do this with in-kernel
drivers.  And in the end, still the best we can do is guess.

> * if guest_driver_version is compatible with host driver at destination.
> For mdev devices, guest driver communicates with host driver in some
> form. If there are changes in structures/APIs of such communication,
> guest driver at source might not be compatible with host driver at
> destination.

And another guess plus now the guest driver is involved which libvirt
has no visibility to.

> 'available_instances' sysfs already exist, later two should be added by
> vendor driver which libvirt can use for migration compatibility check.

As noted previously, display and migration are not necessarily
mdev-only features, it's possible that vfio-pci or vfio-platform could
also implement these, so the sysfs interface cannot be restricted to
the mdev template and lifecycle interface.

One more try... we have a vfio_group fd.  This is created by the bus
drivers calling vfio_add_group_dev() and registers a struct device, a
struct vfio_device_ops, and private data.  Typically we only wire the
device_ops to the resulting file descriptor we get from
VFIO_GROUP_GET_DEVICE_FD, but could we enable sort of a nested ioctl
through the group fd?  The ioctl would need to take a string arg to
match to a device name, plus an ioctl cmd and arg for the device_ops
ioctl.  The group ioctl would need to filter cmds to known, benign
queries.  We'd also need to verify that the allowed ioctls have no
dependencies on setup done in device_ops.open().  *_INFO and
QUERY_GFX_PLANE ioctls would be the only candidates.  Bus drivers could
of course keep an open count in their private data so they know how the
ioctl is being called (if necessary) and the group fd only allows a
single open, so there's no risk that another user could interact with
the group in bad ways once the device is opened (and of course we use
file level access control on the group device file anyway).  This is
sort of a rethink of Paolo's suggestion of a read-only fd, but the fd
is the existing group fd and any references to the device would only be
held around the calling of the nested ioctl.  Could it work?  Thanks,

Alex