From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alex Williamson <alex.williamson@redhat.com>
Subject: Re: [libvirt] [PATCH 0/3] sample: vfio mdev display devices.
Date: Fri, 27 Apr 2018 11:21:25 -0600
Message-ID: <20180427112125.4e71f1ea@t450s.home>
References: <20180409103513.8020-1-kraxel@redhat.com>
	<20180418123153.0f4f037d@w520.home>
	<20180423154003.12c5467a@w520.home>
	<a5f4ec49-d5aa-e853-03ad-7ca9d7e38206@nvidia.com>
	<20180424165918.5c2ef037@w520.home>
	<0a1d6487-0dfb-2ffc-4774-ebaf65c15892@nvidia.com>
	<20180425120057.0fabb70e@w520.home> <20180425195229.GK2496@work-vm>
	<a20611c2-071d-7611-bf1c-c9998ec3a462@nvidia.com>
	<20180426185522.GQ2631@work-vm>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: Neo Jia <cjia@nvidia.com>, kvm@vger.kernel.org,
	Erik Skultety <eskultet@redhat.com>, libvirt <libvir-list@redhat.com>,
	Tina Zhang <tina.zhang@intel.com>, Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>, Laine Stump <laine@redhat.com>,
	Jiri Denemark <jdenemar@redhat.com>, intel-gvt-dev@lists.freedesktop.org
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Return-path: <libvir-list-bounces@redhat.com>
In-Reply-To: <20180426185522.GQ2631@work-vm>
List-Unsubscribe: <https://www.redhat.com/mailman/options/libvir-list>,
	<mailto:libvir-list-request@redhat.com?subject=unsubscribe>
List-Archive: <https://www.redhat.com/archives/libvir-list>
List-Post: <mailto:libvir-list@redhat.com>
List-Help: <mailto:libvir-list-request@redhat.com?subject=help>
List-Subscribe: <https://www.redhat.com/mailman/listinfo/libvir-list>,
	<mailto:libvir-list-request@redhat.com?subject=subscribe>
Sender: libvir-list-bounces@redhat.com
Errors-To: libvir-list-bounces@redhat.com
List-Id: kvm.vger.kernel.org

On Thu, 26 Apr 2018 19:55:23 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Kirti Wankhede (kwankhede@nvidia.com) wrote:
> > 
> > 
> > On 4/26/2018 1:22 AM, Dr. David Alan Gilbert wrote:  
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > >> On Wed, 25 Apr 2018 21:00:39 +0530
> > >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>  
> > >>> On 4/25/2018 4:29 AM, Alex Williamson wrote:  
> > >>>> On Wed, 25 Apr 2018 01:20:08 +0530
> > >>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>>     
> > >>>>> On 4/24/2018 3:10 AM, Alex Williamson wrote:    
> > >>>>>> On Wed, 18 Apr 2018 12:31:53 -0600
> > >>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
> > >>>>>>       
> > >>>>>>> On Mon,  9 Apr 2018 12:35:10 +0200
> > >>>>>>> Gerd Hoffmann <kraxel@redhat.com> wrote:
> > >>>>>>>      
> > >>>>>>>> This little series adds three drivers, for demo-ing and testing vfio
> > >>>>>>>> display interface code.  There is one mdev device for each interface
> > >>>>>>>> type (mdpy.ko for region and mbochs.ko for dmabuf).        
> > >>>>>>>
> > >>>>>>> Erik Skultety brought up a good question today regarding how libvirt is
> > >>>>>>> meant to handle these different flavors of display interfaces and
> > >>>>>>> knowing whether a given mdev device has display support at all.  It
> > >>>>>>> seems that we cannot simply use the default display=auto because
> > >>>>>>> libvirt needs to specifically configure gl support for a dmabuf type
> > >>>>>>> interface versus not having such a requirement for a region interface,
> > >>>>>>> perhaps even removing the emulated graphics in some cases (though I
> > >>>>>>> don't think we have boot graphics through either solution yet).
> > >>>>>>> Additionally, GVT-g seems to need the x-igd-opregion support
> > >>>>>>> enabled(?), which is a non-starter for libvirt as it's an experimental
> > >>>>>>> option!
> > >>>>>>>
> > >>>>>>> Currently the only way to determine display support is through the
> > >>>>>>> VFIO_DEVICE_QUERY_GFX_PLANE ioctl, but for libvirt to probe that on
> > >>>>>>> their own they'd need to get to the point where they could open the
> > >>>>>>> vfio device and perform the ioctl.  That means opening a vfio
> > >>>>>>> container, adding the group, setting the iommu type, and getting the
> > >>>>>>> device.  I was initially a bit appalled at asking libvirt to do that,
> > >>>>>>> but the alternative is to put this information in sysfs, but doing that
> > >>>>>>> we risk that we need to describe every nuance of the mdev device
> > >>>>>>> through sysfs and it becomes a dumping ground for every possible
> > >>>>>>> feature an mdev device might have.  
> > >> ...      
> > >>>>>>> So I was ready to return and suggest that maybe libvirt should probe
> > >>>>>>> the device to know about these ancillary configuration details, but
> > >>>>>>> then I remembered that both mdev vGPU vendors had external dependencies
> > >>>>>>> to even allow probing the device.  KVMGT will fail to open the device
> > >>>>>>> if it's not associated with an instance of KVM and NVIDIA vGPU, I
> > >>>>>>> believe, will fail if the vGPU manager process cannot find the QEMU
> > >>>>>>> instance to extract the VM UUID.  (Both of these were bad ideas)      
> > >>>>>>
> > >>>>>> Here's another proposal that's really growing on me:
> > >>>>>>
> > >>>>>>  * Fix the vendor drivers!  Allow devices to be opened and probed
> > >>>>>>    without these external dependencies.
> > >>>>>>  * Libvirt uses the existing vfio API to open the device and probe the
> > >>>>>>    necessary ioctls, if it can't probe the device, the feature is
> > >>>>>>    unavailable, ie. display=off, no migration.
> > >>>>>>       
> > >>>>>
> > >>>>> I'm trying to think simpler mechanism using sysfs that could work for
> > >>>>> any feature and knowing source-destination migration compatibility check
> > >>>>> by libvirt before initiating migration.
> > >>>>>
> > >>>>> I have another proposal:
> > >>>>> * Add an ioctl VFIO_DEVICE_PROBE_FEATURES
> > >>>>> struct vfio_device_features {
> > >>>>>     __u32 argsz;
> > >>>>>     __u32 features;
> > >>>>> };
> > >>>>>
> > >>>>> Define bit for each feature:
> > >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_REGION	(1 << 0)
> > >>>>> #define VFIO_DEVICE_FEATURE_DISPLAY_DMABUF	(1 << 1)
> > >>>>> #define VFIO_DEVICE_FEATURE_MIGRATION		(1 << 2)
> > >>>>>
> > >>>>> * Vendor driver returns bitmask of supported features during
> > >>>>> initialization phase.
> > >>>>>
> > >>>>> * In vfio core module, trap this ioctl for each device  in
> > >>>>> vfio_device_fops_unl_ioctl(),    
> > >>>>
> > >>>> Whoops, chicken and egg problem, VFIO_GROUP_GET_DEVICE_FD is our
> > >>>> blocking point with mdev drivers, we can't get a device fd, so we can't
> > >>>> call an ioctl on the device fd.
> > >>>>     
> > >>>
> > >>> I'm sorry, I thought we could expose features when QEMU initialize, but
> > >>> libvirt needs to know supported features before QEMU initialize.
> > >>>
> > >>>  
> > >>>>> check features bitmask returned by vendor
> > >>>>> driver and add a sysfs file if feature is supported that device. This
> > >>>>> sysfs file would return 0/1.    
> > >>>>
> > >>>> I don't understand why we have an ioctl interface, if the user can get
> > >>>> to the device fd then we have existing interfaces to probe these
> > >>>> things, it seems like you're just wanting to pass a features bitmap
> > >>>> through to vfio_add_group_dev() that vfio-core would expose through
> > >>>> sysfs, but a list of feature bits doesn't convey enough info except for
> > >>>> the most basic uses.
> > >>>>      
> > >>>
> > >>> Yes, vfio_add_group_dev() seems to be better way to convey features to
> > >>> vfio core.
> > >>>  
> > >>>>> For migration this bit will only indicate if host driver supports
> > >>>>> migration feature.
> > >>>>>
> > >>>>> For source and destination compatibility check libvirt would need more
> > >>>>> data/variables to check like,
> > >>>>> * if same type of 'mdev_type' device create-able at destination,
> > >>>>>    i.e. if ('mdev_type'->available_instances > 0)
> > >>>>>
> > >>>>> * if host_driver_version at source and destination are compatible.
> > >>>>> Host driver from same release branch should be mostly compatible, but if
> > >>>>> there are major changes in structures or APIs, host drivers from
> > >>>>> different branches might not be compatible, for example, if source and
> > >>>>> destination are from different branches and one of the structure had
> > >>>>> changed, then data collected at source might not be compatible with
> > >>>>> structures at destination and typecasting it to changed structures would
> > >>>>> mess up migrated data during restoration.    
> > >>>>
> > >>>> Of course now you're asking that libvirt understand the release
> > >>>> versioning scheme of every vendor driver and that it remain
> > >>>> programmatically consistent.  We can't even do this with in-kernel
> > >>>> drivers.  And in the end, still the best we can do is guess.
> > >>>>    
> > >>>
> > >>> Libvirt doesn't need to understand the version, libvirt need to do
> > >>> strcmp version string from source and destination. If those are equal,
> > >>> then libvirt would understand that they are compatible.  
> > >>
> > >> Who's to say that the driver version and migration compatibility have
> > >> any relation at all?  Some drivers might focus on designing their own
> > >> migration interface that can maintain compatibility across versions
> > >> (QEMU does this), some drivers may only allow identical version
> > >> migration (which is going to frustrate upper level management tools and
> > >> customers - RHEL goes to great extents to support cross version
> > >> migration).  We cannot have a one size fits all here that driver version
> > >> defines completely the migration compatibility.  
> > > 
> > > I'll agree; I don't know enough about these devices, but to give you
> > > some example of things I'd expect to work:
> > >    a) User adds new machines to their data centre with larger/newer
> > > version of the same vendors GPU; in some cases that should work
> > > (depending on vendor details etc)
> > >    b) The same thing but with identical hardware but a newer driver on
> > > the destination.
> > > 
> > > Obviously there will be some cut offs that say some versions are
> > > incompatible;  but for normal migration we jump through serious hoops
> > > to make sure stuff works; customers will expect the same with some
> > > VFIO devices.
> > >   
> > 
> > How libvirt checks that cut off where some versions are incompatible?  
> 
> 
> We have versioned 'machine types' - so for example QEMU has
>   pc-i440fx-2.11
>   pc-i440fx-2.10
> 
> machine types; any version of qemu that supports machine type
> pc-i440fx-2.10 should behave the same to its emulated devices.
> If we change the behaviour then we tie it to the new machine type;
> so the behaviour of a device in pc-i440fx-2.11 might be a bit different.
> Occasionally we'll kill off old machine types; (actually we should do it
> more!) - but certainly when we do downstream versions we tie it to
> machine types as well.
> 
> We also have some migration-capability flags, so some features can only
> be used if both sides have that flag, and also Libvirt has some checking
> of host CPU flags.

I think this sort of host compatibility checking for CPU flags is the
part where we need some libvirt input on how they'd like to extend it
to device compatibility.  A complication here is whether it's
reasonable for libvirt to collect migration compatibility data from
anything other than the actual target device.  For instance, if the
user model is to create mdev devices on demand, the vendor driver might
be upgraded between system startup and migration.  I don't think we can
assume the migration information remains static, or that it's
necessarily the same for each mdev type provided by the vendor driver,
or maybe even for each parent device.  Is it possible that libvirt
would evaluate a migration target device to this extent immediately
before the migration?  How would OpenStack handle managing a datacenter
with such a model?
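
To make the CPU flag analogy a bit more concrete, here is a rough
sketch of what a flag-style check could look like for a device.
Everything in it is hypothetical: the struct, the feature bits, and the
notion that a vendor driver would report such a bitmask at all.  The
only point is that the target is acceptable when it supports at least
what the source device is using, and that the values would have to be
read from the actual target device right before migration rather than
cached at startup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, nothing like this exists in vfio today. */
#define DEV_MIG_FEAT_BASE      (1u << 0)
#define DEV_MIG_FEAT_DIRTY_LOG (1u << 1)
#define DEV_MIG_FEAT_COMPRESS  (1u << 2)

/* Hypothetical per-device migration description. */
struct dev_mig_caps {
	uint32_t in_use;    /* features the running device relies on */
	uint32_t supported; /* features the device is able to accept */
};

/*
 * The target is compatible when every feature the source is using is
 * also supported by the target.  Extra target features are harmless.
 */
static bool dev_mig_compatible(const struct dev_mig_caps *src,
			       const struct dev_mig_caps *dst)
{
	assert(src && dst);
	return (src->in_use & ~dst->supported) == 0;
}
```

Unlike a strcmp of version strings, a subset test like this lets a
newer target accept an older source, which is closer to how the CPU
flag checking behaves.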

> > >>>>> * if guest_driver_version is compatible with host driver at destination.
> > >>>>> For mdev devices, guest driver communicates with host driver in some
> > >>>>> form. If there are changes in structures/APIs of such communication,
> > >>>>> guest driver at source might not be compatible with host driver at
> > >>>>> destination.    
> > >>>>
> > >>>> And another guess plus now the guest driver is involved which libvirt
> > >>>> has no visibility to.
> > >>>>      
> > >>>
> > >>> Like above libvirt need to do strcmp.  
> > >>
> > >> Insufficient, imo
> > >>  
> > >>>>> 'available_instances' sysfs already exist, later two should be added by
> > >>>>> vendor driver which libvirt can use for migration compatibility check.    
> > >>>>
> > >>>> As noted previously, display and migration are not necessarily
> > >>>> mdev-only features, it's possible that vfio-pci or vfio-platform could
> > >>>> also implement these, so the sysfs interface cannot be restricted to
> > >>>> the mdev template and lifecycle interface.
> > >>>>     
> > >>>
> > >>> I agree.
> > >>> Feature bitmask passed to vfio core is not mdev specific. But here
> > >>> 'available_instances' for migration compatibility check is mdev
> > >>> specific. If mdev device is not create-able at destination, there is no
> > >>> point in initiating migration by libvirt.  
> > >>
> > >> 'available_instances' for migration compatibility check...?  We use
> > >> available_instances to know whether we have the resources to create a
> > >> given mdev type.  It's certainly a prerequisite to have a device of the
> > >> identical type at the migration target and how we define what is an
> > >> identical device for a directly assigned PCI device is yet another
> > >> overly complicated rat hole.  But an identical device doesn't
> > >> necessarily imply migration compatibility and I think that's the
> > >> problem we're tackling.  We cannot assume based only on the device type
> > >> that migration is compatible, that's basically saying we're never going
> > >> to have any bugs or oversights or new features in the migration stream.  
> > > 
> > > Those things certainly happen; state that we forgot to transfer, new
> > > features enables on devices, devices configured in different ways.
> > >   
> > 
> > How libvirt checks migration compatibility for other devices across QEMU
> > versions where source support a device and destination running with
> > older QEMU version doesn't support that device or that device doesn't
> > exist in that system?  
> 
> Libvirt inspects the qemu to get lists of devices and capabilities; I'll
> leave it to the libvirt guys to add more detail if needed.

Right, so do we need a way to invoke QEMU with a device to report the
migration capabilities of that device?  To this point, I think the
migration viability of a target system has been entirely encompassed
within QEMU's ability to support the versioned machine type and the
compatibility of CPU flags.  Devices have not been considered, as their
compatibility is guaranteed within a machine type and version.  Thanks,

Alex