From: Cornelia Huck <cohuck@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Jason Gunthorpe <jgg@nvidia.com>
Cc: Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC v2] vfio: Documentation for the migration region
Date: Mon, 06 Dec 2021 17:03:00 +0100
Message-ID: <87zgpdu3ez.fsf@redhat.com>
In-Reply-To: <20211203110619.1835e584.alex.williamson@redhat.com>

On Fri, Dec 03 2021, Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 1 Dec 2021 19:25:02 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Wed, Dec 01, 2021 at 01:03:14PM -0700, Alex Williamson wrote:
>> > On Tue, 30 Nov 2021 23:14:07 -0400
>> > Jason Gunthorpe <jgg@nvidia.com> wrote:
>> >   
>> > > On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:

>> OTOH "supportable" qemu could certainly make the default choice to
>> require devices for simplicity.
>
> I get a bit lost every time I try to sketch out how QEMU would
> implement it.  Forgive the stream of consciousness and rhetorical
> discussion below...
>
>  - Does it make sense that a device itself can opt out of p2p mappings?
>    This is simply an MMIO access from another device rather than from
>    the CPU.  Vendors cannot preclude this use case on bare metal;
>    should they be able to in a VM?  We'd create heterogeneous p2p maps:
>    device A can access device B, but device B perhaps cannot access
>    device A.  Seems troublesome.
>
>  - If we can't have a per-device policy, then we'd need a per VM
>    policy, likely some way to opt-out of all p2p mappings for vfio
>    devices.  We need to support hotplug of devices though, so is it a
>    per VM policy or is it a per device policy which needs to be
>    consistent among all attached devices?  Perhaps an
>    "enable_p2p=on/off" option on the vfio-pci device, defaulting to [on]
>    to match current behavior.  Whenever this option is set to the
>    non-default value, all devices would need to set it to the same
>    value, with non-compliant devices rejected.
>
>  - We could possibly allow migration=on,enable_p2p=on for a non-NDMA
>    device, but the rules change if additional devices are added: they
>    need to be rejected, or migration support implicitly disappears.
>    That seems hard to document and non-deterministic as far as the
>    user is concerned.  So maybe for a non-NDMA device we'd require
>    enable_p2p=off in order to set migration=on.  That means we could
>    never enable migration on non-NDMA devices by default, which
>    probably means we also cannot enable it by default on NDMA devices,
>    or we get user confusion/inconsistency.
>
>  - Can a user know which devices will require enable_p2p=off in order
>    to set migration=on?  "Read the error log" is a poor user experience
>    and a difficult hurdle for libvirt.
>
> So in order to create a predictable QEMU experience in the face of
> optional NDMA per device, I think we preclude being able to enable
> migration support for any vfio device by default, and we have an
> exercise to determine how a user or management tool could easily
> determine NDMA compatibility :-\

Hm, maybe we can take a page out of the confidential guest support book
and control this like we do for the virtio iommu flag?

I'm not sure whether there is a pressing need to support p2p for
non-NDMA devices.
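
(For concreteness, a "p2p mapping" here means QEMU taking the mmap of
one device's BAR and mapping it into the IOMMU container, so that every
other device in the container can DMA to it.  A rough userspace sketch
of that flow, with region discovery and error handling mostly elided,
and with the helper name being mine, not QEMU's:)

#include <stdint.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: how a peer BAR becomes DMA-reachable.  Opting out of
 * p2p would essentially mean skipping this VFIO_IOMMU_MAP_DMA call
 * for MMIO regions. */
static int map_peer_bar(int container_fd, int device_fd,
                        const struct vfio_region_info *bar,
                        uint64_t guest_iova)
{
    void *va = mmap(NULL, bar->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    device_fd, bar->offset);
    if (va == MAP_FAILED)
        return -1;

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)va,
        .iova  = guest_iova,   /* where the BAR sits in guest memory */
        .size  = bar->size,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}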

> There's a fine line between writing an inter-operable driver and
> writing a driver for the current QEMU implementation.  Obviously we want
> to support the current QEMU implementation, but we want an interface
> that can accommodate how that implementation might evolve.  Once we
> start telling driver authors to expect specific flows rather than
> looking at the operation of each bit, our implementations become
> more fragile and less versatile from the user's perspective.

What is actually on the table regarding the current QEMU implementation?
The interface is still marked as experimental, so we have some room for
changes, but we probably don't want to throw everything away.

>
>> > > > Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
>> > > > so a driver needs to be prepared for such an attempted state change
>> > > > either way.  So what's the advantage to telling a driver author that
>> > > > they can expect a given behavior?    
>> > > 
>> > > The above doesn't tell a driver author to expect a certain behavior; it
>> > > tells userspace what to do.  
>> > 
>> >   "The migration driver can rely on user-space issuing a
>> >    VFIO_DEVICE_RESET prior to starting RESUMING."  
>> 
>> I trimmed too much, the original text you quoted was
>> 
>> "To abort a RESUMING issue a VFIO_DEVICE_RESET."
>> 
>> Which I still think is fine.
>
> If we're writing a specification, that's really a MAY statement:
> userspace MAY issue a reset to abort the RESUMING process and return
> the device to RUNNING.  They MAY also write the device_state directly,
> which MAY return an error depending on various factors such as whether
> data has been written to the migration state and whether that data is
> complete.  If a failed transition results in an ERROR device_state,
> the user MUST issue a reset in order to return it to a RUNNING state
> without closing the interface.

Are we actually writing a specification? If yes, we need to be more
clear on what is mandatory (MUST), advised (SHOULD), or allowed
(MAY). If I look at the current proposal, I'm not sure into which
category some of the statements fall.
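
To make that concrete, the two MAY paths for aborting RESUMING might
look like this from userspace (a rough sketch against the current v1
region layout; mig_offset, the migration region's offset within the
device fd, is assumed to come from VFIO_DEVICE_GET_REGION_INFO):

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Sketch only: abort RESUMING and return to RUNNING. */
static int abort_resuming(int device_fd, off_t mig_offset)
{
    uint32_t state = VFIO_DEVICE_STATE_RUNNING;
    off_t off = mig_offset +
                offsetof(struct vfio_device_migration_info, device_state);

    /* MAY: write device_state directly; the driver MAY fail this,
     * e.g. if partially written migration data cannot be discarded. */
    if (pwrite(device_fd, &state, sizeof(state), off) == sizeof(state))
        return 0;

    /* MAY (and, per the above, MUST once the device is in ERROR):
     * reset to get back to a clean RUNNING state without closing
     * the device fd. */
    return ioctl(device_fd, VFIO_DEVICE_RESET);
}

Writing down which of these paths a driver may reject, and when, is
exactly the sort of thing those categories would pin down.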

>
> A recommendation to use reset to skip over potential error conditions,
> when the goal is simply a new, clean RUNNING state irrespective of data
> written to the migration region, is fine.  But drivers shouldn't be
> written with only that expectation, just like they shouldn't expect a
> reset as a precondition for entering RESUMING.
>  
>> > Tracing that shows a reset preceding entering RESUMING doesn't suggest
>> > to me that QEMU is performing a reset for the specific purpose of
>> > entering RESUMING.  Correlation != causation.  
>> 
>> Kernel doesn't care why qemu did it - it was done. Intent doesn't
>> matter :)
>
> This is exactly the sort of "designed for QEMU implementation"
> inter-operability that I want to avoid.  It doesn't take much of a
> crystal ball to guess that gratuitous and redundant device resets slow
> VM instantiation and are a likely target for optimization.

Which brings me back to my question above: Are we wedded to all the
details of the current QEMU implementation? Should we treat some of them
more as a MAY? Can we change QEMU to do things differently, given the
experimental nature of the support?

