Re: [PATCH RFC] vfio: Revise and update the migration uAPI description

From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "cohuck@redhat.com" <cohuck@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"farman@linux.ibm.com" <farman@linux.ibm.com>,
	"mjrosato@linux.ibm.com" <mjrosato@linux.ibm.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC] vfio: Revise and update the migration uAPI description
Date: Tue, 18 Jan 2022 17:00:48 -0400	[thread overview]
Message-ID: <20220118210048.GG84788@nvidia.com> (raw)
In-Reply-To: <20220118125522.6c6bb1bb.alex.williamson@redhat.com>

On Tue, Jan 18, 2022 at 12:55:22PM -0700, Alex Williamson wrote:
> On Fri, 14 Jan 2022 15:35:14 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Clarify how the migration API works by recasting its specification from a
> > bunch of bits into a formal FSM. This describes the same functional
> > outcome, with no practical ABI change.
> 
> I don't understand why we're so reluctant to drop the previous
> specification and call this v2. 

I won't object - but I can't say it is really necessary.

> Yes, it's clever that the enum for the FSM matches the bit
> definitions, but we're also inserting previously invalid states as a
> standard part of the device flow... (see below)

This is completely invisible to userspace, if userspace never writes
the new states to device_state it can never read them back.

> > This is RFC because the full series is not fully tested yet, that should be
> > done next week. The series can be previewed here:
> > 
> >   https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> > 
> > The mlx5 implementation of the state change:
> > 
> >   https://github.com/jgunthorpe/linux/blob/0a6416da226fe8ee888aa8026f1e363698e137a8/drivers/vfio/pci/mlx5/main.c#L264
> > 
> > Has turned out very clean. Compare this to the v5 version, which is full of
> > subtle bugs:
> > 
> >   https://lore.kernel.org/kvm/20211027095658.144468-12-yishaih@nvidia.com/
> > 
> > This patch adds the VFIO_DEVICE_MIG_ARC_SUPPORTED ioctl:
> > 
> >   https://github.com/jgunthorpe/linux/commit/c92eff6c2afd1ecc9ed5c67a1f81c7f270f6e940
> > 
> > And this shows how the Huawei driver should opt out of P2P arcs:
> > 
> >   https://github.com/jgunthorpe/linux/commit/dd2571c481d27546a33ff4583ce8ad49847fe300
> 
> We've been bitten several times by device support that didn't come to
> be in this whole vfio migration effort.

Which is why this patch is for Huawei not mlx5..

> At some point later hns support is ready, it supports the migration
> region, but migration fails with all existing userspace written to the
> below spec.  I can't imagine that a device advertising migration, but it
> being essentially guaranteed to fail is a viable condition and we can't
> retroactively add this proposed ioctl to existing userspace binaries.
> I think our recourse here would be to rev the migration sub-type again
> so that userspace that doesn't know about devices that lack P2P won't
> enable migration support.

Global versions are rarely a good idea. What happens if we have three
optional things, what do you set the version to in order to get
maximum compatibility?

For the scenario you describe it is much better for qemu to call
VFIO_DEVICE_MIG_ARC_SUPPORTED on every single transition it intends to
use when it first opens the device. If any fail then it can deem the
device as having some future ABI and refuse to use it with migration.

> So I think this ends up being a poor example of how to extend the uAPI.
> An opt-out for part of the base specification is hard, it's much easier
> to opt-in P2P as a feature.

I'm not sure I understand this 'base specification'. 

My remark was how we took current qemu as an ABI added P2P to the
specification and defined it in a way that is naturally backwards
compatible and is still well specified.

> > +enum {
> > +	VFIO_MIG_FSM_MAX_STATE = VFIO_DEVICE_STATE_RUNNING_P2P + 1,
> > +};
> 
> We have existing examples of using VFIO_foo_NUM_bar in the vfio uAPI
> already, let's follow that lead.

OK

> > +/*
> > + * vfio_mig_get_next_state - Compute the next step in the FSM
> > + * @cur_fsm - The current state the device is in
> > + * @new_fsm - The target state to reach
> > + *
> > + * Return the next step in the state progression between cur_fsm and
> > + * new_fsm. This breaks down requests for complex transitions into
> > + * smaller steps and returns the next step to get to new_fsm. The
> > + * function may need to be called up to four times before reaching new_fsm.
> > + *
> > + * VFIO_DEVICE_STATE_ERROR is returned if the state transition is not allowed.
> > + */
> > +static u32 vfio_mig_get_next_state(u32 cur_fsm, u32 new_fsm)
> 
> Can we use a typedef somewhere for type checking?

An earlier draft of this used a typed enum, lets go back to that
then. It just seemed odd because the uapi struct comes in as u32 and
at some point it has to get casted to the enum.

> > +	/*
> > +	 * Make a best effort to restore things back to where we started.
> > +	 */
> > +	while (*cur_state != starting_state) {
> > +		next_state =
> > +			vfio_mig_get_next_state(*cur_state, starting_state);
> > +		if (next_state == VFIO_DEVICE_STATE_ERROR ||
> > +		    device->ops->migration_step_device_state(device,
> > +							     next_state)) {
> > +			*cur_state = VFIO_DEVICE_STATE_ERROR;
> > +			break;
> > +		}
> > +		*cur_state = next_state;
> > +	}
> 
> Core code "transitioning" the device to ERROR seems a little suspect
> here.  I guess you're imagining that driver code calling this with an
> pointer to internal state that it also tests on a non-zero return.

Unfortunately, ideally the migration state would be stored in struct
vfio_device, and ideally the way to transition states would be a core
ioctl, not a region thingy.

If it was an ioctl then I'd return a 'needs reset' and exact current
device state.

> Should we just step-device-state to ERROR to directly inform the driver?

That is certainly a valid choice, it may eliminate the very ugly
pointer argument too. I will try it out.

> > + *   RUNNING_P2P -> STOP
> > + *   STOP_COPY -> STOP
> > + *     While in STOP the device must stop the operation of the device.  The
> > + *     device must not generate interrupts, DMA, or advance its internal
> > + *     state. When stopped the device and kernel migration driver must accept
> > + *     and respond to interaction to support external subsystems in the
> > + *     STOP state, for example PCI MSI-X and PCI config pace. Failure by
> > + *     the user to restrict device access while in STOP must not result in
> > + *     error conditions outside the user context (ex. host system faults).
> > + *
> > + *     The STOP_COPY arc will close the data window.
> 
> These data window "sessions" should probably be described as a concept
> first.

Er, OK.. I think it is all well described now, your new data window
language is much clearer, but lets try a small paragraph at the start
linking the data window to the FSM..

> > + *   RESUMING -> STOP
> > + *     Leaving RESUMING closes the migration region data window and indicates
> > + *     the end of a resuming session for the device.  The kernel migration
> > + *     driver should complete the incorporation of data written to the
> > + *     migration data window into the device internal state and perform final
> > + *     validity and consistency checking of the new device state.  If the user
> > + *     provided data is found to be incomplete, inconsistent, or otherwise
> > + *     invalid, the migration driver must indicate a write(2) error and
> > + *     optionally go to the ERROR state as described below. The data window
> > + *     is closed.
> > + *
> > + *     While in STOP the device has the same behavior as other STOP states
> > + *     described above.
> 
> There is one STOP state, invariant of how the device arrives to it.

In FSM theory we have the idea of Moore and Mealy FSM designs, this
description is following Mealy principles because that best fits the
actual driver implementation.

Paraphrashed, the distinction is that Moore is strictly about the
current state, while Mealy considers the arc by witch the state was
reached.

What I expect is that the device will exhibit a Moore kind of behavior
- eg STOP is STOP, but the driver implementation must be fully
Mealy. Look at how mlx5 is setup, each and every test is looking at an
arc.

This is because the arc encodes important information,
PRE_COPY -> RUNNING has a very different driver implementation than
RUNNING_P2P -> RUNNING even though they both end up in RUNNING.

Yes, the device should exhibit the same continuous behavior while in
the RUNNING state, but we can still describe that behavior by only
documenting arcs.

If you recall I tried to mix Moore and Mealy descriptions in my last
effort and everyone thought that was really confusing, I don't think
this will fare any better.

In this case, RESUMING -> STOP is different than any other way to
reach STOP, and it does trigger special driver behavior that needed a
paragraph to describe. The difference only exists while executing this
arc, and then STOP is STOP.

I think sticking to one FSM style for the documentation is the right
answer here. How about instead of just listing the states at the very
start we have short list with a one line summary of the state?

> The reader can infer from the above statement that the behavior of a
> state may be flexible to how the device arrives at the state unless
> otherwise specified.

There are two different things, the continuous behavior while in a
state, which is not flexible, and the actions the device takes, which
is. Go back to my last document to see how this looks when we try to
split them and spell them out separately.

Anyhow, it seems you are fine with the approach. Yishai can make the
changes you noted and lets go forward next week.

Thanks,
Jason