Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: cohuck@redhat.com, corbet@lwn.net, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, farman@linux.ibm.com,
	mjrosato@linux.ibm.com, pasic@linux.ibm.com
Subject: Re: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
Date: Thu, 6 Jan 2022 17:20:57 -0400	[thread overview]
Message-ID: <20220106212057.GM2328285@nvidia.com> (raw)
In-Reply-To: <20220106111718.0e5f5ed6.alex.williamson@redhat.com>

On Thu, Jan 06, 2022 at 11:17:18AM -0700, Alex Williamson wrote:

> > Honestly, I have no interest in implementing something so complicated
> > for what should be a simple operation. We have no use case for this,
> > no desire to test it, it is just pure kernel cruft and complexity to
> > do this kind of extra work.
> 
> Migration is not a simple operation, we have a device with a host
> kernel driver exporting and importing device state through userspace.
> It's a playground for potential exploit vectors.  I assume this also
> means that you're not performing any sort of validation that the
> incoming data is from a compatible device or providing any versioning
> accommodations in the data stream, all features that would generally be
> considered best practices.

Our architecture has FW code performing all validation and versioning.

The driver is a dumb pipe and works on a simple snapshot basis.

The snapshot basis is the same approach the huawei driver and the 3
other drivers we have in internal development use.

The idea of forcing every snapshot style driver into some complicated
streaming scheme is bad, IMHO. We should be having code sharing here,
see below please.

> It's the uAPI as I understand it.  If you want a new uAPI, propose
> one.

Oh. I thought that was effectively what we are doing. So let's be
clear about it.

We are proposing a 'new API'. It will be binary and backwards
compatible with current qemu.

> > It can't. We define the API to be exactly the permited arcs and no
> > others. That is what simplify means.
> 
> IOW, define the uAPI based on what happens to be the current QEMU
> implementation and limitations.  That's exactly what we were trying to
> avoid in the uAPI design.

I don't like the characterization. Selecting only the arcs that are
useful, and well defined just happens to overlap with most of the
the arcs that qemu uses exactly because qemu is using the useful
subset.

The stuff beyond that is possibly veering into over-engineering. It
was OK when it looked like those behaviors would fall out naturally,
eg via precedence/etc, but now that we fully understand that
implementing the unused part has a real implementation cost, it is not
appealing.

> > If we need to add new FSM arcs and new states later then their support
> > can be exposed via a cap flag.
> > 
> > This is much better than trying to define all arcs as legal, only
> > testing/using a small subset and hoping the somehow in the future this
> > results in extensible interoperability.
> 
> A proposal of which states transitions you want to keep would be useful
> here.

Yishai is working it out currently in code, but you can go back to the
v1 to see where we started from on this thinking.

It looks like we end up with about 8 possible valid FSM states out of
the 16 combinatins of bits and your analysis looks about about right
on the arcs, but NDMA will be included.

We were also thinking to retain STOP. SAVING -> STOP could possibly
serve as the abort path to avoid a double action, and some of the use
cases you ID'd below are achievable if STOP remains.

> We have 20 possible transitions.  I've marked those available via the
> "odd ascii art" diagram as (a), that's 7 transitions.  We could
> essentially remove the NULL state as unreachable as there seems little
> value in the 2 transitions marked (a)* if we look only at migration,
> that would bring us down to 5 of 12 possible transitions.  We need to
> give userspace an abort path though, so we minimally need the 2
> transitions marked (b) (7/12).

> So now we can discuss the remaining 5 transitions:
> 
> {SAVING} -> {RESUMING}
> 	If not supported, user can achieve this via:
> 		{SAVING}->{RUNNING}->{RESUMING}
> 		{SAVING}-RESET->{RUNNING}->{RESUMING}

This can be:

SAVING -> STOP -> RESUMING

> 	It would likely be dis-recommended to return a device to
> 	{RUNNING} for this use case, so the latter would be preferred.

Yes, I would drop this, there is no advantage compared to manually
going to STOP, or through RESET.

> {SAVING} -> {RUNNING|SAVING}
> 	If not supported, user can achieve this via:
> 		{SAVING}->{RUNNING}->{RUNNING|SAVING}
> 
> 	Potential use case: downtime exceeded, avoid full migration
> 	restart (likely not achieved with the alternative flow).

I'm sympathetic to your use case, but this is not natural, or useful
for the snapshot/non-precopy style drivers to implement. Without
support for PRECOPY there is no functional advantage to return to
RUNNING|SAVING.

It is much better if qemu signals the far side to abort the migration,
move the local device to RUNNING and the far device through RESET ->
RESUMING to start all over again. This becomes common code for all the
snapshot drivers to rely on so we don't have to implement this
recovery logic with tricks in each driver inside the migration stream.

Shared code is better.

If someone does think this usecase is important they can add a cap
flag and implement this arc along with the qemu support/etc for that
driver. To be meaningful it would have to be along with a device that
implements PRECOPY with dirty tracking, and a streaming post copy.

So I would discard this arc, at least from the mandatory set.

> {RESUMING} -> {SAVING}
> 	If not supported:
> 		{RESUMING}->{RUNNING}->{SAVING}
> 		{RESUMING}-RESET->{RUNNING}->{SAVING}

The simplified version would be:

    RESUMING -> STOP -> SAVING

> 	Potential use case: validate migration data in = data out (also
> 	cannot be achieved with alternative flow, as device passes
> 	through RUNNING)

No sure mlx5 will guarantee idempotent migration data. It looks like
the Huawei device will be able to do this, and some of the other
simplish device types we are working on could possibly too.

So I'd drop this, STOP is good enough.

> {RESUMING} -> {RUNNING|SAVING}
> 	If not supported:
> 		{RESUMING}->{RUNNING}->{RUNNING|SAVING}
> 
> 	Potential use case: live ping-pong migration (alternative flow
> 	is likely sufficient)

STOP can be used here too, I would drop this since it is redundant.

> {RUNNING|SAVING} -> {RESUMING}
> 	If not supported:
> 		{RUNNING|SAVING}->{RUNNING}->{RESUMING}
> 		{RUNNING|SAVING}-RESET->{RUNNING}->{RESUMING}
> 	(again former is likely dis-recommended)
> 
> 	Potential use case: ???

Yep, discard.

> So what's the proposal?  Do we ditch all of these?  Some of these?  

Yes, all of them.

Including the NDMA states we add 2 new FSM states and 7 new arcs,
IIRC:

   SAVING | RUNNING -> SAVING | RUNNING | NDMA
   SAVING | RUNNING | NDMA -> SAVING
   SAVING | RUNNING | NDMA -> RUNNING
   SAVING | RUNNING | NDMA -> STOP

and

   RESUMING -> RUNNING | NDMA
   RUNNING | NDMA -> RUNNING
   RUNNING | NDMA -> STOP

> drivers follow the previously provided pseudo algorithm with the
> requirement that they cannot pass through an invalid state, we need to
> formally address whether the NULL state is invalid or just not
> reachable by the user.

What is a NULL state?

We have defined (from memory, forgive me I don't have access to
Yishai's latest code at the moment) 8 formal FSM states:

 RUNNING
 PRECOPY
 PRECOPY_NDMA
 STOP_COPY
 STOP
 RESUMING
 RESUMING_NDMA
 ERROR (perhaps MUST_RESET)

Which matches the state labels already given in the header comment.

Mapped onto the 4 device_state bits in an 'ABI compatible with current
qemu' way as the current header comment describes.

Then the list of allowed arcs between them close to what you have
suggested.

IMHO this substantially conforms to what is written down in the header
comment. Consistently using the state labels in code and documentation
then eliminating the named bits will conclude the change in
specification from bits to a FSM.

Yishai, we should also recast the cap discovery as some ioctl to query
directly if the driver supports an arc. Eg we can discover if NDMA is
supported by querying support for PRECOPY -> PRECOPY_NDMA, and if your
timeout use case is supported by querying support for STOP_COPY ->
PRECOPY.

Any future new states/arcs will cleanly fit into this discovery
scheme.

> > Huh? If the device indicated error during RESUMING userspace should
> > probably stop shoving packets into it or it will possibly corrupt the
> > migration stream.
> 
> If a {RESUMING}->{RUNNING} transition fails and the device remains in
> {RESUMING}, it should be valid for userspace to push data to it.  

IMHO this is not useful. If the userspace did RESUMING -> RUNNING then
that is 'end of stream' and continuning to push data is
nonsensical. It is one of the cases where design wise it is much
clearer to just say any exit from RESUMING either succeeds or
userspace needs to reset the device, eg ERROR.

Again, my perspective here is multi-device interoperability and I just
don't think we can rely on a consistent device behavior in such a
strange corner case.

> The vector table is directly accessible via the region mmap.  It
> previously was not, but that becomes a problem with 64k page sizes, and
> even some poorly designed devices on 4k systems when they don't honor
> the PCI spec recommended alignments.  

Yes, I forgot about that.

> But I think that's beside the point, if the user has vectors pointed
> at memory or other devices, they've rather already broken their
> contract for using the device.

Yes, so long as it doesn't allow to compromise the hypervisor
integrity.

> But I think you've identified two classes of DMA, MSI and everything
> else.  The device can assume that an MSI is special and not included in
> NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
> If we define that the minimum requirement for multi-device migration is
> to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> actually significantly more restrictive while it's enabled.

You are right, but in any practical physical device NDMA will be
implemented by halting all DMAs, not just p2p ones.

I don't mind what we label this, so long as we understand that halting
all DMA is a valid device implementation.

Actually, having reflected on this now, one of the things on my list
to fix in iommufd, is that mdevs can get access to P2P pages at all.

This is currently buggy as-is because they cannot DMA map these
things, touch them with the CPU and kmap, or do, really, anything with
them.

So we should be blocking mdev's from accessing P2P, and in that case a
mdev driver can quite rightly say it doesn't support P2P at all and
safely NOP NDMA if NDMA is defined to only impact P2P transactions.

Perhaps that answers the question for the S390 drivers as well.

> Should a device in the ERROR state continue operation or be in a
> quiesced state?  It seems obvious to me that since the ERROR state is
> essentially undefined, the device should cease operations and be
> quiesced by the driver.  If the device is continuing to operate in the
> previous state, why would the driver place the device in the ERROR
> state?  It should have returned an errno and left the device in the
> previous state.

What we found while implementing is the use of ERROR arises when the
driver has been forced to do multiple actions and is unable to fully
unwind the state. So the device_state is not fully the original state
or fully the target state. Thus it is ERROR.

The additional requirement that the driver do another action to
quiesce the device only introduces the possiblity for triple failure.

Since it doesn't bring any value to userspace, I prefer we not define
things in this complicated way.

> > I prefer a model where the device is allowed to keep doing whatever it
> > was doing before it hit the error. You are pushing for a model where
> > upon error we must force the device to stop.
> 
> If the device continues operating in the previous mode then it
> shouldn't enter the ERROR state, it should return errno and re-reading
> device_state should indicate it's in the previous state.

Continues operating in the new/previous state is an 'upper bound' on
what it is allowed to do. For instance if we are going from RUNNING ->
SAVING mlx5 must transit through 'RUNNING|NDMA' as part of its
internal design.

If it then fails it can't continue to pretend it is RUNNING when it is
doing RUNNING|NDMA and a double failure means we can't restore
RUNNING.

Allowing ERROR to continue any behavior allowed by RUNNING allows the
device to be left in RUNNING|NDMA and eliminates the possiblity of
triple failure in a transition to 'STOP'.

Indeed we can simplify the driver by removing failure recoveries for
cases that have a double fault and simply go to ERROR. This is not so
viable if ERROR itself requires an action to enter it as we get back
to having to deal with double and triple faults.

Yishai said he should have something to look at next week. We'll take
a stab at rewording the docs you provided to reflect a FSM approach
too.

Jason