Re: [PATCH RFC] vfio: Documentation for the migration region

From: Jason Gunthorpe <jgg@nvidia.com>
To: Cornelia Huck <cohuck@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, kvm@vger.kernel.org,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC] vfio: Documentation for the migration region
Date: Tue, 23 Nov 2021 12:53:52 -0400	[thread overview]
Message-ID: <20211123165352.GA4670@nvidia.com> (raw)
In-Reply-To: <87zgpvj6lp.fsf@redhat.com>

On Tue, Nov 23, 2021 at 03:21:06PM +0100, Cornelia Huck wrote:
> On Mon, Nov 22 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > Provide some more complete documentation for the migration region's
> > behavior, specifically focusing on the device_state bits and the whole
> > system view from a VMM.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
> >  1 file changed, 207 insertions(+), 1 deletion(-)
> >
> > Alex/Cornelia, here is the first draft of the requested documentation I promised
> 
> Thanks, I'm taking a quick first look.
> 
> [As it's Thanksgiving week in the US, I don't think Alex will be reading
> this right now.]

Sure. I've made a lot of changes already so I'll try to post a v2 next
week

> > +The device_state triggers device action both when bits are set/cleared and
> > +continuous behavior for each bit.
> 
> I had trouble parsing this sentence, until I read further down... maybe
> use something like the following:
> 
> The device_state controls both device action and continuous behaviour.
> Setting/clearing a bit triggers device action, and each bit controls
> continuous behaviour.

OK

> > For VMMs they can also control if the VCPUs in
> > +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
> > +TRACKING). These two controls are not part of the device_state register, KVM
> > +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
> > +container controls dirty tracking.
> 
> We usually try to keep kvm out of documentation for the vfio
> interfaces; better frame that as an example?

It is important, we can't clearly explain how things like PRI work or
NDMA without talking about a 'VCPU' concept. I think this is a case
where trying to be general is only going to hurt understandability.

Lets add some more text like 'VCPU RUNNING is used to model the
ability of the VFIO userspace to mutate the device. For VMM cases this
would be mapped to a KVM control on a VCPU, but non VMMs must also
similarly suspend their use of the VFIO device in !VCPU_RUNNING'

> > +Along with the device_state the migration driver provides a data window which
> > +allows streaming migration data into or out of the device.
> > +
> > +A lot of flexibility is provided to userspace in how it operates these bits. The
> > +reference flow for saving device state in a live migration, with all features:
> 
> It may also vary depending on the device being migrated (a subchannel
> passed via vfio-ccw will behave differently than a pci device.)

I don't think I like this statement - why/where would the overall flow
differ?

> > +  RUNNING, VCPU_RUNNING
> 
> Nit: everywhere else you used "VCPU RUNNING".

Oh, lets stick with the _ version then

> Also, can we separate device state bits as defined in vfio.h and VMM
> state bits visually a bit :) better?

Any idea? I used | for the migration_state and , for the externa ones.

> > +     Normal operating state
> > +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> > +     Log DMAs
> > +     Stream all memory
> 
> all memory accessed by the device?

In this reference flow this is all VCPU memory. Ie you start global
dirty tracking both in VFIO and in the VCPU and start copying all VM
memory.

> > +Actions on Set/Clear:
> > + - SAVING | RUNNING
> > +   The device clears the data window and begins streaming 'pre copy' migration
> > +   data through the window. Device that cannot log internal state changes return
> > +   a 0 length migration stream.
> 
> Hm. This and the following are "combination states", i.e. not what I'd
> expect if I read about setting/clearing bits. 

If you examine the mlx5 driver you'll see this is how the logic is
actually implemented. It is actually very subtly complicated to
implement this properly. I also added this text:

 When multiple bits change in the migration_state the migration driver must
 process them in a priority order:

 - SAVING | RUNNING
 - !RUNNING
 - !NDMA
 - SAVING | !RUNNING
 - RESUMING
 - !RESUMING
 - NDMA
 - RUNNING

The combination states are actually two bit states and entry/exit are
defined in terms of both bits.

> What you describe is what happens if the device has RUNNING set and
> additionally SAVING is set, isn't it? 

Any change in SAVING & RUNNING that results in the new value being
SAVING | RUNNING must follow the above description.

So
  SAVING|0 -> SAVING|RUNNING
  0|RUNNING -> SAVING|RUNNING
  0 -> SAVING|RUNNING

Are all valid ways to reach this action.

This is the substantive difference between the actions and the
continuous, here something specific happes only on entry: 'clears the
data window and begins'

vs something like NDMA which is just continuously preventing DMA.

> > + - SAVING | !RUNNING
> > +   The device captures its internal state and begins streaming migration data
> > +   through the migration window
> > +
> > + - RESUMING
> > +   The data window is opened and can receive the migration data.
> > +
> > + - !RESUMING
> > +   All the data transferred into the data window is loaded into the device's
> > +   internal state. The migration driver can rely on userspace issuing a
> > +   VFIO_DEVICE_RESET prior to starting RESUMING.
> 
> Can we also fail migration? I.e. clearing RESUMING without setting RUNNING.

No, once RESUMING clears migration cannot be forced to fail, to abort
userspace should trigger reset.

This deserves some more language:

If the migration data is invalid the device should go to the ERROR state.

> > + - DIRTY TRACKING
> > +   On set clear the DMA log and start logging
> > +
> > +   On clear freeze the DMA log and allow userspace to read it. Userspace must
> > +   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
> > +   instance by using NDMA.
> > +
> > +   DMA logs should be readable with an "atomic test and clear" to allow
> > +   continuous non-disruptive sampling of the log.
> 
> I'm not sure whether including DIRTY TRACKING with the bits in
> device_state could lead to confusion...

It is part of the flow and userspace must sequence it properly, just
like VCPU. We can't properly describe everything without talking about
it.

> > +Continuous Actions:
> > +  - NDMA
> > +    The device is not allowed to issue new DMA operations.
> 
> Doesn't that make it an action trigger as well? I.e. when NDMA is set, a
> blocker for DMA operations is in place?

For clarity I didn't split things like that. All the continuous
behaviors start when the given bits begins and stop when the bits
end.

Most of the actions talk about changes in the data window

> > +    Before NDMA returns all in progress DMAs must be completed.
> 
> What does that mean? That the operation setting NDMA in device_state
> returns? 

Yes, it must be a synchronous behavior.

> > +  - !RUNNING
> > +    The device should not change its internal state. Implies NDMA. Any internal
> > +    state logging can stop.
> 
> So we have:
> - !RUNNING -- no DMA, regardless whether NDMA is set
> - RUNNING|NDMA -- the device can change its internal state, but not do
>   DMA
> 
> !RUNNING|!NDMA would basically be a valid state if a device is stopped
> before RESUMING, but not for outbound migration?

The reference flows are just examples we can all think on, it is
always valid to go to any of the legal bit patterns, but may not be
useful.

This specifically not a FSM so any before/after migration_state is
technically legal and the device should behave as described here.

> > +  - SAVING | !RUNNING
> > +    RESUMING | !RUNNING
> > +    The device may assume there are no incoming MMIO operations.
> > +
> > +  - RUNNING
> > +    The device can alter its internal state and must respond to incoming MMIO.
> > +
> > +  - SAVING | RUNNING
> > +    The device is logging changes to the internal state.
> > +
> > +  - !VCPU RUNNING
> > +    The CPU must not generate dirty pages or issue MMIO operations to devices.
> > +
> > +  - DIRTY TRACKING
> > +    DMAs are logged
> > +
> > +  - ERROR
> > +    The behavior of the device is undefined. The device must be recovered by
> > +    issuing VFIO_DEVICE_RESET.
> > +
> 
> I'm wondering whether it would be better to distinguish between
> individual bit meanings vs composite states than set/clear actions vs
> continuous actions. This could give us a good overview about what a
> device can/should do while in a certain state and what flipping a
> certain bit implies.

Again, refer to the mlx5 implementation, there are not actually
individual bits here controlling specific things. SAVING for instance
has no device behavior meaning when discussed in isolation.

Thanks,
Jason