kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Cornelia Huck <cohuck@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org
Cc: kvm@vger.kernel.org, Kirti Wankhede <kwankhede@nvidia.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC] vfio: Documentation for the migration region
Date: Tue, 23 Nov 2021 15:21:06 +0100	[thread overview]
Message-ID: <87zgpvj6lp.fsf@redhat.com> (raw)
In-Reply-To: <0-v1-0ec87874bede+123-vfio_mig_doc_jgg@nvidia.com>

On Mon, Nov 22 2021, Jason Gunthorpe <jgg@nvidia.com> wrote:

> Provide some more complete documentation for the migration region's
> behavior, specifically focusing on the device_state bits and the whole
> system view from a VMM.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  Documentation/driver-api/vfio.rst | 208 +++++++++++++++++++++++++++++-
>  1 file changed, 207 insertions(+), 1 deletion(-)
>
> Alex/Cornelia, here is the first draft of the requested documentation I promised

Thanks, I'm taking a quick first look.

[As it's Thanksgiving week in the US, I don't think Alex will be reading
this right now.]

>
> We think it includes all the feedback from hns, Intel and NVIDIA on this mechanism.
>
> Our thinking is that NDMA would be implemented like this:
>
>    +#define VFIO_DEVICE_STATE_NDMA      (1 << 3)
>
> And a .add_capability ops will be used to signal to userspace driver support:
>
>    +#define VFIO_REGION_INFO_CAP_MIGRATION_NDMA    6
>
> I've described DIRTY TRACKING as a seperate concept here. With the current
> uAPI this would be controlled by VFIO_IOMMU_DIRTY_PAGES_FLAG_START, with our
> change in direction this would be per-tracker control, but no semantic change.
>
> Upon some agreement we'll include this patch in the next iteration of the mlx5 driver
> along with the NDMA bits.
>
> Thanks,
> Jason
>
> diff --git a/Documentation/driver-api/vfio.rst b/Documentation/driver-api/vfio.rst
> index c663b6f978255b..b28c6fb89ee92f 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -242,7 +242,213 @@ group and can access them as follows::
>  VFIO User API
>  -------------------------------------------------------------------------------
>  
> -Please see include/linux/vfio.h for complete API documentation.
> +Please see include/uapi/linux/vfio.h for complete API documentation.
> +
> +-------------------------------------------------------------------------------
> +
> +VFIO migration driver API
> +-------------------------------------------------------------------------------
> +
> +VFIO drivers that support migration implement a migration control register
> +called device_state in the struct vfio_device_migration_info which is in its
> +VFIO_REGION_TYPE_MIGRATION region.
> +
> +The device_state triggers device action both when bits are set/cleared and
> +continuous behavior for each bit.

I had trouble parsing this sentence, until I read further down... maybe
use something like the following:

The device_state controls both device action and continuous behaviour.
Setting/clearing a bit triggers device action, and each bit controls
continuous behaviour.

> For VMMs they can also control if the VCPUs in
> +a VM are executing (VCPU RUNNING) and if the IOMMU is logging DMAs (DIRTY
> +TRACKING). These two controls are not part of the device_state register, KVM
> +will be used to control the VCPU and VFIO_IOMMU_DIRTY_PAGES_FLAG_START on the
> +container controls dirty tracking.

We usually try to keep kvm out of documentation for the vfio
interfaces; better frame that as an example?

> +
> +Along with the device_state the migration driver provides a data window which
> +allows streaming migration data into or out of the device.
> +
> +A lot of flexibility is provided to userspace in how it operates these bits. The
> +reference flow for saving device state in a live migration, with all features:

It may also vary depending on the device being migrated (a subchannel
passed via vfio-ccw will behave differently than a pci device.)

> +
> +  RUNNING, VCPU_RUNNING

Nit: everywhere else you used "VCPU RUNNING".

Also, can we separate device state bits as defined in vfio.h and VMM
state bits visually a bit :) better?

> +     Normal operating state
> +  RUNNING, DIRTY TRACKING, VCPU RUNNING
> +     Log DMAs
> +     Stream all memory

all memory accessed by the device?

> +  SAVING | RUNNING, DIRTY TRACKING, VCPU RUNNING
> +     Log internal device changes (pre-copy)
> +     Stream device state through the migration window
> +
> +     While in this state repeat as desired:
> +	Atomic Read and Clear DMA Dirty log
> +	Stream dirty memory
> +  SAVING | NDMA | RUNNING, VCPU RUNNING
> +     vIOMMU grace state
> +     Complete all in progress IO page faults, idle the vIOMMU
> +  SAVING | NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +     Final snapshot of DMA dirty log (atomic not required)
> +  SAVING
> +     Stream final device state through the migration window
> +     Copy final dirty data
> +  0
> +     Device is halted
> +
> +and the reference flow for resuming:
> +
> +  RUNNING
> +     Issue VFIO_DEVICE_RESET to clear the internal device state
> +  0
> +     Device is halted
> +  RESUMING
> +     Push in migration data. Data captured during pre-copy should be
> +     prepended to data captured during SAVING.
> +  NDMA | RUNNING
> +     Peer to Peer DMA grace state
> +  RUNNING, VCPU RUNNING
> +     Normal operating state
> +
> +If the VMM has multiple VFIO devices undergoing migration then the grace states
> +act as cross device synchronization points. The VMM must bring all devices to
> +the grace state before advancing past it.
> +
> +To support these operations the migration driver is required to implement
> +specific behaviors around the device_state.
> +
> +Actions on Set/Clear:
> + - SAVING | RUNNING
> +   The device clears the data window and begins streaming 'pre copy' migration
> +   data through the window. Device that cannot log internal state changes return
> +   a 0 length migration stream.

Hm. This and the following are "combination states", i.e. not what I'd
expect if I read about setting/clearing bits. What you describe is what
happens if the device has RUNNING set and additionally SAVING is set,
isn't it? What happens if we set SAVING while !RUNNING? The action below
looks like what is happening when RUNNING is cleared while SAVING is set.

> +
> + - SAVING | !RUNNING
> +   The device captures its internal state and begins streaming migration data
> +   through the migration window
> +
> + - RESUMING
> +   The data window is opened and can receive the migration data.
> +
> + - !RESUMING
> +   All the data transferred into the data window is loaded into the device's
> +   internal state. The migration driver can rely on userspace issuing a
> +   VFIO_DEVICE_RESET prior to starting RESUMING.

Can we also fail migration? I.e. clearing RESUMING without setting RUNNING.

> +
> + - DIRTY TRACKING
> +   On set clear the DMA log and start logging
> +
> +   On clear freeze the DMA log and allow userspace to read it. Userspace must
> +   take care to ensure that DMA is suspended before clearing DIRTY TRACKING, for
> +   instance by using NDMA.
> +
> +   DMA logs should be readable with an "atomic test and clear" to allow
> +   continuous non-disruptive sampling of the log.

I'm not sure whether including DIRTY TRACKING with the bits in
device_state could lead to confusion...

> +
> +Continuous Actions:
> +  - NDMA
> +    The device is not allowed to issue new DMA operations.

Doesn't that make it an action trigger as well? I.e. when NDMA is set, a
blocker for DMA operations is in place?

> +    Before NDMA returns all in progress DMAs must be completed.

What does that mean? That the operation setting NDMA in device_state
returns? This would effectively make setting NDMA a trigger to actually
stop doing any DMA.

> +
> +  - !RUNNING
> +    The device should not change its internal state. Implies NDMA. Any internal
> +    state logging can stop.

So we have:
- !RUNNING -- no DMA, regardless whether NDMA is set
- RUNNING|NDMA -- the device can change its internal state, but not do
  DMA

!RUNNING|!NDMA would basically be a valid state if a device is stopped
before RESUMING, but not for outbound migration?

> +
> +  - SAVING | !RUNNING
> +    RESUMING | !RUNNING
> +    The device may assume there are no incoming MMIO operations.
> +
> +  - RUNNING
> +    The device can alter its internal state and must respond to incoming MMIO.
> +
> +  - SAVING | RUNNING
> +    The device is logging changes to the internal state.
> +
> +  - !VCPU RUNNING
> +    The CPU must not generate dirty pages or issue MMIO operations to devices.
> +
> +  - DIRTY TRACKING
> +    DMAs are logged
> +
> +  - ERROR
> +    The behavior of the device is undefined. The device must be recovered by
> +    issuing VFIO_DEVICE_RESET.
> +

I'm wondering whether it would be better to distinguish between
individual bit meanings vs composite states than set/clear actions vs
continuous actions. This could give us a good overview about what a
device can/should do while in a certain state and what flipping a
certain bit implies.

<I still need to read through the rest of the document>


  parent reply	other threads:[~2021-11-23 14:21 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-22 19:53 [PATCH RFC] vfio: Documentation for the migration region Jason Gunthorpe
2021-11-22 20:31 ` Jonathan Corbet
2021-11-23  0:20   ` Jason Gunthorpe
2021-11-23  7:22     ` Akira Yokosawa
2021-11-23 14:21 ` Cornelia Huck [this message]
2021-11-23 16:53   ` Jason Gunthorpe
2021-11-24 16:55     ` Cornelia Huck
2021-11-24 18:40       ` Jason Gunthorpe
2021-11-25 12:27         ` Cornelia Huck
2021-11-25 16:14           ` Jason Gunthorpe
2021-11-26 12:56             ` Cornelia Huck
2021-11-26 13:06               ` Jason Gunthorpe
2021-11-26 15:01                 ` Cornelia Huck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zgpvj6lp.fsf@redhat.com \
    --to=cohuck@redhat.com \
    --cc=alex.williamson@redhat.com \
    --cc=corbet@lwn.net \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=mgurtovoy@nvidia.com \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).