kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Tian, Kevin" <kevin.tian@intel.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "jgg@nvidia.com" <jgg@nvidia.com>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"farman@linux.ibm.com" <farman@linux.ibm.com>,
	"mjrosato@linux.ibm.com" <mjrosato@linux.ibm.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>
Subject: RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
Date: Mon, 10 Jan 2022 06:01:45 +0000	[thread overview]
Message-ID: <BN9PR11MB5276701E9AF2C6167E0D3FCC8C509@BN9PR11MB5276.namprd11.prod.outlook.com> (raw)
In-Reply-To: <20220107093611.6cbc6166.alex.williamson@redhat.com>

> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Saturday, January 8, 2022 12:36 AM
> 
> On Fri, 7 Jan 2022 08:03:57 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > Hi, Alex,
> >
> > Thanks for cleaning up this part, which is very helpful!
> >
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Friday, December 10, 2021 7:34 AM
> > >
> > > + *
> > > + *   The device_state field defines the following bitfield use:
> > > + *
> > > + *     - Bit 0 (RUNNING) [REQUIRED]:
> > > + *        - Setting this bit indicates the device is fully operational, the
> > > + *          device may generate interrupts, DMA, respond to MMIO, all vfio
> > > + *          device regions are functional, and the device may advance its
> > > + *          internal state.  The default device_state must indicate the device
> > > + *          in exclusively the RUNNING state, with no other bits in this field
> > > + *          set.
> > > + *        - Clearing this bit (ie. !RUNNING) must stop the operation of the
> > > + *          device.  The device must not generate interrupts, DMA, or
> advance
> > > + *          its internal state.
> >
> > I'm curious about what it means for the mediated device. I suppose this
> > 'must not' clause is from user p.o.v i.e. no event delivered to the user,
> > no DMA to user memory and no user visible change on mdev state.
> Physically
> > the device resource backing the mdev may still generate interrupt/DMA
> > to the host according to the mediation policy.
> >
> > Is this understanding correct?
> 
> Yes, one mediated device stopped running can't cause the backing
> device to halt, it must continue performing activities for other child
> devices as well as any host duties.  The user owned device should
> effectively stop.
> 
> > > +*           The user should take steps to restrict access
> > > + *          to vfio device regions other than the migration region while the
> > > + *          device is !RUNNING or risk corruption of the device migration
> data
> > > + *          stream.  The device and kernel migration driver must accept and
> > > + *          respond to interaction to support external subsystems in the
> > > + *          !RUNNING state, for example PCI MSI-X and PCI config space.
> >
> > and also respond to mmio access if some state is saved via reading mmio?
> 
> The device must not generate a host fault, ex. PCIe UR, but the idea
> here is that the device stops and preventing further access is the
> user's responsibility.  Failure to stop those accesses may result in
> corrupting the migration data.

ok. With this background I can understand what the last sentence tries
to say. It basically clarifies that while user access to the device is restricted
(by user itself) the kernel access (e.g. pci core, or the mediation driver 
itself) is allowed as long as doing so doesn't advance the to-be-saved state.

> 
> > > + *          Failure by the user to restrict device access while !RUNNING must
> > > + *          not result in error conditions outside the user context (ex.
> > > + *          host system faults).
> > > + *     - Bit 1 (SAVING) [REQUIRED]:
> > > + *        - Setting this bit enables and initializes the migration region data
> > > + *          window and associated fields within vfio_device_migration_info
> for
> > > + *          capturing the migration data stream for the device.  The
> migration
> > > + *          driver may perform actions such as enabling dirty logging of
> device
> > > + *          state with this bit.  The SAVING bit is mutually exclusive with the
> > > + *          RESUMING bit defined below.
> > > + *        - Clearing this bit (ie. !SAVING) de-initializes the migration region
> > > + *          data window and indicates the completion or termination of the
> > > + *          migration data stream for the device.
> > > + *     - Bit 2 (RESUMING) [REQUIRED]:
> > > + *        - Setting this bit enables and initializes the migration region data
> > > + *          window and associated fields within vfio_device_migration_info
> for
> > > + *          restoring the device from a migration data stream captured from
> a
> > > + *          SAVING session with a compatible device.  The migration driver
> may
> > > + *          perform internal device resets as necessary to reinitialize the
> > > + *          internal device state for the incoming migration data.
> > > + *        - Clearing this bit (ie. !RESUMING) de-initializes the migration
> > > + *          region data window and indicates the end of a resuming session
> for
> > > + *          the device.  The kernel migration driver should complete the
> > > + *          incorporation of data written to the migration data window into
> the
> > > + *          device internal state and perform final validity and consistency
> > > + *          checking of the new device state.  If the user provided data is
> > > + *          found to be incomplete, inconsistent, or otherwise invalid, the
> > > + *          migration driver must indicate a write(2) error and follow the
> > > + *          previously described protocol to return either the previous state
> > > + *          or an error state.
> > > + *     - Bit 3 (NDMA) [OPTIONAL]:
> > > + *        The NDMA or "No DMA" state is intended to be a quiescent state
> for
> > > + *        the device for the purposes of managing multiple devices within a
> > > + *        user context where peer-to-peer DMA between devices may be
> active.
> >
> > As discussed with Jason in another thread, this is also required for vPRI
> > when stopping DMA involves completing (instead of preempting) in-fly
> > requests then any vPRI for those requests must be completed when vcpu
> > is running. This cannot be done in !RUNNING which is typically transitioned
> > to after stopping vcpu.
> >
> > It is also useful when the time of stopping device DMA is unbound (even
> > without vPRI). Having a failure path when vcpu is running avoids breaking
> > SLA (if only capturing it after stopping vcpu). This further requires certain
> > interface for the user to specify a timeout value for entering NDMA, though
> > unclear to me what it will be now.
> >
> > > + *        Support for the NDMA bit is indicated through the presence of the
> > > + *        VFIO_REGION_INFO_CAP_MIG_NDMA capability as reported by
> > > + *        VFIO_DEVICE_GET_REGION_INFO for the associated device
> migration
> > > + *        region.
> > > + *        - Setting this bit must prevent the device from initiating any
> > > + *          new DMA or interrupt transactions.  The migration driver must
> >
> > Why also disabling interrupt? vcpu is still running at this point thus
> interrupt
> > could be triggered for many reasons other than DMA...
> 
> It's my understanding that the vCPU would be halted for the NDMA use
> case, we can't very well have vCPUs demanding requests to devices that
> are prevented from completing them.  The NDMA phase is intended to
> support completion of outstanding requests without concurrently
> accepting new requests, AIUI.

By current definition NDMA doesn't require the user to restrict access
to vfio device regions as for !RUNNING. So it's probably not good to tie it
with vcpu stopped.

As explained above it's also required when vPRI is enabled. At least for
current SVA-capable devices from Intel and Huawei they all require
waiting for vPRI completion before transitioning to NDMA can be done
while completing vPRI requires running vcpu.

New requests between NDMA and !RUNNING need to be mediated/
queued by the migration driver and then re-submitted on the 
destination node. This further requires certain mechanism to allow 
dynamically changing the mediation vs. passthru policy on the submit
portal.

> 
> Further conversations in this thread allow for interrupts and deduce
> that the primary requirement of NDMA is to restrict P2P DMA, which can
> be approximated as all non-MSI DMA.
> 
> > > + *          complete any such outstanding operations prior to completing
> > > + *          the transition to the NDMA state.  The NDMA device_state
> > > + *          essentially represents a sub-set of the !RUNNING state for the
> > > + *          purpose of quiescing the device, therefore the NDMA
> device_state
> > > + *          bit is superfluous in combinations including !RUNNING.
> >
> > 'superfluous' means doing so will get a failure, or just not recommended?
> 
> Superfluous meaning redundant.  It's allowed, but DMA is already
> restricted when !RUNNING, so setting NDMA when !RUNNING is meaningless.
> 
> > > + *        - Clearing this bit (ie. !NDMA) negates the device operational
> > > + *          restrictions required by the NDMA state.
> > > + *     - Bits [31:4]:
> > > + *        Reserved for future use, users should use read-modify-write
> > > + *        operations to the device_state field for manipulation of the above
> > > + *        defined bits for optimal compatibility.
> > > + *
> 
> FWIW, I'm expecting to see an alternative uAPI propose using a FSM
> machine in the near future, so while this clarifies what I believe is
> the intention of the existing uAPI, it might be deprecated before we
> bother to commit such clarifications.  Thanks,
> 

got it.

Thanks
Kevin

      reply	other threads:[~2022-01-10  6:01 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-09 23:34 [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state Alex Williamson
2021-12-10  1:25 ` Jason Gunthorpe
2021-12-13 20:40   ` Alex Williamson
2021-12-14 12:08     ` Cornelia Huck
2021-12-14 16:26     ` Jason Gunthorpe
2021-12-20 22:26       ` Alex Williamson
2022-01-04 20:28         ` Jason Gunthorpe
2022-01-06 18:17           ` Alex Williamson
2022-01-06 21:20             ` Jason Gunthorpe
2022-01-10  7:55               ` Tian, Kevin
2022-01-10 17:34                 ` Alex Williamson
2022-01-11  2:41                   ` Tian, Kevin
2022-01-10 18:11                 ` Jason Gunthorpe
2022-01-11  3:14                   ` Tian, Kevin
2022-01-11 18:19                     ` Jason Gunthorpe
2022-01-04  3:49       ` Tian, Kevin
2022-01-04 16:09         ` Jason Gunthorpe
2022-01-05  1:59           ` Tian, Kevin
2022-01-05 12:45             ` Jason Gunthorpe
2022-01-06  6:32               ` Tian, Kevin
2022-01-06 15:42                 ` Jason Gunthorpe
2022-01-07  0:00                   ` Tian, Kevin
2022-01-07  0:29                     ` Jason Gunthorpe
2022-01-07  2:01                       ` Tian, Kevin
2022-01-07 17:23                         ` Jason Gunthorpe
2022-01-10  3:14                           ` Tian, Kevin
2022-01-10 17:52                             ` Jason Gunthorpe
2022-01-11  2:57                               ` Tian, Kevin
2022-01-05  3:06           ` Tian, Kevin
2021-12-20 17:38 ` Cornelia Huck
2021-12-20 22:49   ` Alex Williamson
2021-12-21 11:24     ` Cornelia Huck
2022-01-07  8:03 ` Tian, Kevin
2022-01-07 16:36   ` Alex Williamson
2022-01-10  6:01     ` Tian, Kevin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BN9PR11MB5276701E9AF2C6167E0D3FCC8C509@BN9PR11MB5276.namprd11.prod.outlook.com \
    --to=kevin.tian@intel.com \
    --cc=alex.williamson@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=corbet@lwn.net \
    --cc=farman@linux.ibm.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=mjrosato@linux.ibm.com \
    --cc=pasic@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).