linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Yishai Hadas <yishaih@nvidia.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"saeedm@nvidia.com" <saeedm@nvidia.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"kuba@kernel.org" <kuba@kernel.org>,
	"leonro@nvidia.com" <leonro@nvidia.com>,
	"kwankhede@nvidia.com" <kwankhede@nvidia.com>,
	"mgurtovoy@nvidia.com" <mgurtovoy@nvidia.com>,
	"maorg@nvidia.com" <maorg@nvidia.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"shameerali.kolothum.thodi@huawei.com" 
	<shameerali.kolothum.thodi@huawei.com>
Subject: Re: [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY
Date: Fri, 18 Feb 2022 10:06:18 -0400	[thread overview]
Message-ID: <20220218140618.GO4160@nvidia.com> (raw)
In-Reply-To: <BN9PR11MB527683AAB1D4CA76EB16ACF68C379@BN9PR11MB5276.namprd11.prod.outlook.com>

On Fri, Feb 18, 2022 at 08:01:47AM +0000, Tian, Kevin wrote:
 
> > A new ioctl VFIO_DEVICE_MIG_PRECOPY is provided to allow userspace to
> > query the progress of the precopy operation in the driver with the idea it
> > will judge to move to STOP_COPY at least once the initial data set is
> > transferred, and possibly after the dirty size has shrunk appropriately.
> > 
> > We think there may also be merit in future extensions to the
> > VFIO_DEVICE_MIG_PRECOPY ioctl to also command the device to throttle the
> > rate it generates internal dirty state.
> > 
> > Compared to the v1 clarification, STOP_COPY -> PRE_COPY is made optional
> 
> essentially it's *BLOCKED* per following context.

Yes I suppose now that we have the cap bits not the arc discovery this
isn't worded well

> > and to be defined in future. While making the whole PRE_COPY feature
> > optional eliminates the concern from mlx5, this is still a complicated arc
> > to implement and seems prudent to leave it closed until a proper use case
> 
> Can you shed some light on the complexity here?

It is with the data_fd, once a driver enters STOP_COPY it should stuff
its final state into the data_fd. If this is aborted back to PRE_COPY
then the data_fd needs to return to streaming changes. Managing this
transition is not trivial - it is something that has to be signaled to
the receiver.

There is also something of a race here where the data_fd can reach
end-of-stream and then the user can do STOP_COPY->PRE_COPY and
continue stuffing data. This makes the construction of the data stream
framing "interesting" as there is no longer a possible in-band end of
stream marker. See the other discussion about async operation why this
is not ideal.

Basically, it is behavior current qemu doesn't trigger that requires
significant complexity and testing in any driver to support
properly. No driver proposed

> Could a driver pretend supporting PRE_COPY by simply returning both 
> initial_bytes and dirty_bytes as ZERO?

I think so, yes.

> and even if the driver doesn't support the base arc (STOP_COPY->
> PRE_COPY_P2P) what about the combination arc (STOP_COPY->STOP->
> RUNNING_P2P->PRE_COPY_P2P)?

Userspace can walk through this sequence on its own, but it cannot be
part of the FSM because it violates the construction rules. The
data_fd is open in two places.

> current FSM already allows STOP->RUNNING_P2P->PRE_COPY_P2P and in
> concept STOP_COPY and STOP have exact same device behavior.

This is allowed because it follows the FSM rules. The data_fd is the
key difference.

> with that combination arc the interim transition from STOP_COPY to
> STOP will terminate the current data stream and RUNNING_P2P to
> PRE_COPY_P2P will return a new data fd. This does violate the definition
> about transition between three 'saving group' of states, which says
> moving between them does not terminate or otherwise affect the
> associated fd.

Right, and because this happens the VMM wuld have to terminate the
resuming session as well. Remember the output of a single saving
data_fd can be sent to a single receiving resuming data_fd - they
cannot be spliced.

> > is developed. We also split the pending_bytes report into the initial and
> > sustaining values, and define the protocol to get an event via poll() for
> 
> I guess this split must have been aligned in earlier discussion but it's still
> useful if some words can be put here for the motivation. Otherwise one 
> could easily ask why not treating the 1st read of pending_bytes as the 
> initial size.

As everything is estimates the approach allows the estimate to be
refined as we go along. PRE_COPY can stop at any time, but knowing
some initial mandatory stage has passed is somewhat consistent with
how qemu seems to treat this.

> > @@ -1596,25 +1596,59 @@ int vfio_mig_get_next_state(struct vfio_device
> > *device,
> >  	 *         RUNNING -> STOP
> >  	 *         STOP -> RUNNING
> 
> The comment for above should be updated too, which currently says:
> 
> 	 * Without P2P the driver must implement:
> 
> and also move it to the end as it talks about the arcs when neither
> P2P nor PRECOPY is supported.

Yes

> > + * PRE_COPY_P2P -> RUNNING_P2P
> >   * RUNNING -> RUNNING_P2P
> >   * STOP -> RUNNING_P2P
> >   *   While in RUNNING_P2P the device is partially running in the P2P
> > quiescent
> >   *   state defined below.
> >   *
> > + *   The PRE_COPY arc will terminate a data transfer session.
> 
> PRE_COPY_P2P

Yes

> 
> > + *
> > + * RUNNING -> PRE_COPY
> > + * RUNNING_P2P -> PRE_COPY_P2P
> >   * STOP -> STOP_COPY
> > - *   This arc begin the process of saving the device state and will return a
> > - *   new data_fd.
> > + *   PRE_COPY, PRE_COPY_P2P and STOP_COPY form the "saving group" of
> > states
> > + *   which share a data transfer session. Moving between these states alters
> > + *   what is streamed in session, but does not terminate or otherwise effect
> 
> 'effect' -> 'affect'?

yes

> > @@ -959,6 +1007,8 @@ struct vfio_device_feature_mig_state {
> >   * above FSM arcs. As there are multiple paths through the FSM arcs the
> > path
> >   * should be selected based on the following rules:
> >   *   - Select the shortest path.
> > + *   - The path cannot have saving group states as interior arcs, only
> > + *     starting/end states.
> 
> what about PRECOPY->PRECOPY_P2P->STOP_COPY? In this case
> PRECOPY_P2P is used as interior arc.

It isn't an interior arc because there are only two arcs :) But yes,
it is bit unclear.

> and if we disallow a non-saving-group state as interior arc when both 
> start and end states are saving-group states (e.g. 
> STOP_COPY->STOP->RUNNING_P2P->PRE_COPY_P2P as I asked in
> the start) then it might be another rule to be specified...

This isn't a shortest path.

> > @@ -972,6 +1022,9 @@ struct vfio_device_feature_mig_state {
> >   * support them. The user can disocver if these states are supported by using
> >   * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the
> > user can
> >   * avoid knowing about these optional states if the kernel driver supports
> > them.
> > + *
> > + * Arcs touching PRE_COPY and PRE_COPY_P2P are removed if support for
> > PRE_COPY
> > + * is not present.
> 
> why adding this sentence particularly for PRE_COPY? Isn't it already
> explained by last paragraph for optional states?

Well, I thought it was clarifying about how the optionality is
constructed.

> > + * Drivers should attempt to return estimates so that initial_bytes +
> > + * dirty_bytes matches the amount of data an immediate transition to
> > STOP_COPY
> > + * will require to be streamed.
> 
> I didn't understand this requirement. In an immediate transition to
> STOP_COPY I expect the amount of data covers the entire device
> state, i.e. initial_bytes. dirty_bytes are dynamic and iteratively returned
> then why we need set some expectation on the sum of 
> initial+round1_dity+round2_dirty+... 

"will require to be streamed" means additional data from this point
forward, not including anything already sent.

It turns into the estimate of how long STOP_COPY will take.

Jason

  reply	other threads:[~2022-02-18 14:06 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
2022-02-09  0:07   ` Alex Williamson
2022-02-09  2:36     ` Jason Gunthorpe
2022-02-15 10:41       ` Tian, Kevin
2022-02-15 16:04         ` Jason Gunthorpe
2022-02-15 23:32           ` Alex Williamson
2022-02-16  1:17             ` Jason Gunthorpe
2022-02-16  3:17           ` Tian, Kevin
2022-02-16 12:14             ` Jason Gunthorpe
2022-02-17  2:29               ` Tian, Kevin
2022-02-15 10:58       ` Tian, Kevin
2022-02-15 13:13         ` Jason Gunthorpe
2022-02-15  8:04   ` Tian, Kevin
2022-02-15 15:33     ` Jason Gunthorpe
2022-02-16  3:04       ` Tian, Kevin
2022-02-07 17:22 ` [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
2022-02-15 10:18   ` Tian, Kevin
2022-02-15 15:56     ` Jason Gunthorpe
2022-02-16  2:52       ` Tian, Kevin
2022-02-16 12:11         ` Jason Gunthorpe
2022-02-07 17:22 ` [PATCH V7 mlx5-next 10/15] vfio: Remove migration protocol v1 documentation Yishai Hadas
2022-02-11 11:03   ` Cornelia Huck
2022-02-07 17:22 ` [PATCH V7 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2022-02-09  0:07   ` Alex Williamson
2022-02-07 17:22 ` [PATCH V7 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
2022-02-09  0:08   ` Alex Williamson
2022-02-09  2:39     ` Jason Gunthorpe
2022-02-10 16:48       ` Alex Williamson
2022-02-10 17:27         ` Jason Gunthorpe
2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
2022-02-17 17:15   ` Alex Williamson
2022-02-18  0:03     ` Jason Gunthorpe
2022-02-18  8:01   ` Tian, Kevin
2022-02-18 14:06     ` Jason Gunthorpe [this message]
2022-02-22  1:43       ` Tian, Kevin
2022-02-22 15:50         ` Jason Gunthorpe
2022-02-23  0:40           ` Tian, Kevin
2022-02-23  0:44             ` Jason Gunthorpe
2022-02-23  1:46               ` Tian, Kevin
2022-02-18  8:11 ` [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Tarun Gupta (SW-GPU)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220218140618.GO4160@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=ashok.raj@intel.com \
    --cc=bhelgaas@google.com \
    --cc=kevin.tian@intel.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=leonro@nvidia.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=maorg@nvidia.com \
    --cc=mgurtovoy@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).