All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>,
	Doug Ledford <dledford@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org,
	netdev@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>,
	Cornelia Huck <cohuck@redhat.com>
Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
Date: Wed, 29 Sep 2021 15:26:42 -0300	[thread overview]
Message-ID: <20210929182642.GV964074@nvidia.com> (raw)
In-Reply-To: <20210929120655.28e0b3a6.alex.williamson@redhat.com>

On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:16:02 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > > On Tue, 28 Sep 2021 16:35:50 -0300
> > > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >   
> > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > > >   
> > > > > In defining the device state, we tried to steer away from defining it
> > > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > > that could be used to support that API to leave us some degree of
> > > > > independence that QEMU implementation might evolve.    
> > > > 
> > > > That is certainly a different perspective, it would have been
> > > > better to not express this idea as a FSM in that case...
> > > > 
> > > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > > combination of (un)freeze, (un)quiesce and so on so each state
> > > > reflects a defined operation of the device?  
> > > 
> > > I'd expect so, for instance the implementation of entering the _STOP
> > > state presumes a previous state that where the device is apparently
> > > already quiesced.  That doesn't support a direct _RUNNING -> _STOP
> > > transition where I argued in the linked threads that those states
> > > should be reachable from any other state.  Thanks,  
> > 
> > If we focus on mlx5 there are two device 'flags' to manage:
> >  - Device cannot issue DMAs
> >  - Device internal state cannot change (ie cannot receive DMAs)
> > 
> > This is necessary to co-ordinate across multiple devices that might be
> > doing peer to peer DMA between them. The whole multi-device complex
> > should be moved to "cannot issue DMA's" then the whole complex would
> > go to "state cannot change" and be serialized.
> 
> Are you anticipating p2p from outside the VM?  The typical scenario
> here would be that p2p occurs only intra-VM, so all the devices would
> stop issuing DMA (modulo trying to quiesce devices simultaneously).

Inside the VM.

Your 'modulo trying to quiesce devices simultaneously' is correct -
this is a real issue that needs to be solved.

If we put one device in a state where it's internal state is immutable
it can no longer accept DMA messages from the other devices. So there
are two states in the HW model - do not generate DMAs and finally the
immutable internal state where even external DMAs are refused.

> > The expected sequence at the device is thus
> > 
> > Resuming
> >  full stop -> does not issue DMAs -> full operation
> > Suspend
> >  full operation -> does not issue DMAs -> full stop
> > 
> > Further the device has two actions
> >  - Trigger serializating the device state
> >  - Trigger de-serializing the device state
> > 
> > So, what is the behavior upon each state:
> > 
> >  *  000b => Device Stopped, not saving or resuming
> >      Does not issue DMAs
> >      Internal state cannot change
> > 
> >  *  001b => Device running, which is the default state
> >      Neither flags
> > 
> >  *  010b => Stop the device & save the device state, stop-and-copy state
> >      Does not issue DMAs
> >      Internal state cannot change
> > 
> >  *  011b => Device running and save the device state, pre-copy state
> >      Neither flags
> >      (future, DMA tracking turned on)
> > 
> >  *  100b => Device stopped and the device state is resuming
> >      Does not issue DMAs
> >      Internal state cannot change
> 
> cannot change... other that as loaded via migration region.

Yes

> The expected protocol is that if the user write to the device_state
> register returns an errno, the user reevaluates the device_state to
> determine if the desired transition is unavailable (previous state
> value is returned) or generated a fault (error state value
> returned).

Hmm, interesting, mlx5 should be doing this as well. Eg resuming with
corrupt state should fail and cannot be recovered except via reset.

> The 101b state indicates _RUNNING while _RESUMING, which is simply not
> a mode that has been spec'd at this time as it would require some
> mechanism for the device to fault in state on demand.

So lets error on these requests since we don't know what state to put
the device into.

> > The two actions:
> >  trigger serializing the device state
> >    Done when asked to go to 010b ?
> 
> When the _SAVING bit is set.  The exact mechanics depends on the size
> and volatility of the device state.  A GPU might begin in pre-copy
> (011b) to transmit chunks of framebuffer data, recording hashes of
> blocks read by the user to avoid re-sending them during the
> stop-and-copy (010b) phase.  

Here I am talking specifically about mlx5 which does not have a state
capture in pre-copy. So mlx5 should capture state on 010b only, and
the 011b is a NOP.

> >  trigger de-serializing the device state
> >    Done when transition from 100b -> 000b ?
> 
> 100b -> 000b is not a required transition, generally this would be 100b
> -> 001b, ie. end state of _RUNNING vs _STOP.

Sorry, I typo'd it, yes to _RUNNING

> I think the requirement is that de-serialization is complete when the
> _RESUMING bit is cleared.  Whether the driver chooses to de-serialize
> piece-wise as each block of data is written to the device or in bulk
> from a buffer is left to the implementation.  In either case, the
> driver can fail the transition to !_RESUMING if the state is incomplete
> or otherwise corrupt.  It would again be the driver's discretion if
> the device enters the error state or remains in _RESUMING.  If the user
> has no valid state with which to exit the _RESUMING phase, a device
> reset should return the device to _RUNNING with a default initial state.

That makes sense enough.

> > There is a missing state "Stop Active Transactions" which would be
> > only "does not issue DMAs". I've seen a proposal to add that.
> 
> This would be to get all devices to stop issuing DMA while internal
> state can be modified to avoid the synchronization issue of trying to
> stop devices concurrently?  

Yes, as above

> For PCI devices we obviously have the bus master bit to manage that,
> but I could see how a migration extension for such support (perhaps
> even just wired through to BM for PCI) could be useful.  Thanks,

I'm nervous to override the BM bit for something like this, the BM bit
isn't a gentle "please coherently stop what you are doing" it is a
hanbrake the OS pulls to ensure any PCI device becomes quiet.

Thanks,
Jason

  reply	other threads:[~2021-09-29 18:26 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-22 10:38 [PATCH mlx5-next 0/7] Add mlx5 live migration driver Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index Leon Romanovsky
2021-09-22 21:59   ` Bjorn Helgaas
2021-09-23  6:35     ` Leon Romanovsky
2021-09-24 13:08       ` Bjorn Helgaas
2021-09-25 10:10         ` Leon Romanovsky
2021-09-25 17:41           ` Bjorn Helgaas
2021-09-26  6:36             ` Leon Romanovsky
2021-09-26 20:23               ` Bjorn Helgaas
2021-09-27 11:55                 ` Leon Romanovsky
2021-09-27 14:47                   ` Bjorn Helgaas
2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
2021-09-23 10:33   ` Shameerali Kolothum Thodi
2021-09-23 11:17     ` Leon Romanovsky
2021-09-23 13:55       ` Max Gurtovoy
2021-09-24  7:44         ` Shameerali Kolothum Thodi
2021-09-24  9:37           ` Kirti Wankhede
2021-09-26  9:09           ` Max Gurtovoy
2021-09-26 16:17             ` Shameerali Kolothum Thodi
2021-09-27 18:24               ` Max Gurtovoy
2021-09-27 18:29                 ` Shameerali Kolothum Thodi
2021-09-27 22:46   ` Alex Williamson
2021-09-27 23:12     ` Jason Gunthorpe
2021-09-28 19:19       ` Alex Williamson
2021-09-28 19:35         ` Jason Gunthorpe
2021-09-28 20:18           ` Alex Williamson
2021-09-29 16:16             ` Jason Gunthorpe
2021-09-29 18:06               ` Alex Williamson
2021-09-29 18:26                 ` Jason Gunthorpe [this message]
2021-09-29 10:57         ` Max Gurtovoy
2021-09-29 10:44       ` Max Gurtovoy
2021-09-29 12:35         ` Alex Williamson
2021-09-29 13:26           ` Max Gurtovoy
2021-09-29 13:50             ` Alex Williamson
2021-09-29 14:36               ` Max Gurtovoy
2021-09-29 15:17                 ` Alex Williamson
2021-09-29 15:28                   ` Max Gurtovoy
2021-09-29 16:14                     ` Jason Gunthorpe
2021-09-29 21:48                       ` Max Gurtovoy
2021-09-29 22:44                         ` Alex Williamson
2021-09-30  9:25                           ` Max Gurtovoy
2021-09-30 12:41                             ` Alex Williamson
2021-09-29 23:21                         ` Jason Gunthorpe
2021-09-30  9:34                           ` Max Gurtovoy
2021-09-30 14:47                             ` Jason Gunthorpe
2021-09-30 15:32                               ` Max Gurtovoy
2021-09-30 16:24                                 ` Jason Gunthorpe
2021-09-30 16:51                                   ` Max Gurtovoy
2021-09-30 17:01                                     ` Jason Gunthorpe
2021-09-22 10:38 ` [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional Leon Romanovsky
2021-09-23 13:57   ` Max Gurtovoy
2021-09-22 10:38 ` [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures Leon Romanovsky
2021-09-24  5:48   ` Mark Zhang
2021-09-22 10:38 ` [PATCH mlx5-next 5/7] net/mlx5: Expose APIs to get/put the mlx5 core device Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device Leon Romanovsky
2021-09-28 20:22   ` Alex Williamson
2021-09-29  5:36     ` Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 7/7] mlx5_vfio_pci: Implement vfio_pci driver for mlx5 devices Leon Romanovsky

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210929182642.GV964074@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=cohuck@redhat.com \
    --cc=davem@davemloft.net \
    --cc=dledford@redhat.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=leon@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.