From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>,
Doug Ledford <dledford@redhat.com>,
Yishai Hadas <yishaih@nvidia.com>,
Bjorn Helgaas <bhelgaas@google.com>,
"David S. Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Kirti Wankhede <kwankhede@nvidia.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-pci@vger.kernel.org, linux-rdma@vger.kernel.org,
netdev@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>,
Cornelia Huck <cohuck@redhat.com>
Subject: Re: [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity
Date: Wed, 29 Sep 2021 15:26:42 -0300 [thread overview]
Message-ID: <20210929182642.GV964074@nvidia.com> (raw)
In-Reply-To: <20210929120655.28e0b3a6.alex.williamson@redhat.com>
On Wed, Sep 29, 2021 at 12:06:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Sep 2021 13:16:02 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Tue, Sep 28, 2021 at 02:18:44PM -0600, Alex Williamson wrote:
> > > On Tue, 28 Sep 2021 16:35:50 -0300
> > > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > > On Tue, Sep 28, 2021 at 01:19:58PM -0600, Alex Williamson wrote:
> > > >
> > > > > In defining the device state, we tried to steer away from defining it
> > > > > in terms of the QEMU migration API, but rather as a set of controls
> > > > > that could be used to support that API to leave us some degree of
> > > > > independence that QEMU implementation might evolve.
> > > >
> > > > That is certainly a different perspective, it would have been
> > > > better to not express this idea as a FSM in that case...
> > > >
> > > > So each state in mlx5vf_pci_set_device_state() should call the correct
> > > > combination of (un)freeze, (un)quiesce and so on so each state
> > > > reflects a defined operation of the device?
> > >
> > > I'd expect so, for instance the implementation of entering the _STOP
> > > state presumes a previous state that where the device is apparently
> > > already quiesced. That doesn't support a direct _RUNNING -> _STOP
> > > transition where I argued in the linked threads that those states
> > > should be reachable from any other state. Thanks,
> >
> > If we focus on mlx5 there are two device 'flags' to manage:
> > - Device cannot issue DMAs
> > - Device internal state cannot change (ie cannot receive DMAs)
> >
> > This is necessary to co-ordinate across multiple devices that might be
> > doing peer to peer DMA between them. The whole multi-device complex
> > should be moved to "cannot issue DMA's" then the whole complex would
> > go to "state cannot change" and be serialized.
>
> Are you anticipating p2p from outside the VM? The typical scenario
> here would be that p2p occurs only intra-VM, so all the devices would
> stop issuing DMA (modulo trying to quiesce devices simultaneously).
Inside the VM.
Your 'modulo trying to quiesce devices simultaneously' is correct -
this is a real issue that needs to be solved.
If we put one device in a state where it's internal state is immutable
it can no longer accept DMA messages from the other devices. So there
are two states in the HW model - do not generate DMAs and finally the
immutable internal state where even external DMAs are refused.
> > The expected sequence at the device is thus
> >
> > Resuming
> > full stop -> does not issue DMAs -> full operation
> > Suspend
> > full operation -> does not issue DMAs -> full stop
> >
> > Further the device has two actions
> > - Trigger serializating the device state
> > - Trigger de-serializing the device state
> >
> > So, what is the behavior upon each state:
> >
> > * 000b => Device Stopped, not saving or resuming
> > Does not issue DMAs
> > Internal state cannot change
> >
> > * 001b => Device running, which is the default state
> > Neither flags
> >
> > * 010b => Stop the device & save the device state, stop-and-copy state
> > Does not issue DMAs
> > Internal state cannot change
> >
> > * 011b => Device running and save the device state, pre-copy state
> > Neither flags
> > (future, DMA tracking turned on)
> >
> > * 100b => Device stopped and the device state is resuming
> > Does not issue DMAs
> > Internal state cannot change
>
> cannot change... other that as loaded via migration region.
Yes
> The expected protocol is that if the user write to the device_state
> register returns an errno, the user reevaluates the device_state to
> determine if the desired transition is unavailable (previous state
> value is returned) or generated a fault (error state value
> returned).
Hmm, interesting, mlx5 should be doing this as well. Eg resuming with
corrupt state should fail and cannot be recovered except via reset.
> The 101b state indicates _RUNNING while _RESUMING, which is simply not
> a mode that has been spec'd at this time as it would require some
> mechanism for the device to fault in state on demand.
So lets error on these requests since we don't know what state to put
the device into.
> > The two actions:
> > trigger serializing the device state
> > Done when asked to go to 010b ?
>
> When the _SAVING bit is set. The exact mechanics depends on the size
> and volatility of the device state. A GPU might begin in pre-copy
> (011b) to transmit chunks of framebuffer data, recording hashes of
> blocks read by the user to avoid re-sending them during the
> stop-and-copy (010b) phase.
Here I am talking specifically about mlx5 which does not have a state
capture in pre-copy. So mlx5 should capture state on 010b only, and
the 011b is a NOP.
> > trigger de-serializing the device state
> > Done when transition from 100b -> 000b ?
>
> 100b -> 000b is not a required transition, generally this would be 100b
> -> 001b, ie. end state of _RUNNING vs _STOP.
Sorry, I typo'd it, yes to _RUNNING
> I think the requirement is that de-serialization is complete when the
> _RESUMING bit is cleared. Whether the driver chooses to de-serialize
> piece-wise as each block of data is written to the device or in bulk
> from a buffer is left to the implementation. In either case, the
> driver can fail the transition to !_RESUMING if the state is incomplete
> or otherwise corrupt. It would again be the driver's discretion if
> the device enters the error state or remains in _RESUMING. If the user
> has no valid state with which to exit the _RESUMING phase, a device
> reset should return the device to _RUNNING with a default initial state.
That makes sense enough.
> > There is a missing state "Stop Active Transactions" which would be
> > only "does not issue DMAs". I've seen a proposal to add that.
>
> This would be to get all devices to stop issuing DMA while internal
> state can be modified to avoid the synchronization issue of trying to
> stop devices concurrently?
Yes, as above
> For PCI devices we obviously have the bus master bit to manage that,
> but I could see how a migration extension for such support (perhaps
> even just wired through to BM for PCI) could be useful. Thanks,
I'm nervous to override the BM bit for something like this, the BM bit
isn't a gentle "please coherently stop what you are doing" it is a
hanbrake the OS pulls to ensure any PCI device becomes quiet.
Thanks,
Jason
next prev parent reply other threads:[~2021-09-29 18:26 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <cover.1632305919.git.leonro@nvidia.com>
2021-09-22 10:38 ` [PATCH mlx5-next 1/7] PCI/IOV: Provide internal VF index Leon Romanovsky
2021-09-22 21:59 ` Bjorn Helgaas
2021-09-23 6:35 ` Leon Romanovsky
2021-09-24 13:08 ` Bjorn Helgaas
2021-09-25 10:10 ` Leon Romanovsky
2021-09-25 17:41 ` Bjorn Helgaas
2021-09-26 6:36 ` Leon Romanovsky
2021-09-26 20:23 ` Bjorn Helgaas
2021-09-27 11:55 ` Leon Romanovsky
2021-09-27 14:47 ` Bjorn Helgaas
2021-09-22 10:38 ` [PATCH mlx5-next 2/7] vfio: Add an API to check migration state transition validity Leon Romanovsky
2021-09-23 10:33 ` Shameerali Kolothum Thodi
2021-09-23 11:17 ` Leon Romanovsky
2021-09-23 13:55 ` Max Gurtovoy
2021-09-24 7:44 ` Shameerali Kolothum Thodi
2021-09-24 9:37 ` Kirti Wankhede
2021-09-26 9:09 ` Max Gurtovoy
2021-09-26 16:17 ` Shameerali Kolothum Thodi
2021-09-27 18:24 ` Max Gurtovoy
2021-09-27 18:29 ` Shameerali Kolothum Thodi
2021-09-27 22:46 ` Alex Williamson
2021-09-27 23:12 ` Jason Gunthorpe
2021-09-28 19:19 ` Alex Williamson
2021-09-28 19:35 ` Jason Gunthorpe
2021-09-28 20:18 ` Alex Williamson
2021-09-29 16:16 ` Jason Gunthorpe
2021-09-29 18:06 ` Alex Williamson
2021-09-29 18:26 ` Jason Gunthorpe [this message]
2021-09-29 10:57 ` Max Gurtovoy
2021-09-29 10:44 ` Max Gurtovoy
2021-09-29 12:35 ` Alex Williamson
2021-09-29 13:26 ` Max Gurtovoy
2021-09-29 13:50 ` Alex Williamson
2021-09-29 14:36 ` Max Gurtovoy
2021-09-29 15:17 ` Alex Williamson
2021-09-29 15:28 ` Max Gurtovoy
2021-09-29 16:14 ` Jason Gunthorpe
2021-09-29 21:48 ` Max Gurtovoy
2021-09-29 22:44 ` Alex Williamson
2021-09-30 9:25 ` Max Gurtovoy
2021-09-30 12:41 ` Alex Williamson
2021-09-29 23:21 ` Jason Gunthorpe
2021-09-30 9:34 ` Max Gurtovoy
2021-09-30 14:47 ` Jason Gunthorpe
2021-09-30 15:32 ` Max Gurtovoy
2021-09-30 16:24 ` Jason Gunthorpe
2021-09-30 16:51 ` Max Gurtovoy
2021-09-30 17:01 ` Jason Gunthorpe
2021-09-22 10:38 ` [PATCH mlx5-next 3/7] vfio/pci_core: Make the region->release() function optional Leon Romanovsky
2021-09-23 13:57 ` Max Gurtovoy
2021-09-22 10:38 ` [PATCH mlx5-next 4/7] net/mlx5: Introduce migration bits and structures Leon Romanovsky
2021-09-24 5:48 ` Mark Zhang
2021-09-22 10:38 ` [PATCH mlx5-next 5/7] net/mlx5: Expose APIs to get/put the mlx5 core device Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 6/7] mlx5_vfio_pci: Expose migration commands over mlx5 device Leon Romanovsky
2021-09-28 20:22 ` Alex Williamson
2021-09-29 5:36 ` Leon Romanovsky
2021-09-22 10:38 ` [PATCH mlx5-next 7/7] mlx5_vfio_pci: Implement vfio_pci driver for mlx5 devices Leon Romanovsky
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210929182642.GV964074@nvidia.com \
--to=jgg@nvidia.com \
--cc=alex.williamson@redhat.com \
--cc=bhelgaas@google.com \
--cc=cohuck@redhat.com \
--cc=davem@davemloft.net \
--cc=dledford@redhat.com \
--cc=kuba@kernel.org \
--cc=kvm@vger.kernel.org \
--cc=kwankhede@nvidia.com \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=saeedm@nvidia.com \
--cc=yishaih@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).