* [PATCH V2 0/2] Virtqueue State Synchronization
@ 2021-07-06  4:33 Jason Wang
  2021-07-06  4:33 ` [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility Jason Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-06  4:33 UTC (permalink / raw)
  To: mst, jasowang, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, cohuck, eperezma, oren, shahafs, parav,
	bodong, amikheev, pasic

Hi All:

This is an updated version that implements virtqueue state
synchronization, which is a must for migration support.

The first patch introduces the virtqueue state as a new basic facility
of the virtio device. This is used by the driver to save and restore
virtqueue state. The state is split into an available state and a used
state to ease transport-specific implementations. The device is also
allowed to have its own device-specific way to save and restore extra
virtqueue state such as in-flight requests.

The second patch introduces a new status bit, STOP. This bit is used
by the driver to stop the device. The major difference from reset is
that STOP must preserve all the virtqueue state plus the device state.

A driver can then:

- Get the virtqueue state if STOP status bit is set
- Set the virtqueue state after FEATURE_OK but before DRIVER_OK

Device specific state synchronization could be built on top.
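
For illustration, the get/set contract above can be sketched against a
trivial in-memory device model (a sketch only; all identifiers below are
illustrative and not part of the proposal):

```c
#include <assert.h>
#include <stdint.h>

/* Toy status bits matching the proposal's numbering. */
#define STATUS_DRIVER_OK   0x04
#define STATUS_FEATURE_OK  0x08
#define STATUS_STOP        0x20

/* Trivial in-memory device model; a real driver would go through
 * a transport instead of these globals. */
static uint8_t  dev_status;
static uint16_t dev_vq_state;

/* Get the virtqueue state: only valid once STOP is observed. */
static uint16_t driver_get_vq_state(void)
{
        assert(dev_status & STATUS_STOP);
        return dev_vq_state;
}

/* Set the virtqueue state: after FEATURE_OK, before DRIVER_OK. */
static void driver_set_vq_state(uint16_t state)
{
        assert(dev_status & STATUS_FEATURE_OK);
        assert(!(dev_status & STATUS_DRIVER_OK));
        dev_vq_state = state;
}
```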

Please review.

Thanks

Changes from V1:

- introduce used_idx to synchronize the used state for the split
  virtqueue; this is needed since the used ring is only read, never
  written, by the driver.
- stick to the uni-directional state machine; this means the driver is
  forbidden to clear STOP, and the only way to 'resume' the device is
  to reset and re-initialize it. This simplifies the implementation
  and makes it work more like a vhost device.
- mandate the virtqueue state setting during device initialization
- clarify the steps that are required for saving the virtqueue state
- allow the device to have its own device-specific way to save and
  restore extra virtqueue state
- rename DEVICE_STOPPED to STOP
- various other tweaks

Jason Wang (2):
  virtio: introduce virtqueue state as basic facility
  virtio: introduce STOP status bit

 content.tex | 180 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 180 insertions(+)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06  4:33 [PATCH V2 0/2] Virtqueue State Synchronization Jason Wang
@ 2021-07-06  4:33 ` Jason Wang
  2021-07-06  9:32   ` Michael S. Tsirkin
  2021-07-06 12:27   ` [virtio-comment] " Cornelia Huck
  2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
  2021-07-12 10:12 ` [PATCH V2 0/2] Virtqueue State Synchronization Stefan Hajnoczi
  2 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-06  4:33 UTC (permalink / raw)
  To: mst, jasowang, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, cohuck, eperezma, oren, shahafs, parav,
	bodong, amikheev, pasic

This patch adds a new device facility to save and restore virtqueue
state. The virtqueue state is split into two parts:

- The available state: the state that is used for reading the next
  available buffer.
- The used state: the state that is used for making buffers used.

Note that there could be devices that are required to set and get the
requests that are being processed by the device. I leave such an API
to be device specific.

This facility could be used by both migration and device diagnostics.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/content.tex b/content.tex
index 620c0e2..8877b6f 100644
--- a/content.tex
+++ b/content.tex
@@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
 types. It is RECOMMENDED that devices generate version 4
 UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
 
+\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
+
+When VIRTIO_F_RING_STATE is negotiated, the driver can set and
+get the device internal virtqueue state through the following
+fields. The way to access those fields is transport specific.
+
+\subsection{\field{Available State} Field}
+
+The \field{Available State} field is two bytes for the driver to get
+or set the state that is used by the virtqueue to read the next
+available buffer.
+
+When VIRTIO_F_RING_PACKED is not negotiated, it contains:
+
+\begin{lstlisting}
+le16 {
+        last_avail_idx : 16;
+} avail_state;
+\end{lstlisting}
+
+The \field{last_avail_idx} field indicates where the device would read
+the next index from the virtqueue available ring (modulo the queue
+size). This starts at the value set by the driver, and increases.
+
+When VIRTIO_F_RING_PACKED is negotiated, it contains:
+
+\begin{lstlisting}
+le16 {
+        last_avail_idx : 15;
+        last_avail_wrap_counter : 1;
+} avail_state;
+\end{lstlisting}
+
+The \field{last_avail_idx} field indicates where the device would read for
+the next descriptor head from the descriptor ring. This starts at the
+value set by the driver and wraps around when reaching the end of the
+ring.
+
+The \field{last_avail_wrap_counter} field indicates the last Driver Ring
+Wrap Counter that is observed by the device. This starts at the value
+set by the driver, and is flipped when reaching the end of the ring.
+
+See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
+
+\subsection{\field{Used State} Field}
+
+The \field{Used State} field is two bytes for the driver to set and
+get the state used by the virtqueue to make buffers used.
+
+When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
+
+\begin{lstlisting}
+le16 {
+        used_idx : 16;
+} used_state;
+\end{lstlisting}
+
+The \field{used_idx} field indicates where the device would write the
+next used descriptor head to the used ring (modulo the queue size).
+This starts at the value set by the driver, and increases. It is easy
+to see that this is the initial value of \field{idx} in the used ring.
+
+See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}.
+
+When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
+
+\begin{lstlisting}
+le16 {
+        used_idx : 15;
+        used_wrap_counter : 1;
+} used_state;
+\end{lstlisting}
+
+The \field{used_idx} indicates where the device would write the next used
+descriptor head to the descriptor ring. This starts at the value set
+by the driver, and wraps around when reaching the end of the ring.
+
+\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
+at the value set by the driver, and is flipped when reaching the end
+of the ring.
+
+See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
+
+\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
+
+If VIRTIO_F_RING_STATE has been negotiated:
+\begin{itemize}
+\item A driver MUST NOT set the virtqueue state before setting the
+  FEATURE_OK status bit.
+\item A driver MUST NOT set the virtqueue state after setting the
+  DRIVER_OK status bit.
+\end{itemize}
+
+\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
+
+If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
+reads of and writes to the virtqueue state.
+
+If VIRTIO_F_RING_STATE has been negotiated:
+\begin{itemize}
+\item A device SHOULD ignore the write to the virtqueue state if the
+FEATURE_OK status bit is not set.
+\item A device SHOULD ignore the write to the virtqueue state if the
+DRIVER_OK status bit is set.
+\end{itemize}
+
+If VIRTIO_F_RING_STATE has been negotiated, a device MAY have its own
+device-specific way for the driver to set and get extra virtqueue
+state such as in-flight requests.
+
 \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
 
 We start with an overview of device initialization, then expand on the
@@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
    device, optional per-bus setup, reading and possibly writing the
    device's virtio configuration space, and population of virtqueues.
 
+\item\label{itm:General Initialization And Device Operation / Device
+  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per-virtqueue available state, the used state, and any device-specific virtqueue state.
+
 \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
    ``live''.
 \end{enumerate}
@@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
   transport specific.
   For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
 
+  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
+  can set and get the device internal virtqueue state.
+  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 115+ messages in thread
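
As an aside, the packed-ring state words defined in the patch above can
be packed and unpacked as follows (a sketch; the helper names are
illustrative, not spec-defined):

```c
#include <stdint.h>

/*
 * Sketch of the packed-ring state layout from the patch above: a
 * 16-bit little-endian word holding a 15-bit index in bits 0..14
 * and a wrap counter in bit 15.  The same layout serves both
 * avail_state and used_state.
 */
static inline uint16_t vq_state_pack(uint16_t idx, uint16_t wrap)
{
        return (uint16_t)((idx & 0x7fff) | ((wrap & 1) << 15));
}

static inline uint16_t vq_state_idx(uint16_t state)
{
        return state & 0x7fff; /* bits 0..14: last_avail_idx / used_idx */
}

static inline uint16_t vq_state_wrap(uint16_t state)
{
        return state >> 15;    /* bit 15: the wrap counter */
}
```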

* [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06  4:33 [PATCH V2 0/2] Virtqueue State Synchronization Jason Wang
  2021-07-06  4:33 ` [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility Jason Wang
@ 2021-07-06  4:33 ` Jason Wang
  2021-07-06  9:24   ` [virtio-comment] " Dr. David Alan Gilbert
                     ` (3 more replies)
  2021-07-12 10:12 ` [PATCH V2 0/2] Virtqueue State Synchronization Stefan Hajnoczi
  2 siblings, 4 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-06  4:33 UTC (permalink / raw)
  To: mst, jasowang, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, cohuck, eperezma, oren, shahafs, parav,
	bodong, amikheev, pasic

This patch introduces a new status bit STOP. This will be
used by the driver to stop the device in order to safely fetch the
device state (virtqueue state) from the device.

This is a must for supporting migration.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 66 insertions(+), 3 deletions(-)

diff --git a/content.tex b/content.tex
index 8877b6f..284ead0 100644
--- a/content.tex
+++ b/content.tex
@@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
   drive the device.
 
+\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
+  device has been stopped by the driver. This status bit differs from
+  reset in that the device state is preserved.
+
 \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
   an error from which it can't recover.
 \end{description}
@@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 recover by issuing a reset.
 \end{note}
 
+If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
+DRIVER_OK is not set.
+
+If VIRTIO_F_STOP has been negotiated, to stop a device the driver
+MUST, after setting STOP, re-read the device status until the STOP bit
+is set, in order to synchronize with the device.
+
 \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
 The device MUST initialize \field{device status} to 0 upon reset.
 
 The device MUST NOT consume buffers or send any used buffer
 notifications to the driver before DRIVER_OK.
 
+If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
+write of STOP.
+
+If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
+the device MUST finish any pending operations such as in-flight
+requests, or provide a device-specific way for the driver to save
+them, before setting the STOP status bit.
+
+If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
+buffers or send any used buffer notifications to the driver after
+STOP. The device MUST keep the configuration space unchanged and MUST
+NOT send configuration space change notifications to the driver after
+STOP.
+
+If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
+preserve all the necessary state (the virtqueue state plus any
+device-specific state) that is required for restoring the device in
+the future.
+
 \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
 that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
 MUST send a device configuration change notification to the driver.
@@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
 \begin{itemize}
 \item A driver MUST NOT set the virtqueue state before setting the
   FEATURE_OK status bit.
-\item A driver MUST NOT set the virtqueue state after setting the
-  DRIVER_OK status bit.
+\item A driver MUST NOT set the virtqueue state if the DRIVER_OK
+  status bit is set without the STOP status bit.
 \end{itemize}
 
 \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
@@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
 \item A device SHOULD ignore the write to the virtqueue state if the
 FEATURE_OK status bit is not set.
 \item A device SHOULD ignore the write to the virtqueue state if the
-DRIVER_OK status bit is set.
+DRIVER_OK status bit is set without the STOP status bit.
 \end{itemize}
 
 If VIRTIO_F_RING_STATE has been negotiated, a device MAY have its
@@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
 
 Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
 
+\section{Virtqueue State Saving}
+
+If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated, a
+driver MAY save the internal virtqueue state.
+
+\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
+
+Assuming the device is 'live', the driver MUST follow this sequence to
+stop the device and save the virtqueue state:
+
+\begin{enumerate}
+\item Set the STOP status bit.
+
+\item Re-read \field{device status} until the STOP bit is set to
+  synchronize with the device.
+
+\item Read \field{available state} and save it.
+
+\item Read \field{used state} and save it.
+
+\item Read device specific virtqueue states if needed.
+
+\item Reset the device.
+\end{enumerate}
+
+The driver MAY perform device-specific steps to save device-specific state.
+
+The driver MAY 'resume' the device by redoing the device initialization
+with the saved virtqueue state. See \ref{sec:General Initialization And Device Operation / Device Initialization}.
+
 \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
 
 Virtio can use various different buses, thus the standard is split
@@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
   can set and get the device internal virtqueue state.
   See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
 
+  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
+  stop the device.
+  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 115+ messages in thread
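
The stop-and-save sequence mandated above can be sketched against a toy
device model (a sketch under assumptions: all names are illustrative,
and the model latches STOP immediately, whereas a real device may first
need to drain in-flight requests):

```c
#include <stdint.h>

#define STATUS_STOP 0x20

/* Toy device model with one virtqueue's worth of state. */
struct toy_dev {
        uint8_t  status;
        uint16_t avail_state;
        uint16_t used_state;
};

struct saved_vq {
        uint16_t avail;
        uint16_t used;
};

static struct saved_vq stop_and_save(struct toy_dev *d)
{
        struct saved_vq s;

        d->status |= STATUS_STOP;          /* 1. set the STOP status bit */
        while (!(d->status & STATUS_STOP)) /* 2. re-read until STOP is set */
                ;
        s.avail = d->avail_state;          /* 3. read and save avail state */
        s.used  = d->used_state;           /* 4. read and save used state */
        /* 5. device-specific virtqueue state would be read here */
        d->status = 0;                     /* 6. reset the device */
        return s;
}
```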

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
@ 2021-07-06  9:24   ` Dr. David Alan Gilbert
  2021-07-07  3:20     ` Jason Wang
  2021-07-06 12:50   ` [virtio-comment] " Cornelia Huck
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 115+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-06  9:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, virtio-comment, virtio-dev, stefanha, mgurtovoy, cohuck,
	eperezma, oren, shahafs, parav, bodong, amikheev, pasic

* Jason Wang (jasowang@redhat.com) wrote:
> This patch introduces a new status bit STOP. This will be
> used by the driver to stop the device in order to safely fetch the
> device state (virtqueue state) from the device.
> 
> This is a must for supporting migration.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 66 insertions(+), 3 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 8877b6f..284ead0 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>    drive the device.
>  
> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device has been stopped by the driver. This status bit is different
> +  from the reset since the device state is preserved.
> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
>  \end{description}
> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
> +DRIVER_OK is not set.
> +
> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
> +STOP, the driver MUST re-read the device status to ensure the STOP bit
> +is set to synchronize with the device.
> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  The device MUST initialize \field{device status} to 0 upon reset.
>  
>  The device MUST NOT consume buffers or send any used buffer
>  notifications to the driver before DRIVER_OK.
>  
> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
> +write of STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
> +the device MUST finish any pending operations like in flight requests
> +or have its device specific way for driver to save the pending
> +operations like in flight requests before setting the STOP status bit.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
> +buffers or send any used buffer notifications to the driver after
> +STOP. The device MUST keep the configuration space unchanged and MUST
> +NOT send configuration space change notification to the driver after
> +STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
> +preserve all the necessary state (the virtqueue states with the
> +possible device specific states) that is required for restoring in the
> +future.
> +
>  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>  \begin{itemize}
>  \item A driver MUST NOT set the virtqueue state before setting the
>    FEATURE_OK status bit.
> -\item A driver MUST NOT set the virtqueue state after setting the
> -  DRIVER_OK status bit.
> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
> +  bit is set without STOP status bit.
>  \end{itemize}
>  
>  \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>  \item A device SHOULD ignore the write to the virtqueue state if the
>  FEATURE_OK status bit is not set.
>  \item A device SHOULD ignore the write to the virtqueue state if the
> -DRIVER_OK status bit is set.
> +DRIVER_OK status bit is set without STOP status bit.
>  \end{itemize}
>  
>  If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>  
>  Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>  
> +\section{Virtqueue State Saving}
> +
> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated. A
> +driver MAY save the internal virtqueue state.
> +
> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
> +
> +Assuming the device is 'live'.

Is that defined somewhere?  If I understand correctly, this is all
driven from the driver inside the guest, so for this to work
the guest must be running and already have initialised the driver.

> The driver MUST follow this sequence to
> +stop the device and save the virtqueue state:
> +
> +\begin{enumerate}
> +\item Set the STOP status bit.
> +
> +\item Re-read \field{device status} until the STOP bit is set to
> +  synchronize with the device.

At that point, is it guaranteed that the device has already stopped
and that all outstanding transactions have finished?

> +\item Read \field{available state} and save it.
> +
> +\item Read \field{used state} and save it.
> +
> +\item Read device specific virtqueue states if needed.
> +
> +\item Reset the device.

Say that a migration fails after this point; how does the source
recover?
Why is the 'reset the device' there?

> +\end{enumerate}
> +
> +The driver MAY perform device specific steps to save device specific sate.
> +
> +The driver MAY 'resume' the device by redoing the device initialization
> +with the saved virtqueue state. See ref\{sec:General Initialization and Device Operation / Device Initialization}
> +
>  \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>  
>  Virtio can use various different buses, thus the standard is split
> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    can set and get the device internal virtqueue state.
>    See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>  
> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> +  stop the device.
> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.25.1
> 
> 
> This publicly archived list offers a means to provide input to the
> OASIS Virtual I/O Device (VIRTIO) TC.
> 
> In order to verify user consent to the Feedback License terms and
> to minimize spam in the list archive, subscription is required
> before posting.
> 
> Subscribe: virtio-comment-subscribe@lists.oasis-open.org
> Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
> List help: virtio-comment-help@lists.oasis-open.org
> List archive: https://lists.oasis-open.org/archives/virtio-comment/
> Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
> List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
> Committee: https://www.oasis-open.org/committees/virtio/
> Join OASIS: https://www.oasis-open.org/join/
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06  4:33 ` [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility Jason Wang
@ 2021-07-06  9:32   ` Michael S. Tsirkin
  2021-07-06 17:09     ` Eugenio Perez Martin
  2021-07-06 12:27   ` [virtio-comment] " Cornelia Huck
  1 sibling, 1 reply; 115+ messages in thread
From: Michael S. Tsirkin @ 2021-07-06  9:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, virtio-dev, stefanha, mgurtovoy, cohuck,
	eperezma, oren, shahafs, parav, bodong, amikheev, pasic

On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
> This patch adds new device facility to save and restore virtqueue
> state. The virtqueue state is split into two parts:
> 
> - The available state: The state that is used for read the next
>   available buffer.
> - The used state: The state that is used for making buffer used.
> 
> Note that, there could be devices that is required to set and get the
> requests that are being processed by the device. I leave such API to
> be device specific.
> 
> This facility could be used by both migration and device diagnostic.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Hi Jason!
I feel that for use-cases such as SRIOV,
the facility to save/restore a vq should be part of a PF;
that is, there needs to be a way for one virtio device to
address the state of another one.

Thoughts?

> ---
>  content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 117 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index 620c0e2..8877b6f 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
>  types. It is RECOMMENDED that devices generate version 4
>  UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>  
> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
> +
> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
> +get the device internal virtqueue state through the following
> +fields. The way to access those fields is transport specific.
> +
> +\subsection{\field{Available State} Field}
> +
> +The \field{Available State} field is two bytes for the driver to get
> +or set the state that is used by the virtqueue to read for the next
> +available buffer.
> +
> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
> +
> +\begin{lstlisting}
> +le16 {
> +        last_avail_idx : 16;
> +} avail_state;
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field indicates where the device would read
> +for the next index from the virtqueue available ring(modulo the queue
> + size). This starts at the value set by the driver, and increases.
> +
> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
> +
> +\begin{lstlisting}
> +le16 {
> +        last_avail_idx : 15;
> +        last_avail_wrap_counter : 1;
> +} avail_state;
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field indicates where the device would read for
> +the next descriptor head from the descriptor ring. This starts at the
> +value set by the driver and wraps around when reaching the end of the
> +ring.
> +
> +The \field{last_avail_wrap_counter} field indicates the last Driver Ring
> +Wrap Counter that is observed by the device. This starts at the value
> +set by the driver, and is flipped when reaching the end of the ring.
> +
> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> +
> +\subsection{\field{Used State} Field}
> +
> +The \field{Used State} field is two bytes for the driver to set and
> +get the state used by the virtqueue to make buffer used.
> +
> +When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
> +
> +\begin{lstlisting}
> +le16 {
> +        used_idx : 16;
> +} used_state;
> +\end{lstlisting}
> +
> +The \field{used_idx} where the device would write the next used
> +descriptor head to the used ring (modulo the queue size). This starts
> +at the value set by the driver, and increases. It is easy to see this
> +is the initial value of the \field{idx} in the used ring.
> +
> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}
> +
> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
> +
> +\begin{lstlisting}
> +le16 {
> +        used_idx : 15;
> +        used_wrap_counter : 1;
> +} used_state;
> +\end{lstlisting}
> +
> +The \field{used_idx} indicates where the device would write the next used
> +descriptor head to the descriptor ring. This starts at the value set
> +by the driver, and warps around when reaching the end of the ring.
> +
> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
> +at the value set by the driver, and is flipped when reaching the end
> +of the ring.
> +
> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.


The above only fully describes the vq state if descriptors
are used in order, or at least if all out-of-order descriptors are
consumed at the time of save.

Adding such an option to devices such as net later will need extra
spec work.


> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> +
> +If VIRTIO_F_RING_STATE has been negotiated:
> +\begin{itemize}
> +\item A driver MUST NOT set the virtqueue state before setting the
> +  FEATURE_OK status bit.
> +\item A driver MUST NOT set the virtqueue state after setting the
> +  DRIVER_OK status bit.
> +\end{itemize}
> +
> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> +
> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ingore
> +the read and write to the virtqueue state.
> +
> +If VIRTIO_F_RING_STATE has been negotiated:
> +\begin{itemize}
> +\item A device SHOULD ignore the write to the virtqueue state if the
> +FEATURE_OK status bit is not set.
> +\item A device SHOULD ignore the write to the virtqueue state if the
> +DRIVER_OK status bit is set.
> +\end{itemize}
> +
> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its


may have?
should also go into a normative section

> +device-specific way for the driver to set and get extra virtqueue
> +states such as in flight requests.
> +
>  \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
>  
>  We start with an overview of device initialization, then expand on the
> @@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
>     device, optional per-bus setup, reading and possibly writing the
>     device's virtio configuration space, and population of virtqueues.
>  
> +\item\label{itm:General Initialization And Device Operation / Device
> +  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per virtqueue available state, used state and the possible device specific virtqueue state.
> +
>  \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
>     ``live''.
>  \end{enumerate}
> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    transport specific.
>    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>  
> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
> +  can set and get the device internal virtqueue state.
> +  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.25.1


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06  4:33 ` [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility Jason Wang
  2021-07-06  9:32   ` Michael S. Tsirkin
@ 2021-07-06 12:27   ` Cornelia Huck
  2021-07-07  3:29     ` [virtio-dev] " Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-06 12:27 UTC (permalink / raw)
  To: Jason Wang, mst, jasowang, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic

On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:

> This patch adds new device facility to save and restore virtqueue
> state. The virtqueue state is split into two parts:
>
> - The available state: The state that is used for reading the next
>   available buffer.
> - The used state: The state that is used for making a buffer used.
>
> Note that there could be devices that are required to set and get the
> requests that are being processed by the device. I leave such an API to
> be device specific.
>
> This facility could be used by both migration and device diagnostic.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 117 insertions(+)

> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> +
> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
> +the read and write to the virtqueue state.
> +
> +If VIRTIO_F_RING_STATE has been negotiated:
> +\begin{itemize}
> +\item A device SHOULD ignore the write to the virtqueue state if the
> +FEATURE_OK status bit is not set.
> +\item A device SHOULD ignore the write to the virtqueue state if the
> +DRIVER_OK status bit is set.
> +\end{itemize}
> +
> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
> +device-specific way for the driver to set and get extra virtqueue
> +states such as in flight requests.

Maybe better

"If VIRTIO_F_RING_STATE has been negotiated, a device MAY provide a
device-specific mechanism to set and get extra virtqueue states such as
in flight requests."

If a device type supports this facility, does it imply that it is always
present when VIRTIO_RING_STATE has been negotiated? I guess it could
define further device-specific features to make it more configurable.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* [virtio-comment] Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
  2021-07-06  9:24   ` [virtio-comment] " Dr. David Alan Gilbert
@ 2021-07-06 12:50   ` Cornelia Huck
  2021-07-06 13:18     ` Jason Wang
  2021-07-09 17:35   ` Eugenio Perez Martin
  2021-07-10 20:40   ` Michael S. Tsirkin
  3 siblings, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-06 12:50 UTC (permalink / raw)
  To: Jason Wang, mst, jasowang, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic

On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:

> This patch introduces a new status bit STOP. This will be
> used by the driver to stop the device in order to safely fetch the
> device state (virtqueue state) from the device.
>
> This is a must for supporting migration.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 66 insertions(+), 3 deletions(-)
>
> diff --git a/content.tex b/content.tex
> index 8877b6f..284ead0 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>    drive the device.
>  
> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device has been stopped by the driver. This status bit is different
> +  from reset since the device state is preserved.
> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
>  \end{description}
> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
> +DRIVER_OK is not set.
> +
> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
> +STOP, the driver MUST re-read the device status to ensure the STOP bit
> +is set to synchronize with the device.

Is this more that the driver needs to re-read the status until STOP is
set to make sure that the stop process has finished? If the device has
offered the feature and the driver accepted it, I'd assume that the
device will eventually finish with the procedure, or sets NEEDS_RESET if
something goes wrong?
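The synchronization being discussed, where the driver re-reads the status until the device reports STOP, amounts to a polling loop. A toy model follows; STOP's value and the device behavior are taken from this proposal, not from the published spec:

```c
#include <stdint.h>

#define VIRTIO_DRIVER_OK 4
#define VIRTIO_STOP      32   /* proposed bit, not in the published spec */

/* Toy device: it latches STOP only after draining its in-flight
 * requests, so the driver observes the bit asynchronously. */
struct toy_dev {
    uint8_t status;
    int stop_requested;
    int pending;              /* in-flight requests still to drain */
};

static void toy_write_status(struct toy_dev *d, uint8_t v)
{
    if (v & VIRTIO_STOP)
        d->stop_requested = 1;
}

static uint8_t toy_read_status(struct toy_dev *d)
{
    if (d->stop_requested && d->pending > 0)
        d->pending--;                 /* device finishes one request */
    if (d->stop_requested && d->pending == 0)
        d->status |= VIRTIO_STOP;     /* stop has completed */
    return d->status;
}

/* Driver side: write STOP, then re-read until the device reports it.
 * Returns the number of re-reads that did not yet see STOP. */
static int toy_driver_stop(struct toy_dev *d)
{
    int polls = 0;
    toy_write_status(d, d->status | VIRTIO_STOP);
    while (!(toy_read_status(d) & VIRTIO_STOP))
        polls++;
    return polls;
}
```

A real driver would bound this loop with a timeout or an error signal such as NEEDS_RESET, which is exactly the open question raised here.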

> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  The device MUST initialize \field{device status} to 0 upon reset.
>  
>  The device MUST NOT consume buffers or send any used buffer
>  notifications to the driver before DRIVER_OK.
>  
> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
> +write of STOP.

Maybe use "If VIRTIO_F_STOP has been negotiated:" and put the next three
points in a list? Might read a bit better.

> +
> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
> +the device MUST finish any pending operations like in flight requests
> +or have its device specific way for the driver to save the pending
> +operations like in flight requests before setting the STOP status bit.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
> +buffers or send any used buffer notifications to the driver after
> +STOP. The device MUST keep the configuration space unchanged and MUST
> +NOT send configuration space change notification to the driver after
> +STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
> +preserve all the necessary state (the virtqueue states with the
> +possible device specific states) that is required for restoring in the
> +future.

What happens if the driver writes STOP in when DRIVER_OK has not been
set? Should the device set NEEDS_RESET, as suggested above? Same, if
saving the states somehow goes wrong?

> +
>  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>  \begin{itemize}
>  \item A driver MUST NOT set the virtqueue state before setting the
>    FEATURE_OK status bit.
> -\item A driver MUST NOT set the virtqueue state after setting the
> -  DRIVER_OK status bit.
> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
> +  bit is set without STOP status bit.
>  \end{itemize}
>  
>  \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>  \item A device SHOULD ignore the write to the virtqueue state if the
>  FEATURE_OK status bit is not set.
>  \item A device SHOULD ignore the write to the virtqueue state if the
> -DRIVER_OK status bit is set.
> +DRIVER_OK status bit is set without STOP status bit.
>  \end{itemize}
>  
>  If VIRTIO_F_RING_STATE has been negotiated, a device MAY have its
> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>  
>  Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>  
> +\section{Virtqueue State Saving}
> +
> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated, a
> +driver MAY save the internal virtqueue state.

Is that device type specific, or something generic? The last patch
suggests that it may vary by device type.

> +
> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
> +
> +Assuming the device is 'live', the driver MUST follow this sequence to
> +stop the device and save the virtqueue state:
> +
> +\begin{enumerate}
> +\item Set the STOP status bit.
> +
> +\item Re-read \field{device status} until the STOP bit is set to
> +  synchronize with the device.
> +
> +\item Read \field{available state} and save it.
> +
> +\item Read \field{used state} and save it.
> +
> +\item Read device specific virtqueue states if needed.
> +
> +\item Reset the device.
> +\end{enumerate}
> +
> +The driver MAY perform device specific steps to save device specific state.
> +
> +The driver MAY 'resume' the device by redoing the device initialization
> +with the saved virtqueue state. See \ref{sec:General Initialization And Device Operation / Device Initialization}.
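The six steps above can be condensed into one driver-side routine. A sketch against a toy register file (the names are illustrative; in this model the device acknowledges STOP instantly, so step 2 succeeds on the first re-read):

```c
#include <stdint.h>

#define VIRTIO_STOP 32   /* proposed status bit */

struct toy_regs {
    uint8_t  status;
    uint16_t avail_state;
    uint16_t used_state;
};

struct saved_vq {
    uint16_t avail_state;
    uint16_t used_state;
};

/* Stop the device and save the virtqueue state, following the
 * quoted sequence. */
static struct saved_vq toy_save_vq_state(struct toy_regs *r)
{
    struct saved_vq s;

    r->status |= VIRTIO_STOP;           /* 1: set the STOP status bit */
    while (!(r->status & VIRTIO_STOP))  /* 2: re-read until STOP is set */
        ;
    s.avail_state = r->avail_state;     /* 3: save available state */
    s.used_state  = r->used_state;      /* 4: save used state */
    /* 5: device specific virtqueue states would be read here */
    r->status = 0;                      /* 6: reset the device */
    return s;
}
```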
> +
>  \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>  
>  Virtio can use various different buses, thus the standard is split
> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    can set and get the device internal virtqueue state.
>    See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>  
> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> +  stop the device.
> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}





* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06 12:50   ` [virtio-comment] " Cornelia Huck
@ 2021-07-06 13:18     ` Jason Wang
  2021-07-06 14:27       ` [virtio-dev] " Cornelia Huck
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-06 13:18 UTC (permalink / raw)
  To: Cornelia Huck, mst, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic


在 2021/7/6 下午8:50, Cornelia Huck 写道:
> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> This patch introduces a new status bit STOP. This will be
>> used by the driver to stop the device in order to safely fetch the
>> device state (virtqueue state) from the device.
>>
>> This is a must for supporting migration.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 66 insertions(+), 3 deletions(-)
>>
>> diff --git a/content.tex b/content.tex
>> index 8877b6f..284ead0 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>>     drive the device.
>>   
>> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
>> +  device has been stopped by the driver. This status bit is different
>> +  from the reset since the device state is preserved.
>> +
>>   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>>     an error from which it can't recover.
>>   \end{description}
>> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   recover by issuing a reset.
>>   \end{note}
>>   
>> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
>> +DRIVER_OK is not set.
>> +
>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>> +is set to synchronize with the device.
> Is this more that the driver needs to re-read the status until STOP is
> set to make sure that the stop process has finished?


Yes.


>   If the device has
> offered the feature and the driver accepted it, I'd assume that the
> device will eventually finish with the procedure, or sets NEEDS_RESET if
> something goes wrong?


As stated below, the device must either:

1) finish all pending requests

or

2) provide a device specific way for the driver to save and restore 
pending requests

before setting STOP.

Otherwise the device can't offer this feature.

Using NEEDS_RESET seems more complicated than this.


>
>> +
>>   \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>>   The device MUST initialize \field{device status} to 0 upon reset.
>>   
>>   The device MUST NOT consume buffers or send any used buffer
>>   notifications to the driver before DRIVER_OK.
>>   
>> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
>> +write of STOP.
> Maybe use "If VIRTIO_F_STOP has been negotiated:" and put the next three
> points in a list? Might read a bit better.


That's fine.


>
>> +
>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>> +the device MUST finish any pending operations like in flight requests
>> +or have its device specific way for driver to save the pending
>> +operations like in flight requests before setting the STOP status bit.
>> +
>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>> +buffers or send any used buffer notifications to the driver after
>> +STOP. The device MUST keep the configuration space unchanged and MUST
>> +NOT send configuration space change notification to the driver after
>> +STOP.
>> +
>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>> +preserve all the necessary state (the virtqueue states with the
>> +possible device specific states) that is required for restoring in the
>> +future.
> What happens if the driver writes STOP in when DRIVER_OK has not been
> set?


I think we need a device normative like:

If VIRTIO_F_STOP has been negotiated, the driver SHOULD ignore the STOP 
status bit if DRIVER_OK is not set.


>   Should the device set NEEDS_RESET, as suggested above? Same, if
> saving the states somehow goes wrong?


I try hard to avoid NEEDS_RESET, so the driver is required to only read 
the state during DRIVER_OK & STOP, and set the state during FEATURES_OK 
& !DRIVER_OK. This is described in the driver normative in patch 1 and 
below.
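The two windows described here, getting state during DRIVER_OK & STOP and setting it during FEATURES_OK & !DRIVER_OK, can be written as status-bit predicates. Bit values follow the spec plus this proposal:

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_FEATURES_OK 8
#define VIRTIO_DRIVER_OK   4
#define VIRTIO_STOP        32  /* proposed */

/* State may be read only while the device is stopped. */
static bool may_get_vq_state(uint8_t status)
{
    return (status & VIRTIO_DRIVER_OK) && (status & VIRTIO_STOP);
}

/* State may be written only after feature negotiation and before
 * the device goes live. */
static bool may_set_vq_state(uint8_t status)
{
    return (status & VIRTIO_FEATURES_OK) && !(status & VIRTIO_DRIVER_OK);
}
```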


>
>> +
>>   \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>>   that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>>   MUST send a device configuration change notification to the driver.
>> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>>   \begin{itemize}
>>   \item A driver MUST NOT set the virtqueue state before setting the
>>     FEATURE_OK status bit.
>> -\item A driver MUST NOT set the virtqueue state after setting the
>> -  DRIVER_OK status bit.
>> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
>> +  bit is set without STOP status bit.
>>   \end{itemize}
>>   
>>   \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>>   \item A device SHOULD ignore the write to the virtqueue state if the
>>   FEATURE_OK status bit is not set.
>>   \item A device SHOULD ignore the write to the virtqueue state if the
>> -DRIVER_OK status bit is set.
>> +DRIVER_OK status bit is set without STOP status bit.
>>   \end{itemize}
>>   
>>   If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>>   
>>   Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>>   
>> +\section{Virtqueue State Saving}
>> +
>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated. A
>> +driver MAY save the internal virtqueue state.
> Is that device type specific, or something generic? The last patch
> suggests that it may vary by device type.


Both virtqueue state (avail/used state) and the STOP status bit is generic.

But the device is free to have its own specific:

1) extra virtqueue states (pending requests)
2) device states

And 2) is out of the scope of this series.

Thanks


>
>> +
>> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
>> +
>> +Assuming the device is 'live'. The driver MUST follow this sequence to
>> +stop the device and save the virtqueue state:
>> +
>> +\begin{enumerate}
>> +\item Set the STOP status bit.
>> +
>> +\item Re-read \field{device status} until the STOP bit is set to
>> +  synchronize with the device.
>> +
>> +\item Read \field{available state} and save it.
>> +
>> +\item Read \field{used state} and save it.
>> +
>> +\item Read device specific virtqueue states if needed.
>> +
>> +\item Reset the device.
>> +\end{enumerate}
>> +
>> +The driver MAY perform device specific steps to save device specific sate.
>> +
>> +The driver MAY 'resume' the device by redoing the device initialization
>> +with the saved virtqueue state. See ref\{sec:General Initialization and Device Operation / Device Initialization}
>> +
>>   \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>>   
>>   Virtio can use various different buses, thus the standard is split
>> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>     can set and get the device internal virtqueue state.
>>     See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>   
>> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
>> +  stop the device.
>> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>>   \end{description}
>>   
>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}



* [virtio-dev] Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06 13:18     ` Jason Wang
@ 2021-07-06 14:27       ` Cornelia Huck
  2021-07-07  0:05         ` Max Gurtovoy
  2021-07-07  2:56         ` Jason Wang
  0 siblings, 2 replies; 115+ messages in thread
From: Cornelia Huck @ 2021-07-06 14:27 UTC (permalink / raw)
  To: Jason Wang, mst, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic

On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:

> 在 2021/7/6 下午8:50, Cornelia Huck 写道:
>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:

>>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>>> +is set to synchronize with the device.
>> Is this more that the driver needs to re-read the status until STOP is
>> set to make sure that the stop process has finished?
>
>
> Yes.
>
>
>>   If the device has
>> offered the feature and the driver accepted it, I'd assume that the
>> device will eventually finish with the procedure, or sets NEEDS_RESET if
>> something goes wrong?
>
>
> As stated below, the device must either:
>
> 1) finish all pending requests
>
> or
>
> 2) provide a device specific way for the driver to save and restore 
> pending requests
>
> before setting STOP.
>
> Otherwise the device can't offer this feature.
>
> Using NEEDS_RESET seems more complicated than this.

Hm, what happens on an internal error? I assume that the device would
need to signal that in some way. Or should it simply set STOP and
discard any pending requests? The driver would not be able to
distinguish that from a normal STOP.

>>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>>> +the device MUST finish any pending operations like in flight requests
>>> +or have its device specific way for driver to save the pending
>>> +operations like in flight requests before setting the STOP status bit.
>>> +
>>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>>> +buffers or send any used buffer notifications to the driver after
>>> +STOP. The device MUST keep the configuration space unchanged and MUST
>>> +NOT send configuration space change notification to the driver after
>>> +STOP.
>>> +
>>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>>> +preserve all the necessary state (the virtqueue states with the
>>> +possible device specific states) that is required for restoring in the
>>> +future.
>> What happens if the driver writes STOP in when DRIVER_OK has not been
>> set?
>
>
> I think we need a device normative like:
>
> If VIRTIO_F_STOP has been negotiated, the driver SHOULD ignore the STOP 
> status bit if DRIVER_OK is not set.

That's the device that needs to do the ignoring, right?

>
>
>>   Should the device set NEEDS_RESET, as suggested above? Same, if
>> saving the states somehow goes wrong?
>
>
> I try hard to avoid NEEDS_RESET, so the driver is required to only read 
> the state during DRIVER_OK & STOP, and set the state during FEATURES_OK 
> & !DRIVER_OK. This is described in the driver normative in patch 1 and 
> below.

The device can certainly ignore STOP requests that are out of spec. But
I think we cannot get around signaling device errors in some way.

>>> +\section{Virtqueue State Saving}
>>> +
>>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated. A
>>> +driver MAY save the internal virtqueue state.
>> Is that device type specific, or something generic? The last patch
>> suggests that it may vary by device type.
>
>
> Both virtqueue state (avail/used state) and the STOP status bit is generic.
>
> But the device is free to have its own specific:
>
> 1) extra virtqueue states (pending requests)
> 2) device states
>
> And 2) is out of the scope of this series.

Ok.


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org



* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06  9:32   ` Michael S. Tsirkin
@ 2021-07-06 17:09     ` Eugenio Perez Martin
  2021-07-06 19:08       ` Michael S. Tsirkin
  2021-07-07  2:41       ` Jason Wang
  0 siblings, 2 replies; 115+ messages in thread
From: Eugenio Perez Martin @ 2021-07-06 17:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Max Gurtovoy, Cornelia Huck, Oren Duer, Shahaf Shuler,
	Parav Pandit, Bodong Wang, Alexander Mikheev, Halil Pasic

On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
> > This patch adds new device facility to save and restore virtqueue
> > state. The virtqueue state is split into two parts:
> >
> > - The available state: The state that is used for read the next
> >   available buffer.
> > - The used state: The state that is used for making buffer used.
> >
> > Note that, there could be devices that is required to set and get the
> > requests that are being processed by the device. I leave such API to
> > be device specific.
> >
> > This facility could be used by both migration and device diagnostic.
> >
> > Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> Hi Jason!
> I feel that for use-cases such as SRIOV,
> the facility to save/restore vq should be part of a PF
> that is there needs to be a way for one virtio device to
> address the state of another one.
>

Hi!

In my opinion we should go the other way around: To make features as
orthogonal/independent as possible, and just make them work together
if we have to. In this particular case, I think it should be easier to
decide how to report status, its needs, etc for a VF, and then open
the possibility for the PF to query or set them, reusing format,
behavior, etc. as much as possible.

I think that the most controversial point about doing it the non-SR-IOV
way is the exposing of these features/fields to the guest using
specific transport facilities, like PCI common config. However, I think
it should not be hard for the hypervisor to intercept them and even to
expose them conditionally. Please correct me if this guess is not right
and you had other concerns.

> Thoughts?
>
> > ---
> >  content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 117 insertions(+)
> >
> > diff --git a/content.tex b/content.tex
> > index 620c0e2..8877b6f 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
> >  types. It is RECOMMENDED that devices generate version 4
> >  UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
> >
> > +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
> > +
> > +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
> > +get the device internal virtqueue state through the following
> > +fields. The way to access those fields is transport specific.
> > +
> > +\subsection{\field{Available State} Field}
> > +
> > +The \field{Available State} field is two bytes for the driver to get
> > +or set the state that is used by the virtqueue to read for the next
> > +available buffer.
> > +
> > +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
> > +
> > +\begin{lstlisting}
> > +le16 {
> > +        last_avail_idx : 16;
> > +} avail_state;
> > +\end{lstlisting}
> > +
> > +The \field{last_avail_idx} field indicates where the device would read
> > +for the next index from the virtqueue available ring (modulo the
> > +queue size). This starts at the value set by the driver, and increases.
> > +
> > +When VIRTIO_F_RING_PACKED is negotiated, it contains:
> > +
> > +\begin{lstlisting}
> > +le16 {
> > +        last_avail_idx : 15;
> > +        last_avail_wrap_counter : 1;
> > +} avail_state;
> > +\end{lstlisting}
> > +
> > +The \field{last_avail_idx} field indicates where the device would read for
> > +the next descriptor head from the descriptor ring. This starts at the
> > +value set by the driver and wraps around when reaching the end of the
> > +ring.
> > +
> > +The \field{last_avail_wrap_counter} field indicates the last Driver Ring
> > +Wrap Counter that is observed by the device. This starts at the value
> > +set by the driver, and is flipped when reaching the end of the ring.
> > +
> > +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> > +
> > +\subsection{\field{Used State} Field}
> > +
> > +The \field{Used State} field is two bytes for the driver to set and
> > +get the state used by the virtqueue to make buffer used.
> > +
> > +When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
> > +
> > +\begin{lstlisting}
> > +le16 {
> > +        used_idx : 16;
> > +} used_state;
> > +\end{lstlisting}
> > +
> > +The \field{used_idx} field indicates where the device would write the next used
> > +descriptor head to the used ring (modulo the queue size). This starts
> > +at the value set by the driver, and increases. It is easy to see this
> > +is the initial value of the \field{idx} in the used ring.
> > +
> > +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}
> > +
> > +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
> > +
> > +\begin{lstlisting}
> > +le16 {
> > +        used_idx : 15;
> > +        used_wrap_counter : 1;
> > +} used_state;
> > +\end{lstlisting}
> > +
> > +The \field{used_idx} indicates where the device would write the next used
> > +descriptor head to the descriptor ring. This starts at the value set
> > +by the driver, and wraps around when reaching the end of the ring.
> > +
> > +\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
> > +at the value set by the driver, and is flipped when reaching the end
> > +of the ring.
> > +
> > +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
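The two packed-ring state words quoted above share one layout: a 15-bit index in the low bits and the wrap counter in bit 15. Illustrative helpers for that layout (on the wire the field is little-endian; these operate on the host-order value after conversion):

```c
#include <stdint.h>

/* Pack a packed-ring state word: low 15 bits carry the index, bit 15
 * carries the ring wrap counter. */
static uint16_t vq_state_pack(uint16_t idx, int wrap)
{
    return (uint16_t)((idx & 0x7fff) | ((wrap & 1) << 15));
}

static uint16_t vq_state_idx(uint16_t word)
{
    return word & 0x7fff;
}

static int vq_state_wrap(uint16_t word)
{
    return word >> 15;
}
```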
>
>
> Above only fully describes the vq state if descriptors
> are used in order or at least all out of order descriptors are consumed
> at time of save.
>

I think that the most straightforward solution would be to add
something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
part.

Thanks!
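A mechanism like the one pointed to here could, for example, expose a per-descriptor in-flight table that the driver saves alongside the ring state. The following sketch is hypothetical; the layout is invented for illustration and vhost-user's actual inflight region differs:

```c
#include <stdint.h>
#include <string.h>

#define TOY_QUEUE_SIZE 8   /* hypothetical queue size */

/* One flag per descriptor head: set while the request is owned by the
 * device, cleared when the request completes. */
struct toy_inflight {
    uint8_t inflight[TOY_QUEUE_SIZE];
};

static void toy_inflight_submit(struct toy_inflight *t, uint16_t head)
{
    t->inflight[head % TOY_QUEUE_SIZE] = 1;
}

static void toy_inflight_complete(struct toy_inflight *t, uint16_t head)
{
    t->inflight[head % TOY_QUEUE_SIZE] = 0;
}

/* After restore, the driver would re-submit every descriptor that was
 * still marked in flight at save time. */
static int toy_inflight_count(const struct toy_inflight *t)
{
    int n = 0;
    for (int i = 0; i < TOY_QUEUE_SIZE; i++)
        n += t->inflight[i];
    return n;
}
```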

> Adding later option to devices such as net will need extra spec work.
>
>
> > +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> > +
> > +If VIRTIO_F_RING_STATE has been negotiated:
> > +\begin{itemize}
> > +\item A driver MUST NOT set the virtqueue state before setting the
> > +  FEATURE_OK status bit.
> > +\item A driver MUST NOT set the virtqueue state after setting the
> > +  DRIVER_OK status bit.
> > +\end{itemize}
> > +
> > +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> > +
> > +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ingore
> > +the read and write to the virtqueue state.
> > +
> > +If VIRTIO_F_RING_STATE has been negotiated:
> > +\begin{itemize}
> > +\item A device SHOULD ignore the write to the virtqueue state if the
> > +FEATURE_OK status bit is not set.
> > +\item A device SHOULD ignore the write to the virtqueue state if the
> > +DRIVER_OK status bit is set.
> > +\end{itemize}
> > +
> > +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>
>
> may have?
> should also go into a normative section
>
> > +device-specific way for the driver to set and get extra virtqueue
> > +states such as in flight requests.
> > +
> >  \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
> >
> >  We start with an overview of device initialization, then expand on the
> > @@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
> >     device, optional per-bus setup, reading and possibly writing the
> >     device's virtio configuration space, and population of virtqueues.
> >
> > +\item\label{itm:General Initialization And Device Operation / Device
> > +  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per virtqueue available state, used state and the possible device specific virtqueue state.
> > +
> >  \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
> >     ``live''.
> >  \end{enumerate}
> > @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> >    transport specific.
> >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> >
> > +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
> > +  can set and get the device internal virtqueue state.
> > +  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
> > +
> >  \end{description}
> >
> >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > --
> > 2.25.1
>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06 17:09     ` Eugenio Perez Martin
@ 2021-07-06 19:08       ` Michael S. Tsirkin
  2021-07-06 23:49         ` Max Gurtovoy
  2021-07-07  2:42         ` Jason Wang
  2021-07-07  2:41       ` Jason Wang
  1 sibling, 2 replies; 115+ messages in thread
From: Michael S. Tsirkin @ 2021-07-06 19:08 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Max Gurtovoy, Cornelia Huck, Oren Duer, Shahaf Shuler,
	Parav Pandit, Bodong Wang, Alexander Mikheev, Halil Pasic

On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
> > > This patch adds a new device facility to save and restore virtqueue
> > > state. The virtqueue state is split into two parts:
> > >
> > > - The available state: The state that is used for reading the next
> > >   available buffer.
> > > - The used state: The state that is used for making buffers used.
> > >
> > > Note that there could be devices that are required to set and get the
> > > requests that are being processed by the device. I leave such an API
> > > to be device specific.
> > >
> > > This facility could be used by both migration and device diagnostic.
> > >
> > > Signed-off-by: Jason Wang <jasowang@redhat.com>
> >
> > Hi Jason!
> > I feel that for use-cases such as SRIOV,
> > the facility to save/restore vq should be part of a PF
> > that is there needs to be a way for one virtio device to
> > address the state of another one.
> >
> 
> Hi!
> 
> In my opinion we should go the other way around: To make features as
> orthogonal/independent as possible, and just make them work together
> if we have to. In this particular case, I think it should be easier to
> decide how to report status, its needs, etc for a VF, and then open
> the possibility for the PF to query or set them, reusing format,
> behavior, etc. as much as possible.
> 
> I think that the most controversial point about doing it non-SR IOV
> way is the exposing of these features/fields to the guest using
> specific transport facilities, like PCI common config. However I think
> it should not be hard for the hypervisor to intercept them and even to
> expose them conditionally. Please correct me if this guessing was not
> right and you had other concerns.


Possibly. I'd like to see some guidance on how this all will work
in practice then. Maybe make it all part of a non-normative section
for now.
I think that the feature itself is not very useful outside of
migration so we don't really gain much by adding it as is
without all the other missing pieces.
I would say let's see more of the whole picture before we commit.



> > Thoughts?
> >
> > > ---
> > >  content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 117 insertions(+)
> > >
> > > diff --git a/content.tex b/content.tex
> > > index 620c0e2..8877b6f 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
> > >  types. It is RECOMMENDED that devices generate version 4
> > >  UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
> > >
> > > +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
> > > +
> > > +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
> > > +get the device internal virtqueue state through the following
> > > +fields. The way to access those fields is transport specific.
> > > +
> > > +\subsection{\field{Available State} Field}
> > > +
> > > +The \field{Available State} field is two bytes for the driver to get
> > > +or set the state that is used by the virtqueue to read the next
> > > +available buffer.
> > > +
> > > +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
> > > +
> > > +\begin{lstlisting}
> > > +le16 {
> > > +        last_avail_idx : 16;
> > > +} avail_state;
> > > +\end{lstlisting}
> > > +
> > > +The \field{last_avail_idx} field indicates where the device would read
> > > +the next index from the virtqueue available ring (modulo the queue
> > > +size). This starts at the value set by the driver, and increases.
> > > +
> > > +When VIRTIO_F_RING_PACKED is negotiated, it contains:
> > > +
> > > +\begin{lstlisting}
> > > +le16 {
> > > +        last_avail_idx : 15;
> > > +        last_avail_wrap_counter : 1;
> > > +} avail_state;
> > > +\end{lstlisting}
> > > +
> > > +The \field{last_avail_idx} field indicates where the device would read for
> > > +the next descriptor head from the descriptor ring. This starts at the
> > > +value set by the driver and wraps around when reaching the end of the
> > > +ring.
> > > +
> > > +The \field{last_avail_wrap_counter} field indicates the last Driver Ring
> > > +Wrap Counter that is observed by the device. This starts at the value
> > > +set by the driver, and is flipped when reaching the end of the ring.
> > > +
> > > +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> > > +
> > > +\subsection{\field{Used State} Field}
> > > +
> > > +The \field{Used State} field is two bytes for the driver to set and
> > > +get the state used by the virtqueue to make buffers used.
> > > +
> > > +When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
> > > +
> > > +\begin{lstlisting}
> > > +le16 {
> > > +        used_idx : 16;
> > > +} used_state;
> > > +\end{lstlisting}
> > > +
> > > +The \field{used_idx} field indicates where the device would write the next used
> > > +descriptor head to the used ring (modulo the queue size). This starts
> > > +at the value set by the driver, and increases. It is easy to see this
> > > +is the initial value of the \field{idx} in the used ring.
> > > +
> > > +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}
> > > +
> > > +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
> > > +
> > > +\begin{lstlisting}
> > > +le16 {
> > > +        used_idx : 15;
> > > +        used_wrap_counter : 1;
> > > +} used_state;
> > > +\end{lstlisting}
> > > +
> > > +The \field{used_idx} indicates where the device would write the next used
> > > +descriptor head to the descriptor ring. This starts at the value set
> > > +by the driver, and wraps around when reaching the end of the ring.
> > > +
> > > +\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
> > > +at the value set by the driver, and is flipped when reaching the end
> > > +of the ring.
> > > +
> > > +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> >
> >
> > Above only fully describes the vq state if descriptors
> > are used in order or at least all out of order descriptors are consumed
> > at time of save.
> >
> 
> I think that the most straightforward solution would be to add
> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
> part.
> 
> Thanks!
> 
> > Adding later option to devices such as net will need extra spec work.
> >
> >
> > > +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> > > +
> > > +If VIRTIO_F_RING_STATE has been negotiated:
> > > +\begin{itemize}
> > > +\item A driver MUST NOT set the virtqueue state before setting the
> > > +  FEATURE_OK status bit.
> > > +\item A driver MUST NOT set the virtqueue state after setting the
> > > +  DRIVER_OK status bit.
> > > +\end{itemize}
> > > +
> > > +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> > > +
> > > +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
> > > +the read and write to the virtqueue state.
> > > +
> > > +If VIRTIO_F_RING_STATE has been negotiated:
> > > +\begin{itemize}
> > > +\item A device SHOULD ignore the write to the virtqueue state if the
> > > +FEATURE_OK status bit is not set.
> > > +\item A device SHOULD ignore the write to the virtqueue state if the
> > > +DRIVER_OK status bit is set.
> > > +\end{itemize}
> > > +
> > > +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
> >
> >
> > may have?
> > should also go into a normative section
> >
> > > +device-specific way for the driver to set and get extra virtqueue
> > > +states such as in flight requests.
> > > +
> > >  \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
> > >
> > >  We start with an overview of device initialization, then expand on the
> > > @@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
> > >     device, optional per-bus setup, reading and possibly writing the
> > >     device's virtio configuration space, and population of virtqueues.
> > >
> > > +\item\label{itm:General Initialization And Device Operation / Device
> > > +  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per virtqueue available state, used state and the possible device specific virtqueue state.
> > > +
> > >  \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
> > >     ``live''.
> > >  \end{enumerate}
> > > @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
> > >    transport specific.
> > >    For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
> > >
> > > +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
> > > +  can set and get the device internal virtqueue state.
> > > +  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
> > > +
> > >  \end{description}
> > >
> > >  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> > > --
> > > 2.25.1
> >


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06 19:08       ` Michael S. Tsirkin
@ 2021-07-06 23:49         ` Max Gurtovoy
  2021-07-07  2:50           ` Jason Wang
  2021-07-07  2:42         ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-06 23:49 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eugenio Perez Martin
  Cc: Jason Wang, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 7/6/2021 10:08 PM, Michael S. Tsirkin wrote:
> On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
>> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>>> This patch adds a new device facility to save and restore virtqueue
>>>> state. The virtqueue state is split into two parts:
>>>>
>>>> - The available state: The state that is used for reading the next
>>>>    available buffer.
>>>> - The used state: The state that is used for making buffers used.
>>>>
>>>> Note that there could be devices that are required to set and get the
>>>> requests that are being processed by the device. I leave such an API
>>>> to be device specific.
>>>>
>>>> This facility could be used by both migration and device diagnostic.
>>>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> Hi Jason!
>>> I feel that for use-cases such as SRIOV,
>>> the facility to save/restore vq should be part of a PF
>>> that is there needs to be a way for one virtio device to
>>> address the state of another one.
>>>
>> Hi!
>>
>> In my opinion we should go the other way around: To make features as
>> orthogonal/independent as possible, and just make them work together
>> if we have to. In this particular case, I think it should be easier to
>> decide how to report status, its needs, etc for a VF, and then open
>> the possibility for the PF to query or set them, reusing format,
>> behavior, etc. as much as possible.
>>
>> I think that the most controversial point about doing it non-SR IOV
>> way is the exposing of these features/fields to the guest using
>> specific transport facilities, like PCI common config. However I think
>> it should not be hard for the hypervisor to intercept them and even to
>> expose them conditionally. Please correct me if this guessing was not
>> right and you had other concerns.
>
> Possibly. I'd like to see some guidance on how this all will work
> in practice then. Maybe make it all part of a non-normative section
> for now.
> I think that the feature itself is not very useful outside of
> migration so we don't really gain much by adding it as is
> without all the other missing pieces.
> I would say let's see more of the whole picture before we commit.

I agree here. I also can't see the whole picture for the SRIOV case.

I'll try to combine the admin control queue suggested in the previous
patch set with my proposal of the PF managing VF migration.

Feature negotiation is part of virtio device-driver communication and 
not part of the migration software that should manage the migration process.

To me, it seems like queue state is something that should stay internal
and not be exposed to guest drivers as a new feature.

>
>
>
>>> Thoughts?
>>>
>>>> ---
>>>>   content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 117 insertions(+)
>>>>
>>>> diff --git a/content.tex b/content.tex
>>>> index 620c0e2..8877b6f 100644
>>>> --- a/content.tex
>>>> +++ b/content.tex
>>>> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
>>>>   types. It is RECOMMENDED that devices generate version 4
>>>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>
>>>> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
>>>> +
>>>> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
>>>> +get the device internal virtqueue state through the following
>>>> +fields. The way to access those fields is transport specific.
>>>> +
>>>> +\subsection{\field{Available State} Field}
>>>> +
>>>> +The \field{Available State} field is two bytes for the driver to get
>>>> +or set the state that is used by the virtqueue to read the next
>>>> +available buffer.
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        last_avail_idx : 16;
>>>> +} avail_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{last_avail_idx} field indicates where the device would read
>>>> +the next index from the virtqueue available ring (modulo the queue
>>>> +size). This starts at the value set by the driver, and increases.
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        last_avail_idx : 15;
>>>> +        last_avail_wrap_counter : 1;
>>>> +} avail_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{last_avail_idx} field indicates where the device would read for
>>>> +the next descriptor head from the descriptor ring. This starts at the
>>>> +value set by the driver and wraps around when reaching the end of the
>>>> +ring.
>>>> +
>>>> +The \field{last_avail_wrap_counter} field indicates the last Driver Ring
>>>> +Wrap Counter that is observed by the device. This starts at the value
>>>> +set by the driver, and is flipped when reaching the end of the ring.
>>>> +
>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>>>> +
>>>> +\subsection{\field{Used State} Field}
>>>> +
>>>> +The \field{Used State} field is two bytes for the driver to set and
>>>> +get the state used by the virtqueue to make buffers used.
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        used_idx : 16;
>>>> +} used_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{used_idx} field indicates where the device would write the next used
>>>> +descriptor head to the used ring (modulo the queue size). This starts
>>>> +at the value set by the driver, and increases. It is easy to see this
>>>> +is the initial value of the \field{idx} in the used ring.
>>>> +
>>>> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        used_idx : 15;
>>>> +        used_wrap_counter : 1;
>>>> +} used_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{used_idx} indicates where the device would write the next used
>>>> +descriptor head to the descriptor ring. This starts at the value set
>>>> +by the driver, and wraps around when reaching the end of the ring.
>>>> +
>>>> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
>>>> +at the value set by the driver, and is flipped when reaching the end
>>>> +of the ring.
>>>> +
>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>>>
>>> Above only fully describes the vq state if descriptors
>>> are used in order or at least all out of order descriptors are consumed
>>> at time of save.
>>>
>> I think that the most straightforward solution would be to add
>> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
>> part.
>>
>> Thanks!
>>
>>> Adding later option to devices such as net will need extra spec work.
>>>
>>>
>>>> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>>>> +
>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>> +\begin{itemize}
>>>> +\item A driver MUST NOT set the virtqueue state before setting the
>>>> +  FEATURE_OK status bit.
>>>> +\item A driver MUST NOT set the virtqueue state after setting the
>>>> +  DRIVER_OK status bit.
>>>> +\end{itemize}
>>>> +
>>>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>>>> +
>>>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
>>>> +the read and write to the virtqueue state.
>>>> +
>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>> +\begin{itemize}
>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>> +FEATURE_OK status bit is not set.
>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>> +DRIVER_OK status bit is set.
>>>> +\end{itemize}
>>>> +
>>>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>>>
>>> may have?
>>> should also go into a normative section
>>>
>>>> +device-specific way for the driver to set and get extra virtqueue
>>>> +states such as in flight requests.
>>>> +
>>>>   \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
>>>>
>>>>   We start with an overview of device initialization, then expand on the
>>>> @@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
>>>>      device, optional per-bus setup, reading and possibly writing the
>>>>      device's virtio configuration space, and population of virtqueues.
>>>>
>>>> +\item\label{itm:General Initialization And Device Operation / Device
>>>> +  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per virtqueue available state, used state and the possible device specific virtqueue state.
>>>> +
>>>>   \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
>>>>      ``live''.
>>>>   \end{enumerate}
>>>> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>>>     transport specific.
>>>>     For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>>>>
>>>> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
>>>> +  can set and get the device internal virtqueue state.
>>>> +  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>>> +
>>>>   \end{description}
>>>>
>>>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>>>> --
>>>> 2.25.1


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06 14:27       ` [virtio-dev] " Cornelia Huck
@ 2021-07-07  0:05         ` Max Gurtovoy
  2021-07-07  3:14           ` Jason Wang
  2021-07-07  2:56         ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-07  0:05 UTC (permalink / raw)
  To: Cornelia Huck, Jason Wang, mst, virtio-comment, virtio-dev
  Cc: stefanha, eperezma, oren, shahafs, parav, bodong, amikheev, pasic


On 7/6/2021 5:27 PM, Cornelia Huck wrote:
> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2021/7/6 8:50 PM, Cornelia Huck wrote:
>>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>>>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>>>> +is set to synchronize with the device.
>>> Is this more that the driver needs to re-read the status until STOP is
>>> set to make sure that the stop process has finished?
>>
>> Yes.
>>
>>
>>>    If the device has
>>> offered the feature and the driver accepted it, I'd assume that the
>>> device will eventually finish with the procedure, or sets NEEDS_RESET if
>>> something goes wrong?
>>
>> As stated below, the device must either:
>>
>> 1) finish all pending requests
>>
>> or
>>
>> 2) provide a device specific way for the driver to save and restore
>> pending requests
>>
>> before setting STOP.
>>
>> Otherwise the device can't offer this feature.
>>
>> Using NEEDS_RESET seems more complicated than this.
> Hm, what happens on an internal error? I assume that the device would
> need to signal that in some way. Or should it simply set STOP and
> discard any pending requests? The driver would not be able to
> distinguish that from a normal STOP.

Again, this looks like a vDPA-specific solution where the BAR is managed
by the vDPA driver on the host.

In SRIOV the flow is different.

Please look on the state machine in my proposal.

You need a way to quiesce a device (its internal state can still change,
but the device stops dirtying guest pages and stops changing other
devices' state in p2p) and a way to freeze a device (its internal state
is no longer allowed to change, and the state can be queried by the
migration software).

Is it possible to have p2p in vDPA?
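
The quiesce/freeze split can be sketched as a tiny one-way state
machine. This is only my illustration of the proposal, not spec text;
the state names are invented, and the "no way back except reset" rule
mirrors the uni-directional STOP semantics in this series:

```c
#include <assert.h>

/* Hypothetical migration states: RUNNING -> QUIESCED -> FROZEN.
 * QUIESCED: device stops dirtying guest pages / touching peers.
 * FROZEN: internal state may no longer change and can be queried. */
enum mig_state { MIG_RUNNING, MIG_QUIESCED, MIG_FROZEN };

/* Advance on a legal forward transition; reject anything else
 * (including going backwards, which only a full reset could do). */
int mig_transition(enum mig_state *s, enum mig_state next)
{
    int ok = (*s == MIG_RUNNING  && next == MIG_QUIESCED) ||
             (*s == MIG_QUIESCED && next == MIG_FROZEN);
    if (!ok)
        return -1;
    *s = next;
    return 0;
}
```

Migration software would query the device state only once FROZEN is
reached, so a device that cannot stop dirtying pages simply never
offers the feature.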

>
>>>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>>>> +the device MUST finish any pending operations like in flight requests
>>>> +or have its device specific way for the driver to save the pending
>>>> +operations like in flight requests before setting the STOP status bit.
>>>> +
>>>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>>>> +buffers or send any used buffer notifications to the driver after
>>>> +STOP. The device MUST keep the configuration space unchanged and MUST
>>>> +NOT send configuration space change notification to the driver after
>>>> +STOP.
>>>> +
>>>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>>>> +preserve all the necessary state (the virtqueue states with the
>>>> +possible device specific states) that is required for restoring in the
>>>> +future.
>>> What happens if the driver writes STOP when DRIVER_OK has not been
>>> set?
>>
>> I think we need a device normative like:
>>
>> If VIRTIO_F_STOP has been negotiated, the driver SHOULD ignore the STOP
>> status bit if DRIVER_OK is not set.
> That's the device that needs to do the ignoring, right?
>
>>
>>>    Should the device set NEEDS_RESET, as suggested above? Same, if
>>> saving the states somehow goes wrong?
>>
>> I try hard to avoid NEEDS_RESET, so the driver is required to only read
>> the state during DRIVER_OK & STOP, and set the state during FEATURES_OK
>> & !DRIVER_OK. This is described in the driver normative in patch 1 and
>> below.
> The device can certainly ignore STOP requests that are out of spec. But
> I think we cannot get around signaling device errors in some way.
>
>>>> +\section{Virtqueue State Saving}
>>>> +
>>>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated, a
>>>> +driver MAY save the internal virtqueue state.
>>> Is that device type specific, or something generic? The last patch
>>> suggests that it may vary by device type.
>>
>> Both virtqueue state (avail/used state) and the STOP status bit is generic.
>>
>> But the device is free to have its own specific:
>>
>> 1) extra virtqueue states (pending requests)
>> 2) device states
>>
>> And 2) is out of the scope of this series.
> Ok.
>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06 17:09     ` Eugenio Perez Martin
  2021-07-06 19:08       ` Michael S. Tsirkin
@ 2021-07-07  2:41       ` Jason Wang
  1 sibling, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-07  2:41 UTC (permalink / raw)
  To: Eugenio Perez Martin, Michael S. Tsirkin
  Cc: virtio-comment, Virtio-Dev, Stefan Hajnoczi, Max Gurtovoy,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/7 1:09 AM, Eugenio Perez Martin wrote:
> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>> This patch adds a new device facility to save and restore virtqueue
>>> state. The virtqueue state is split into two parts:
>>>
>>> - The available state: The state that is used for reading the next
>>>    available buffer.
>>> - The used state: The state that is used for making buffers used.
>>>
>>> Note that there could be devices that are required to set and get the
>>> requests that are being processed by the device. I leave such an API
>>> to be device specific.
>>>
>>> This facility could be used by both migration and device diagnostic.
>>>
>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> Hi Jason!
>> I feel that for use-cases such as SRIOV,
>> the facility to save/restore vq should be part of a PF
>> that is there needs to be a way for one virtio device to
>> address the state of another one.
>>
> Hi!
>
> In my opinion we should go the other way around: To make features as
> orthogonal/independent as possible, and just make them work together
> if we have to.


I agree.


>   In this particular case, I think it should be easier to
> decide how to report status, its needs, etc for a VF, and then open
> the possibility for the PF to query or set them, reusing format,
> behavior, etc. as much as possible.


Yes, that's why I introduced virtqueue state as a basic facility and
reused a device status bit for STOP.
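
As a concrete reading of the avail_state layouts in patch 1, both
variants fit in one le16: the split ring uses all 16 bits for
last_avail_idx, while the packed ring uses 15 bits plus the wrap
counter. The helpers below are a sketch; they assume the wrap counter
sits in the most significant bit, which is my reading of the bitfield
order, with the spec text remaining authoritative:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the packed-virtqueue avail_state into one 16-bit value:
 * bits 0..14 hold last_avail_idx, bit 15 holds
 * last_avail_wrap_counter (assumed placement). */
uint16_t avail_state_pack(uint16_t last_avail_idx, unsigned wrap)
{
    return (uint16_t)((last_avail_idx & 0x7fff) | ((wrap & 1u) << 15));
}

/* Recover the index and wrap counter from the 16-bit state word. */
void avail_state_unpack(uint16_t v, uint16_t *last_avail_idx,
                        unsigned *wrap)
{
    *last_avail_idx = v & 0x7fff;
    *wrap = v >> 15;
}
```

A PF (or hypervisor) intercepting this field would only ever shuttle
the opaque le16 around; the bit split matters to the device alone.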


>
> I think that the most controversial point about doing it non-SR IOV
> way is the exposing of these features/fields to the guest using
> specific transport facilities, like PCI common config. However I think
> it should not be hard for the hypervisor to intercept them and even to
> expose them conditionally.


This could be done via the admin virtqueue. I'm working on device
slicing at the virtio level via the admin virtqueue; this means all the
basic facilities of a device slice could be carried over the admin
virtqueue: device status, configuration, probing, feature negotiation,
etc.
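
As a purely hypothetical illustration of that direction, an
admin-virtqueue command carrying another function's virtqueue state
might look like the struct below. No such command exists in the spec
at this point; every name and the layout itself are invented for the
sketch, reusing only the two le16 state words defined in patch 1:

```c
#include <assert.h>
#include <stdint.h>

/* Invented admin-virtqueue command payload: the PF addresses one
 * virtqueue of a managed VF and carries its two state words.
 * All fields little-endian on the wire; no padding (4 x 16 bits). */
struct admin_cmd_vq_state {
    uint16_t vf_id;       /* which VF the PF is managing */
    uint16_t vq_index;    /* which virtqueue of that VF */
    uint16_t avail_state; /* last_avail_idx (+ wrap bit when packed) */
    uint16_t used_state;  /* used_idx (+ wrap bit when packed) */
};
```

The point is only that the same facility defined for the VF could be
queried or set by the PF without a new state format.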

Thanks


>   Please correct me if this guessing was not
> right and you had other concerns.
>
>> Thoughts?
>>
>>> ---
>>>   content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 117 insertions(+)
>>>
>>> diff --git a/content.tex b/content.tex
>>> index 620c0e2..8877b6f 100644
>>> --- a/content.tex
>>> +++ b/content.tex
>>> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
>>>   types. It is RECOMMENDED that devices generate version 4
>>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>
>>> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
>>> +
>>> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
>>> +get the device internal virtqueue state through the following
>>> +fields. The way to access those fields is transport specific.
>>> +
>>> +\subsection{\field{Available State} Field}
>>> +
>>> +The \field{Available State} field is two bytes for the driver to get
>>> +or set the state that is used by the virtqueue to read the next
>>> +available buffer.
>>> +
>>> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
>>> +
>>> +\begin{lstlisting}
>>> +le16 {
>>> +        last_avail_idx : 16;
>>> +} avail_state;
>>> +\end{lstlisting}
>>> +
>>> +The \field{last_avail_idx} field indicates where the device would read
>>> +the next index from the virtqueue available ring (modulo the queue
>>> +size). This starts at the value set by the driver, and increases.
>>> +
>>> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
>>> +
>>> +\begin{lstlisting}
>>> +le16 {
>>> +        last_avail_idx : 15;
>>> +        last_avail_wrap_counter : 1;
>>> +} avail_state;
>>> +\end{lstlisting}
>>> +
>>> +The \field{last_avail_idx} field indicates where the device would read for
>>> +the next descriptor head from the descriptor ring. This starts at the
>>> +value set by the driver and wraps around when reaching the end of the
>>> +ring.
>>> +
>>> +The \field{last_avail_wrap_counter} field indicates the last Driver Ring
>>> +Wrap Counter that is observed by the device. This starts at the value
>>> +set by the driver, and is flipped when reaching the end of the ring.
>>> +
>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>>> +
>>> +\subsection{\field{Used State} Field}
>>> +
>>> +The \field{Used State} field is two bytes for the driver to set and
>>> +get the state used by the virtqueue to make buffers used.
>>> +
>>> +When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
>>> +
>>> +\begin{lstlisting}
>>> +le16 {
>>> +        used_idx : 16;
>>> +} used_state;
>>> +\end{lstlisting}
>>> +
>>> +The \field{used_idx} field indicates where the device would write the next used
>>> +descriptor head to the used ring (modulo the queue size). This starts
>>> +at the value set by the driver, and increases. It is easy to see this
>>> +is the initial value of the \field{idx} in the used ring.
>>> +
>>> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}
>>> +
>>> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
>>> +
>>> +\begin{lstlisting}
>>> +le16 {
>>> +        used_idx : 15;
>>> +        used_wrap_counter : 1;
>>> +} used_state;
>>> +\end{lstlisting}
>>> +
>>> +The \field{used_idx} indicates where the device would write the next used
>>> +descriptor head to the descriptor ring. This starts at the value set
>>> +by the driver, and wraps around when reaching the end of the ring.
>>> +
>>> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
>>> +at the value set by the driver, and is flipped when reaching the end
>>> +of the ring.
>>> +
>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
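[Editor's note: the two-byte packed-ring state layout quoted above (a 15-bit index plus a one-bit wrap counter) could be packed and unpacked as in this sketch. The helper names are illustrative, not part of the spec; the same layout applies to both avail_state and used_state.]

```c
#include <stdint.h>

/* Illustrative helpers (names not from the spec): bits 0..14 hold
 * last_avail_idx, bit 15 holds last_avail_wrap_counter. */
static uint16_t avail_state_pack(uint16_t last_avail_idx, unsigned wrap_counter)
{
    return (uint16_t)((last_avail_idx & 0x7fff) | ((wrap_counter & 1u) << 15));
}

static uint16_t avail_state_idx(uint16_t state)
{
    return state & 0x7fff;   /* low 15 bits: the index */
}

static unsigned avail_state_wrap(uint16_t state)
{
    return state >> 15;      /* top bit: the wrap counter */
}
```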
>>
>> Above only fully describes the vq state if descriptors
>> are used in order or at least all out of order descriptors are consumed
>> at time of save.
>>
> I think that the most straightforward solution would be to add
> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
> part.
>
> Thanks!
>
>> Adding later option to devices such as net will need extra spec work.
>>
>>
>>> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>>> +
>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>> +\begin{itemize}
>>> +\item A driver MUST NOT set the virtqueue state before setting the
>>> +  FEATURE_OK status bit.
>>> +\item A driver MUST NOT set the virtqueue state after setting the
>>> +  DRIVER_OK status bit.
>>> +\end{itemize}
>>> +
>>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>>> +
>>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
>>> +the read and write to the virtqueue state.
>>> +
>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>> +\begin{itemize}
>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>> +FEATURE_OK status bit is not set.
>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>> +DRIVER_OK status bit is set.
>>> +\end{itemize}
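[Editor's note: the normative rules above confine virtqueue state writes to the window after FEATURES_OK is set and before DRIVER_OK is set. A minimal sketch of that check, using the standard virtio status bit values; the helper name is made up:]

```c
#include <stdint.h>

#define VIRTIO_STATUS_FEATURES_OK 8  /* standard virtio status bits */
#define VIRTIO_STATUS_DRIVER_OK   4

/* Hypothetical helper: may the driver write virtqueue state now?
 * Allowed only in the FEATURES_OK-but-not-DRIVER_OK window. */
static int vq_state_write_allowed(uint8_t status)
{
    return (status & VIRTIO_STATUS_FEATURES_OK) &&
           !(status & VIRTIO_STATUS_DRIVER_OK);
}
```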
>>> +
>>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>>
>> may have?
>> should also go into a normative section
>>
>>> +device-specific way for the driver to set and get extra virtqueue
>>> +states such as in flight requests.
>>> +
>>>   \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
>>>
>>>   We start with an overview of device initialization, then expand on the
>>> @@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
>>>      device, optional per-bus setup, reading and possibly writing the
>>>      device's virtio configuration space, and population of virtqueues.
>>>
>>> +\item\label{itm:General Initialization And Device Operation / Device
>>> +  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per virtqueue available state, used state and the possible device specific virtqueue state.
>>> +
>>>   \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
>>>      ``live''.
>>>   \end{enumerate}
>>> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>>     transport specific.
>>>     For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>>>
>>> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
>>> +  can set and get the device internal virtqueue state.
>>> +  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>> +
>>>   \end{description}
>>>
>>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>>> --
>>> 2.25.1
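[Editor's note: as background for the split-ring state in the patch above, last_avail_idx and used_idx are free-running 16-bit counters, and the ring slot actually accessed is the counter modulo the queue size. A small sketch; the function name is illustrative:]

```c
#include <stdint.h>

/* Map a free-running split-ring index to a ring slot. Split-ring queue
 * sizes are powers of two in virtio, so the modulo could also be a
 * mask; plain modulo keeps the sketch general. */
static uint16_t ring_slot(uint16_t free_running_idx, uint16_t queue_size)
{
    return free_running_idx % queue_size;
}
```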


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06 19:08       ` Michael S. Tsirkin
  2021-07-06 23:49         ` Max Gurtovoy
@ 2021-07-07  2:42         ` Jason Wang
  2021-07-07  4:36           ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-07  2:42 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eugenio Perez Martin
  Cc: virtio-comment, Virtio-Dev, Stefan Hajnoczi, Max Gurtovoy,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/7 3:08 AM, Michael S. Tsirkin wrote:
> On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
>> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>>> This patch adds new device facility to save and restore virtqueue
>>>> state. The virtqueue state is split into two parts:
>>>>
>>>> - The available state: The state that is used for read the next
>>>>    available buffer.
>>>> - The used state: The state that is used for making buffer used.
>>>>
>>>> Note that, there could be devices that is required to set and get the
>>>> requests that are being processed by the device. I leave such API to
>>>> be device specific.
>>>>
>>>> This facility could be used by both migration and device diagnostic.
>>>>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> Hi Jason!
>>> I feel that for use-cases such as SRIOV,
>>> the facility to save/restore vq should be part of a PF
>>> that is there needs to be a way for one virtio device to
>>> address the state of another one.
>>>
>> Hi!
>>
>> In my opinion we should go the other way around: To make features as
>> orthogonal/independent as possible, and just make them work together
>> if we have to. In this particular case, I think it should be easier to
>> decide how to report status, its needs, etc for a VF, and then open
>> the possibility for the PF to query or set them, reusing format,
>> behavior, etc. as much as possible.
>>
>> I think that the most controversial point about doing it non-SR IOV
>> way is the exposing of these features/fields to the guest using
>> specific transport facilities, like PCI common config. However I think
>> it should not be hard for the hypervisor to intercept them and even to
>> expose them conditionally. Please correct me if this guessing was not
>> right and you had other concerns.
>
> Possibly. I'd like to see some guidance on how this all will work
> in practice then. Maybe make it all part of a non-normative section
> for now.
> I think that the feature itself is not very useful outside of
> migration so we don't really gain much by adding it as is
> without all the other missing pieces.


For networking devices, the only missing part is the transport-specific 
implementation of the virtqueue state.


> I would say let's see more of the whole picture before we commit.


I will include an implementation of PCI as an example.

Thanks


>
>
>
>>> Thoughts?
>>>
>>>> ---
>>>>   content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>   1 file changed, 117 insertions(+)
>>>>
>>>> diff --git a/content.tex b/content.tex
>>>> index 620c0e2..8877b6f 100644
>>>> --- a/content.tex
>>>> +++ b/content.tex
>>>> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
>>>>   types. It is RECOMMENDED that devices generate version 4
>>>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>
>>>> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
>>>> +
>>>> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
>>>> +get the device internal virtqueue state through the following
>>>> +fields. The way to access those fields is transport specific.
>>>> +
>>>> +\subsection{\field{Available State} Field}
>>>> +
>>>> +The \field{Available State} field is two bytes for the driver to get
>>>> +or set the state that is used by the virtqueue to read for the next
>>>> +available buffer.
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        last_avail_idx : 16;
>>>> +} avail_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{last_avail_idx} field indicates where the device would read
>>>> +for the next index from the virtqueue available ring (modulo the
>>>> +queue size). This starts at the value set by the driver, and increases.
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        last_avail_idx : 15;
>>>> +        last_avail_wrap_counter : 1;
>>>> +} avail_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{last_avail_idx} field indicates where the device would read for
>>>> +the next descriptor head from the descriptor ring. This starts at the
>>>> +value set by the driver and wraps around when reaching the end of the
>>>> +ring.
>>>> +
>>>> +The \field{last_avail_wrap_counter} field indicates the last Driver Ring
>>>> +Wrap Counter that is observed by the device. This starts at the value
>>>> +set by the driver, and is flipped when reaching the end of the ring.
>>>> +
>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>>>> +
>>>> +\subsection{\field{Used State} Field}
>>>> +
>>>> +The \field{Used State} field is two bytes for the driver to set and
>>>> +get the state used by the virtqueue to make buffer used.
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is not negotiated, the used state contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        used_idx : 16;
>>>> +} used_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{used_idx} field indicates where the device would write the next used
>>>> +descriptor head to the used ring (modulo the queue size). This starts
>>>> +at the value set by the driver, and increases. It is easy to see this
>>>> +is the initial value of the \field{idx} in the used ring.
>>>> +
>>>> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Used Ring}
>>>> +
>>>> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
>>>> +
>>>> +\begin{lstlisting}
>>>> +le16 {
>>>> +        used_idx : 15;
>>>> +        used_wrap_counter : 1;
>>>> +} used_state;
>>>> +\end{lstlisting}
>>>> +
>>>> +The \field{used_idx} indicates where the device would write the next used
>>>> +descriptor head to the descriptor ring. This starts at the value set
>>>> +by the driver, and wraps around when reaching the end of the ring.
>>>> +
>>>> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This starts
>>>> +at the value set by the driver, and is flipped when reaching the end
>>>> +of the ring.
>>>> +
>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>>>
>>> Above only fully describes the vq state if descriptors
>>> are used in order or at least all out of order descriptors are consumed
>>> at time of save.
>>>
>> I think that the most straightforward solution would be to add
>> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
>> part.
>>
>> Thanks!
>>
>>> Adding later option to devices such as net will need extra spec work.
>>>
>>>
>>>> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>>>> +
>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>> +\begin{itemize}
>>>> +\item A driver MUST NOT set the virtqueue state before setting the
>>>> +  FEATURE_OK status bit.
>>>> +\item A driver MUST NOT set the virtqueue state after setting the
>>>> +  DRIVER_OK status bit.
>>>> +\end{itemize}
>>>> +
>>>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>>>> +
>>>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
>>>> +the read and write to the virtqueue state.
>>>> +
>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>> +\begin{itemize}
>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>> +FEATURE_OK status bit is not set.
>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>> +DRIVER_OK status bit is set.
>>>> +\end{itemize}
>>>> +
>>>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>>>
>>> may have?
>>> should also go into a normative section
>>>
>>>> +device-specific way for the driver to set and get extra virtqueue
>>>> +states such as in flight requests.
>>>> +
>>>>   \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
>>>>
>>>>   We start with an overview of device initialization, then expand on the
>>>> @@ -420,6 +530,9 @@ \section{Device Initialization}\label{sec:General Initialization And Device Oper
>>>>      device, optional per-bus setup, reading and possibly writing the
>>>>      device's virtio configuration space, and population of virtqueues.
>>>>
>>>> +\item\label{itm:General Initialization And Device Operation / Device
>>>> +  Initialization / Virtqueue State Setup} When VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state setup, including the initialization of the per virtqueue available state, used state and the possible device specific virtqueue state.
>>>> +
>>>>   \item\label{itm:General Initialization And Device Operation / Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status bit.  At this point the device is
>>>>      ``live''.
>>>>   \end{enumerate}
>>>> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>>>     transport specific.
>>>>     For more details about driver notifications over PCI see \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / PCI-specific Initialization And Device Operation / Available Buffer Notifications}.
>>>>
>>>> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the driver
>>>> +  can set and get the device internal virtqueue state.
>>>> +  See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>>> +
>>>>   \end{description}
>>>>
>>>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>>>> --
>>>> 2.25.1



* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06 23:49         ` Max Gurtovoy
@ 2021-07-07  2:50           ` Jason Wang
  2021-07-07 12:03             ` Max Gurtovoy
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-07  2:50 UTC (permalink / raw)
  To: Max Gurtovoy, Michael S. Tsirkin, Eugenio Perez Martin
  Cc: virtio-comment, Virtio-Dev, Stefan Hajnoczi, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/7 7:49 AM, Max Gurtovoy wrote:
>
> On 7/6/2021 10:08 PM, Michael S. Tsirkin wrote:
>> On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
>>> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> 
>>> wrote:
>>>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>>>> This patch adds new device facility to save and restore virtqueue
>>>>> state. The virtqueue state is split into two parts:
>>>>>
>>>>> - The available state: The state that is used for read the next
>>>>>    available buffer.
>>>>> - The used state: The state that is used for making buffer used.
>>>>>
>>>>> Note that, there could be devices that is required to set and get the
>>>>> requests that are being processed by the device. I leave such API to
>>>>> be device specific.
>>>>>
>>>>> This facility could be used by both migration and device diagnostic.
>>>>>
>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> Hi Jason!
>>>> I feel that for use-cases such as SRIOV,
>>>> the facility to save/restore vq should be part of a PF
>>>> that is there needs to be a way for one virtio device to
>>>> address the state of another one.
>>>>
>>> Hi!
>>>
>>> In my opinion we should go the other way around: To make features as
>>> orthogonal/independent as possible, and just make them work together
>>> if we have to. In this particular case, I think it should be easier to
>>> decide how to report status, its needs, etc for a VF, and then open
>>> the possibility for the PF to query or set them, reusing format,
>>> behavior, etc. as much as possible.
>>>
>>> I think that the most controversial point about doing it non-SR IOV
>>> way is the exposing of these features/fields to the guest using
>>> specific transport facilities, like PCI common config. However I think
>>> it should not be hard for the hypervisor to intercept them and even to
>>> expose them conditionally. Please correct me if this guessing was not
>>> right and you had other concerns.
>>
>> Possibly. I'd like to see some guidance on how this all will work
>> in practice then. Maybe make it all part of a non-normative section
>> for now.
>> I think that the feature itself is not very useful outside of
>> migration so we don't really gain much by adding it as is
>> without all the other missing pieces.
>> I would say let's see more of the whole picture before we commit.
>
> I agree here. I also can't see the whole picture for SRIOV case.


Again, it's not related to SR-IOV at all. It introduces a basic 
facility at the virtio level which can work for all types of virtio device.

Transports such as PCI need to implement their own way to access this 
state. It's not hard to implement that simply via a capability.

It works like other basic facilities such as the device status, features, etc.

For SR-IOV, it doesn't prevent you from implementing that via the admin 
virtqueue.


>
> I'll try to combine the admin control queue suggested in previous 
> patch set to my proposal of PF managing the VF migration.


Note that the admin virtqueue should be transport independent when we 
try to introduce it.


>
> Feature negotiation is part of virtio device-driver communication and 
> not part of the migration software that should manage the migration 
> process.
>
> For me, seems like queue state is something that should be internal 
> and not be exposed to guest drivers that see this as a new feature.


This is not true: we have the case of nested virtualization. As 
mentioned in another thread, it's the hypervisor that needs to choose 
between hiding or shadowing the internal virtqueue state.

Thanks


>
>>
>>
>>
>>>> Thoughts?
>>>>
>>>>> ---
>>>>>   content.tex | 117 
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>   1 file changed, 117 insertions(+)
>>>>>
>>>>> diff --git a/content.tex b/content.tex
>>>>> index 620c0e2..8877b6f 100644
>>>>> --- a/content.tex
>>>>> +++ b/content.tex
>>>>> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic 
>>>>> Facilities of a Virtio Device / Expo
>>>>>   types. It is RECOMMENDED that devices generate version 4
>>>>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>>
>>>>> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
>>>>> +
>>>>> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
>>>>> +get the device internal virtqueue state through the following
>>>>> +fields. The way to access those fields is transport specific.
>>>>> +
>>>>> +\subsection{\field{Available State} Field}
>>>>> +
>>>>> +The \field{Available State} field is two bytes for the driver to get
>>>>> +or set the state that is used by the virtqueue to read for the next
>>>>> +available buffer.
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        last_avail_idx : 16;
>>>>> +} avail_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{last_avail_idx} field indicates where the device would read
>>>>> +for the next index from the virtqueue available ring (modulo the
>>>>> +queue size). This starts at the value set by the driver, and increases.
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        last_avail_idx : 15;
>>>>> +        last_avail_wrap_counter : 1;
>>>>> +} avail_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{last_avail_idx} field indicates where the device would 
>>>>> read for
>>>>> +the next descriptor head from the descriptor ring. This starts at 
>>>>> the
>>>>> +value set by the driver and wraps around when reaching the end of 
>>>>> the
>>>>> +ring.
>>>>> +
>>>>> +The \field{last_avail_wrap_counter} field indicates the last 
>>>>> Driver Ring
>>>>> +Wrap Counter that is observed by the device. This starts at the 
>>>>> value
>>>>> +set by the driver, and is flipped when reaching the end of the ring.
>>>>> +
>>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap 
>>>>> Counters}.
>>>>> +
>>>>> +\subsection{\field{Used State} Field}
>>>>> +
>>>>> +The \field{Used State} field is two bytes for the driver to set and
>>>>> +get the state used by the virtqueue to make buffer used.
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is not negotiated, the used state 
>>>>> contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        used_idx : 16;
>>>>> +} used_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{used_idx} field indicates where the device would write the next used
>>>>> +descriptor head to the used ring (modulo the queue size). This 
>>>>> starts
>>>>> +at the value set by the driver, and increases. It is easy to see 
>>>>> this
>>>>> +is the initial value of the \field{idx} in the used ring.
>>>>> +
>>>>> +See also \ref{sec:Basic Facilities of a Virtio Device / 
>>>>> Virtqueues / The Virtqueue Used Ring}
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        used_idx : 15;
>>>>> +        used_wrap_counter : 1;
>>>>> +} used_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{used_idx} indicates where the device would write the 
>>>>> next used
>>>>> +descriptor head to the descriptor ring. This starts at the value set
>>>>> +by the driver, and wraps around when reaching the end of the ring.
>>>>> +
>>>>> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This 
>>>>> starts
>>>>> +at the value set by the driver, and is flipped when reaching the end
>>>>> +of the ring.
>>>>> +
>>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap 
>>>>> Counters}.
>>>>
>>>> Above only fully describes the vq state if descriptors
>>>> are used in order or at least all out of order descriptors are 
>>>> consumed
>>>> at time of save.
>>>>
>>> I think that the most straightforward solution would be to add
>>> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
>>> part.
>>>
>>> Thanks!
>>>
>>>> Adding later option to devices such as net will need extra spec work.
>>>>
>>>>
>>>>> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities 
>>>>> of a Virtio Device / Virtqueue State}
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>>> +\begin{itemize}
>>>>> +\item A driver MUST NOT set the virtqueue state before setting the
>>>>> +  FEATURE_OK status bit.
>>>>> +\item A driver MUST NOT set the virtqueue state after setting the
>>>>> +  DRIVER_OK status bit.
>>>>> +\end{itemize}
>>>>> +
>>>>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities 
>>>>> of a Virtio Device / Virtqueue State}
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
>>>>> +the read and write to the virtqueue state.
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>>> +\begin{itemize}
>>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>>> +FEATURE_OK status bit is not set.
>>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>>> +DRIVER_OK status bit is set.
>>>>> +\end{itemize}
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>>>>
>>>> may have?
>>>> should also go into a normative section
>>>>
>>>>> +device-specific way for the driver to set and get extra virtqueue
>>>>> +states such as in flight requests.
>>>>> +
>>>>>   \chapter{General Initialization And Device 
>>>>> Operation}\label{sec:General Initialization And Device Operation}
>>>>>
>>>>>   We start with an overview of device initialization, then expand 
>>>>> on the
>>>>> @@ -420,6 +530,9 @@ \section{Device 
>>>>> Initialization}\label{sec:General Initialization And Device Oper
>>>>>      device, optional per-bus setup, reading and possibly writing the
>>>>>      device's virtio configuration space, and population of 
>>>>> virtqueues.
>>>>>
>>>>> +\item\label{itm:General Initialization And Device Operation / Device
>>>>> +  Initialization / Virtqueue State Setup} When 
>>>>> VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state 
>>>>> setup, including the initialization of the per virtqueue available 
>>>>> state, used state and the possible device specific virtqueue state.
>>>>> +
>>>>>   \item\label{itm:General Initialization And Device Operation / 
>>>>> Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status 
>>>>> bit.  At this point the device is
>>>>>      ``live''.
>>>>>   \end{enumerate}
>>>>> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature 
>>>>> Bits}\label{sec:Reserved Feature Bits}
>>>>>     transport specific.
>>>>>     For more details about driver notifications over PCI see 
>>>>> \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / 
>>>>> PCI-specific Initialization And Device Operation / Available 
>>>>> Buffer Notifications}.
>>>>>
>>>>> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the 
>>>>> driver
>>>>> +  can set and get the device internal virtqueue state.
>>>>> +  See \ref{sec:Virtqueues / Virtqueue 
>>>>> State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>>>> +
>>>>>   \end{description}
>>>>>
>>>>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved 
>>>>> Feature Bits}
>>>>> -- 
>>>>> 2.25.1
>



* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06 14:27       ` [virtio-dev] " Cornelia Huck
  2021-07-07  0:05         ` Max Gurtovoy
@ 2021-07-07  2:56         ` Jason Wang
  2021-07-07 16:45           ` [virtio-comment] " Cornelia Huck
  1 sibling, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-07  2:56 UTC (permalink / raw)
  To: Cornelia Huck, mst, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic


On 2021/7/6 10:27 PM, Cornelia Huck wrote:
> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> 在 2021/7/6 下午8:50, Cornelia Huck 写道:
>>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>>>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>>>> +is set to synchronize with the device.
>>> Is this more that the driver needs to re-read the status until STOP is
>>> set to make sure that the stop process has finished?
>>
>> Yes.
>>
>>
>>>    If the device has
>>> offered the feature and the driver accepted it, I'd assume that the
>>> device will eventually finish with the procedure, or sets NEEDS_RESET if
>>> something goes wrong?
>>
>> As stated below, the device must either:
>>
>> 1) finish all pending requests
>>
>> or
>>
>> 2) provide a device specific way for the driver to save and restore
>> pending requests
>>
>> before setting STOP.
>>
>> Otherwise the device can't offer this feature.
>>
>> Using NEEDS_RESET seems more complicated than this.
> Hm, what happens on an internal error?


A question: can reset fail? If yes, do we need to define how the driver 
should proceed?

If not, I don't see the reason we need to deal with that for STOP.


> I assume that the device would
> need to signal that in some way. Or should it simply set STOP and
> discard any pending requests?


The current proposal doesn't mandate completing all in-flight 
requests. The device can provide a way for the driver to set and get the 
pending requests so they can be re-submitted later (or at the destination).


> The driver would not be able to
> distinguish that from a normal STOP.
>
>>>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>>>> +the device MUST finish any pending operations like in flight requests
>>>> +or have its device specific way for driver to save the pending
>>>> +operations like in flight requests before setting the STOP status bit.
>>>> +
>>>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>>>> +buffers or send any used buffer notifications to the driver after
>>>> +STOP. The device MUST keep the configuration space unchanged and MUST
>>>> +NOT send configuration space change notification to the driver after
>>>> +STOP.
>>>> +
>>>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>>>> +preserve all the necessary state (the virtqueue states with the
>>>> +possible device specific states) that is required for restoring in the
>>>> +future.
>>> What happens if the driver writes STOP in when DRIVER_OK has not been
>>> set?
>>
>> I think we need a device normative like:
>>
>> If VIRTIO_F_STOP has been negotiated, the driver SHOULD ignore the STOP
>> status bit if DRIVER_OK is not set.
> That's the device that needs to do the ignoring, right?


Yes, it's a typo.


>
>>
>>>    Should the device set NEEDS_RESET, as suggested above? Same, if
>>> saving the states somehow goes wrong?
>>
>> I try hard to avoid NEEDS_RESET, so the driver is required to only read
>> the state during DRIVER_OK & STOP, and set the state during FEATURES_OK
>> & !DRIVER_OK. This is described in the driver normative in patch 1 and
>> below.
> The device can certainly ignore STOP requests that are out of spec. But
> I think we cannot get around signaling device errors in some way.


See above,

For virtqueue state, it works like the other basic facilities. So reads 
and writes to the facility itself aren't expected to fail. But we can 
validate whether a write took effect by re-reading the state.

For the STOP bit, if we need error reporting, we probably need it in the 
case of reset as well.
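[Editor's note: the stop handshake discussed here — the driver writes STOP, then re-reads the status until the device reports it — could be sketched as below. The register accessors, the STOP bit value, and the toy device model are all illustrative assumptions, not from the spec.]

```c
#include <stdint.h>

#define VIRTIO_STATUS_STOP 0x20  /* illustrative bit value */

/* Toy device model standing in for a transport status register. */
static uint8_t device_status;

static void write_status(uint8_t s) { device_status = s; }
static uint8_t read_status(void)   { return device_status; }

/* Driver side: request stop, then poll until the device acknowledges.
 * A real driver would bound the wait and handle timeouts. */
static uint8_t stop_device(void)
{
    write_status(read_status() | VIRTIO_STATUS_STOP);
    while (!(read_status() & VIRTIO_STATUS_STOP))
        ;  /* device may take time to quiesce before reporting STOP */
    return read_status();
}
```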

Thanks


>
>>>> +\section{Virtqueue State Saving}
>>>> +
>>>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated. A
>>>> +driver MAY save the internal virtqueue state.
>>> Is that device type specific, or something generic? The last patch
>>> suggests that it may vary by device type.
>>
>> Both virtqueue state (avail/used state) and the STOP status bit is generic.
>>
>> But the device is free to have its own specific:
>>
>> 1) extra virtqueue states (pending requests)
>> 2) device states
>>
>> And 2) is out of the scope of this series.
> Ok.
>



* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-07  0:05         ` Max Gurtovoy
@ 2021-07-07  3:14           ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-07  3:14 UTC (permalink / raw)
  To: Max Gurtovoy, Cornelia Huck, mst, virtio-comment, virtio-dev
  Cc: stefanha, eperezma, oren, shahafs, parav, bodong, amikheev, pasic


On 2021/7/7 8:05 AM, Max Gurtovoy wrote:
>
> On 7/6/2021 5:27 PM, Cornelia Huck wrote:
>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>
>>> On 2021/7/6 8:50 PM, Cornelia Huck wrote:
>>>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after 
>>>>> setting
>>>>> +STOP, the driver MUST re-read the device status to ensure the 
>>>>> STOP bit
>>>>> +is set to synchronize with the device.
>>>> Is this more that the driver needs to re-read the status until STOP is
>>>> set to make sure that the stop process has finished?
>>>
>>> Yes.
>>>
>>>
>>>>    If the device has
>>>> offered the feature and the driver accepted it, I'd assume that the
>>>> device will eventually finish with the procedure, or sets 
>>>> NEEDS_RESET if
>>>> something goes wrong?
>>>
>>> As stated below, the device must either:
>>>
>>> 1) finish all pending requests
>>>
>>> or
>>>
>>> 2) provide a device specific way for the driver to save and restore
>>> pending requests
>>>
>>> before setting STOP.
>>>
>>> Otherwise the device can't offer this feature.
>>>
>>> Using NEEDS_RESET seems more complicated than this.
>> Hm, what happens on an internal error? I assume that the device would
>> need to signal that in some way. Or should it simply set STOP and
>> discard any pending requests? The driver would not be able to
>> distinguish that from a normal STOP.
>
> Again, this looks like vdpa specific solution where the BAR is managed 
> by vdpa driver on the host.


No, it has nothing specific to vDPA. It's a general facility of the 
virtio device which is unrelated to how it is implemented: 
registers/BAR/queue command/shared memory etc.

The design aims to fit all kinds of use cases. vDPA is just one of 
the suggested ways. A hypervisor may choose to expose the BAR/registers to 
the guest directly or not (e.g. in the case of nested virtualization).


>
> In SRIOV the flow is different.
>
> Please look on the state machine in my proposal.
>
> You need a way to quiesce a device (internal state can still change, 
> but device will stop dirty guest pages and stop changing other devices 
> states in p2p) 


See below, a transport is free to have its own requirements, and they 
don't conflict with the general device facility.


> and a way to freeze a device (internal state is not allowed to be 
> changed and  state can be queried by the migration software).


Again, this proposal is at the general virtio level. It works like device reset:

1) At the general virtio level, we have virtio-level reset
2) At the transport level, like PCI, we have PCI-level reset

They do not contradict each other.

For STOP:

1) At the general virtio level, we have the STOP definition
2) A transport is free to use sub-states to implement STOP (e.g. 
quiesce vs freeze):
     2.1) after the driver writes STOP, the device goes to the quiesce state
     2.2) after the device is frozen, the device can set the STOP bit
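As a rough illustration of how a transport could layer such sub-states under the single STOP bit, here is a mock transition sketch; the state names and functions are purely hypothetical, not from any spec text:

```c
/* Hypothetical transport-level sub-states behind the single STOP bit:
 * a STOP write first quiesces the device, and only once it is frozen
 * does the device report STOP to the driver. */
enum substate { RUNNING, QUIESCED, FROZEN };

struct mock_dev {
    enum substate sub;
    int stop_bit;  /* what the driver sees when re-reading status */
};

static void driver_writes_stop(struct mock_dev *d)
{
    /* Device stops dirtying guest pages and affecting peer devices,
     * but internal state may still settle. */
    d->sub = QUIESCED;
}

static void device_freeze_complete(struct mock_dev *d)
{
    d->sub = FROZEN;   /* internal state no longer changes */
    d->stop_bit = 1;   /* only now does the STOP bit read back as set */
}
```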

But before introducing a transport specific facility (PCI), it's better to 
check whether PCI already has a plan for this kind of operation in 
the PCIe spec.


>
> Is it possible to have p2p in vdpa ?


Again, it's nothing specific to vDPA. The design tries to work for any 
device and any kind of software layer on top.

For vDPA, we forbid p2p, but it's not hard to implement p2p in vDPA, 
especially considering the ongoing work to unify the IOMMU API between 
VFIO and vDPA.

Thanks


>
>>
>>>>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>>>>> +the device MUST finish any pending operations like in flight 
>>>>> requests
>>>>> +or have its device specific way for driver to save the pending
>>>>> +operations like in flight requests before setting the STOP status 
>>>>> bit.
>>>>> +
>>>>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>>>>> +buffers or send any used buffer notifications to the driver after
>>>>> +STOP. The device MUST keep the configuration space unchanged and 
>>>>> MUST
>>>>> +NOT send configuration space change notification to the driver after
>>>>> +STOP.
>>>>> +
>>>>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>>>>> +preserve all the necessary state (the virtqueue states with the
>>>>> +possible device specific states) that is required for restoring 
>>>>> in the
>>>>> +future.
>>>> What happens if the driver writes STOP in when DRIVER_OK has not been
>>>> set?
>>>
>>> I think we need a device normative like:
>>>
>>> If VIRTIO_F_STOP has been negotiated, the driver SHOULD ignore the STOP
>>> status bit if DRIVER_OK is not set.
>> That's the device that needs to do the ignoring, right?
>>
>>>
>>>>    Should the device set NEEDS_RESET, as suggested above? Same, if
>>>> saving the states somehow goes wrong?
>>>
>>> I try hard to avoid NEEDS_RESET, so the driver is required to only read
>>> the state during DRIVER_OK & STOP, and set the state during FEATURES_OK
>>> & !DRIVER_OK. This is described in the driver normative in patch 1 and
>>> below.
>> The device can certainly ignore STOP requests that are out of spec. But
>> I think we cannot get around signaling device errors in some way.
>>
>>>>> +\section{Virtqueue State Saving}
>>>>> +
>>>>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been 
>>>>> negotiated, a
>>>>> +driver MAY save the internal virtqueue state.
>>>> Is that device type specific, or something generic? The last patch
>>>> suggests that it may vary by device type.
>>>
>>> Both virtqueue state (avail/used state) and the STOP status bit is 
>>> generic.
>>>
>>> But the device is free to have its own specific:
>>>
>>> 1) extra virtqueue states (pending requests)
>>> 2) device states
>>>
>>> And 2) is out of the scope of this series.
>> Ok.
>>
>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06  9:24   ` [virtio-comment] " Dr. David Alan Gilbert
@ 2021-07-07  3:20     ` Jason Wang
  2021-07-09 17:23       ` Eugenio Perez Martin
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-07  3:20 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: mst, virtio-comment, virtio-dev, stefanha, mgurtovoy, cohuck,
	eperezma, oren, shahafs, parav, bodong, amikheev, pasic


On 2021/7/6 5:24 PM, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasowang@redhat.com) wrote:
>> This patch introduces a new status bit STOP. This will be
>> used by the driver to stop the device in order to safely fetch the
>> device state (virtqueue state) from the device.
>>
>> This is a must for supporting migration.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 66 insertions(+), 3 deletions(-)
>>
>> diff --git a/content.tex b/content.tex
>> index 8877b6f..284ead0 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>>     drive the device.
>>   
>> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
>> +  device has been stopped by the driver. This status bit is different
>> +  from the reset since the device state is preserved.
>> +
>>   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>>     an error from which it can't recover.
>>   \end{description}
>> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   recover by issuing a reset.
>>   \end{note}
>>   
>> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
>> +DRIVER_OK is not set.
>> +
>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>> +is set to synchronize with the device.
>> +
>>   \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>>   The device MUST initialize \field{device status} to 0 upon reset.
>>   
>>   The device MUST NOT consume buffers or send any used buffer
>>   notifications to the driver before DRIVER_OK.
>>   
>> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
>> +write of STOP.
>> +
>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>> +the device MUST finish any pending operations like in flight requests
>> +or have its device specific way for driver to save the pending
>> +operations like in flight requests before setting the STOP status bit.
>> +
>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>> +buffers or send any used buffer notifications to the driver after
>> +STOP. The device MUST keep the configuration space unchanged and MUST
>> +NOT send configuration space change notification to the driver after
>> +STOP.
>> +
>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>> +preserve all the necessary state (the virtqueue states with the
>> +possible device specific states) that is required for restoring in the
>> +future.
>> +
>>   \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>>   that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>>   MUST send a device configuration change notification to the driver.
>> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>>   \begin{itemize}
>>   \item A driver MUST NOT set the virtqueue state before setting the
>>     FEATURE_OK status bit.
>> -\item A driver MUST NOT set the virtqueue state after setting the
>> -  DRIVER_OK status bit.
>> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
>> +  bit is set without STOP status bit.
>>   \end{itemize}
>>   
>>   \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>>   \item A device SHOULD ignore the write to the virtqueue state if the
>>   FEATURE_OK status bit is not set.
>>   \item A device SHOULD ignore the write to the virtqueue state if the
>> -DRIVER_OK status bit is set.
>> +DRIVER_OK status bit is set without STOP status bit.
>>   \end{itemize}
>>   
>>   If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>>   
>>   Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>>   
>> +\section{Virtqueue State Saving}
>> +
>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated, a
>> +driver MAY save the internal virtqueue state.
>> +
>> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
>> +
>> +Assuming the device is 'live'.
> Is that defined somewhere?


Probably not, but 'live' is used several times in the spec.

I can clarify that in the spec.


>   If I understand correctly, this is all
> driven from the driver inside the guest, so for this to work
> the guest must be running and already have initialised the driver.


Yes.


>
>> The driver MUST follow this sequence to
>> +stop the device and save the virtqueue state:
>> +
>> +\begin{enumerate}
>> +\item Set the STOP status bit.
>> +
>> +\item Re-read \field{device status} until the STOP bit is set to
>> +  synchronize with the device.
> At that point, is it guaranteed that the device has already stopped
> and that all outstanding transactions have finished?


No, the device may choose one of the following:

1) finish all the pending transactions

or

2) provide a device specific way for the driver to save and restore the 
pending transactions


>
>> +\item Read \field{available state} and save it.
>> +
>> +\item Read \field{used state} and save it.
>> +
>> +\item Read device specific virtqueue states if needed.
>> +
>> +\item Reset the device.
> Say that a migration fails after this point; how does the source
> recover?


Re-do the device initialization, and then restore the state.


> Why is the 'reset the device' there?


Because the current virtio device status state machine is uni-directional: 
it's not allowed to clear a status bit. So the only way to recover is to 
reset and redo the initialization.
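The enumerated driver sequence quoted above (set STOP, re-read status until STOP is set, save the available/used state, then reset) can be sketched against a mock device. All names here are illustrative, and this mock completes pending requests synchronously, which a real device need not do:

```c
#include <stdint.h>

#define STATUS_DRIVER_OK 4
#define STATUS_STOP      32  /* proposed bit, not in the released spec */

/* Mock device: a STOP write completes in-flight work and then makes
 * the STOP bit observable in the status field. */
struct mock_dev {
    uint8_t  status;
    uint16_t avail_state, used_state;
    int      pending;  /* in-flight requests */
};

static void dev_write_status(struct mock_dev *d, uint8_t s)
{
    if ((s & STATUS_STOP) && (d->status & STATUS_DRIVER_OK)) {
        d->pending = 0;           /* device finishes in-flight work */
        d->status |= STATUS_STOP;
    }
}

static void dev_reset(struct mock_dev *d)
{
    d->status = 0;
}

/* Driver side of the proposed "Virtqueue State Saving" sequence. */
static void stop_and_save(struct mock_dev *d,
                          uint16_t *avail, uint16_t *used)
{
    dev_write_status(d, d->status | STATUS_STOP); /* 1. set STOP */
    while (!(d->status & STATUS_STOP))
        ;                          /* 2. re-read status until STOP */
    *avail = d->avail_state;       /* 3. save available state */
    *used  = d->used_state;        /* 4. save used state */
    dev_reset(d);                  /* 5. reset: the status machine is
                                      uni-directional, so recovery
                                      means reset + re-initialization */
}
```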

Thanks


>
>> +\end{enumerate}
>> +
>> +The driver MAY perform device specific steps to save device specific state.
>> +
>> +The driver MAY 'resume' the device by redoing the device initialization
>> +with the saved virtqueue state. See \ref{sec:General Initialization and Device Operation / Device Initialization}
>> +
>>   \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>>   
>>   Virtio can use various different buses, thus the standard is split
>> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>     can set and get the device internal virtqueue state.
>>     See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>   
>> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
>> +  stop the device.
>> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>>   \end{description}
>>   
>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>> -- 
>> 2.25.1
>>
>>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [virtio-dev] Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-06 12:27   ` [virtio-comment] " Cornelia Huck
@ 2021-07-07  3:29     ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-07  3:29 UTC (permalink / raw)
  To: Cornelia Huck, mst, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic


On 2021/7/6 8:27 PM, Cornelia Huck wrote:
> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> This patch adds new device facility to save and restore virtqueue
>> state. The virtqueue state is split into two parts:
>>
>> - The available state: The state that is used for read the next
>>    available buffer.
>> - The used state: The state that is used for making buffer used.
>>
>> Note that, there could be devices that are required to set and get the
>> requests that are being processed by the device. I leave such API to
>> be device specific.
>>
>> This facility could be used by both migration and device diagnostic.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   content.tex | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 117 insertions(+)
>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>> +
>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
>> +the read and write to the virtqueue state.
>> +
>> +If VIRTIO_F_RING_STATE has been negotiated:
>> +\begin{itemize}
>> +\item A device SHOULD ignore the write to the virtqueue state if the
>> +FEATURE_OK status bit is not set.
>> +\item A device SHOULD ignore the write to the virtqueue state if the
>> +DRIVER_OK status bit is set.
>> +\end{itemize}
>> +
>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>> +device-specific way for the driver to set and get extra virtqueue
>> +states such as in flight requests.
> Maybe better
>
> "If VIRTIO_F_RING_STATE has been negotiated, a device MAY provide a
> device-specific mechanism to set and get extra virtqueue states such as
> in flight requests."
>
> If a device type supports this facility, does it imply that it is always
> present when VIRTIO_RING_STATE has been negotiated?


It's not, it's device specific.


> I guess it could
> define further device-specific features to make it more configurable.


Yes, I can clarify this in the next version.

Thanks


>




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-07  2:42         ` Jason Wang
@ 2021-07-07  4:36           ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-07  4:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eugenio Perez Martin
  Cc: virtio-comment, Virtio-Dev, Stefan Hajnoczi, Max Gurtovoy,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/7 10:42 AM, Jason Wang wrote:
>
> On 2021/7/7 3:08 AM, Michael S. Tsirkin wrote:
>> On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
>>> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> 
>>> wrote:
>>>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>>>> This patch adds new device facility to save and restore virtqueue
>>>>> state. The virtqueue state is split into two parts:
>>>>>
>>>>> - The available state: The state that is used for read the next
>>>>>    available buffer.
>>>>> - The used state: The state that is used for making buffer used.
>>>>>
>>>>> Note that, there could be devices that are required to set and get the
>>>>> requests that are being processed by the device. I leave such API to
>>>>> be device specific.
>>>>>
>>>>> This facility could be used by both migration and device diagnostic.
>>>>>
>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>> Hi Jason!
>>>> I feel that for use-cases such as SRIOV,
>>>> the facility to save/restore vq should be part of a PF
>>>> that is there needs to be a way for one virtio device to
>>>> address the state of another one.
>>>>
>>> Hi!
>>>
>>> In my opinion we should go the other way around: To make features as
>>> orthogonal/independent as possible, and just make them work together
>>> if we have to. In this particular case, I think it should be easier to
>>> decide how to report status, its needs, etc for a VF, and then open
>>> the possibility for the PF to query or set them, reusing format,
>>> behavior, etc. as much as possible.
>>>
>>> I think that the most controversial point about doing it non-SR IOV
>>> way is the exposing of these features/fields to the guest using
>>> specific transport facilities, like PCI common config. However I think
>>> it should not be hard for the hypervisor to intercept them and even to
>>> expose them conditionally. Please correct me if this guessing was not
>>> right and you had other concerns.
>>
>> Possibly. I'd like to see some guidance on how this all will work
>> in practice then. Maybe make it all part of a non-normative section
>> for now.
>> I think that the feature itself is not very useful outside of
>> migration so we don't really gain much by adding it as is
>> without all the other missing pieces.
>
>
> For a networking device, the only missing part is the transport 
> implementation of the virtqueue state.


So I've posted a patch to implement the virtqueue state for PCI. This 
should be sufficient for a virtio-PCI device to be migrated.

Thanks


>
>
>> I would say let's see more of the whole picture before we commit.
>
>
> I will include an implementation of PCI as an example.
>
> Thanks
>
>
>>
>>
>>
>>>> Thoughts?
>>>>
>>>>> ---
>>>>>   content.tex | 117 
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>   1 file changed, 117 insertions(+)
>>>>>
>>>>> diff --git a/content.tex b/content.tex
>>>>> index 620c0e2..8877b6f 100644
>>>>> --- a/content.tex
>>>>> +++ b/content.tex
>>>>> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic 
>>>>> Facilities of a Virtio Device / Expo
>>>>>   types. It is RECOMMENDED that devices generate version 4
>>>>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>>
>>>>> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
>>>>> +
>>>>> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
>>>>> +get the device internal virtqueue state through the following
>>>>> +fields. The way to access those fields is transport specific.
>>>>> +
>>>>> +\subsection{\field{Available State} Field}
>>>>> +
>>>>> +The \field{Available State} field is two bytes for the driver to get
>>>>> +or set the state that is used by the virtqueue to read for the next
>>>>> +available buffer.
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        last_avail_idx : 16;
>>>>> +} avail_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{last_avail_idx} field indicates where the device would 
>>>>> read
>>>>> +for the next index from the virtqueue available ring (modulo the 
>>>>> queue
>>>>> + size). This starts at the value set by the driver, and increases.
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        last_avail_idx : 15;
>>>>> +        last_avail_wrap_counter : 1;
>>>>> +} avail_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{last_avail_idx} field indicates where the device would 
>>>>> read for
>>>>> +the next descriptor head from the descriptor ring. This starts at 
>>>>> the
>>>>> +value set by the driver and wraps around when reaching the end of 
>>>>> the
>>>>> +ring.
>>>>> +
>>>>> +The \field{last_avail_wrap_counter} field indicates the last 
>>>>> Driver Ring
>>>>> +Wrap Counter that is observed by the device. This starts at the 
>>>>> value
>>>>> +set by the driver, and is flipped when reaching the end of the ring.
>>>>> +
>>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap 
>>>>> Counters}.
>>>>> +
>>>>> +\subsection{\field{Used State} Field}
>>>>> +
>>>>> +The \field{Used State} field is two bytes for the driver to set and
>>>>> +get the state used by the virtqueue to make buffer used.
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is not negotiated, the used state 
>>>>> contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        used_idx : 16;
>>>>> +} used_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{used_idx} where the device would write the next used
>>>>> +descriptor head to the used ring (modulo the queue size). This 
>>>>> starts
>>>>> +at the value set by the driver, and increases. It is easy to see 
>>>>> this
>>>>> +is the initial value of the \field{idx} in the used ring.
>>>>> +
>>>>> +See also \ref{sec:Basic Facilities of a Virtio Device / 
>>>>> Virtqueues / The Virtqueue Used Ring}
>>>>> +
>>>>> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
>>>>> +
>>>>> +\begin{lstlisting}
>>>>> +le16 {
>>>>> +        used_idx : 15;
>>>>> +        used_wrap_counter : 1;
>>>>> +} used_state;
>>>>> +\end{lstlisting}
>>>>> +
>>>>> +The \field{used_idx} indicates where the device would write the 
>>>>> next used
>>>>> +descriptor head to the descriptor ring. This starts at the value set
>>>>> +by the driver, and wraps around when reaching the end of the ring.
>>>>> +
>>>>> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This 
>>>>> starts
>>>>> +at the value set by the driver, and is flipped when reaching the end
>>>>> +of the ring.
>>>>> +
>>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap 
>>>>> Counters}.
>>>>
>>>> Above only fully describes the vq state if descriptors
>>>> are used in order or at least all out of order descriptors are 
>>>> consumed
>>>> at time of save.
>>>>
>>> I think that the most straightforward solution would be to add
>>> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
>>> part.
>>>
>>> Thanks!
>>>
>>>> Adding later option to devices such as net will need extra spec work.
>>>>
>>>>
>>>>> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities 
>>>>> of a Virtio Device / Virtqueue State}
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>>> +\begin{itemize}
>>>>> +\item A driver MUST NOT set the virtqueue state before setting the
>>>>> +  FEATURE_OK status bit.
>>>>> +\item A driver MUST NOT set the virtqueue state after setting the
>>>>> +  DRIVER_OK status bit.
>>>>> +\end{itemize}
>>>>> +
>>>>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities 
>>>>> of a Virtio Device / Virtqueue State}
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST ignore
>>>>> +the read and write to the virtqueue state.
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>>> +\begin{itemize}
>>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>>> +FEATURE_OK status bit is not set.
>>>>> +\item A device SHOULD ignore the write to the virtqueue state if the
>>>>> +DRIVER_OK status bit is set.
>>>>> +\end{itemize}
>>>>> +
>>>>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>>>>
>>>> may have?
>>>> should also go into a normative section
>>>>
>>>>> +device-specific way for the driver to set and get extra virtqueue
>>>>> +states such as in flight requests.
>>>>> +
>>>>>   \chapter{General Initialization And Device 
>>>>> Operation}\label{sec:General Initialization And Device Operation}
>>>>>
>>>>>   We start with an overview of device initialization, then expand 
>>>>> on the
>>>>> @@ -420,6 +530,9 @@ \section{Device 
>>>>> Initialization}\label{sec:General Initialization And Device Oper
>>>>>      device, optional per-bus setup, reading and possibly writing the
>>>>>      device's virtio configuration space, and population of 
>>>>> virtqueues.
>>>>>
>>>>> +\item\label{itm:General Initialization And Device Operation / Device
>>>>> +  Initialization / Virtqueue State Setup} When 
>>>>> VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state 
>>>>> setup, including the initialization of the per virtqueue available 
>>>>> state, used state and the possible device specific virtqueue state.
>>>>> +
>>>>>   \item\label{itm:General Initialization And Device Operation / 
>>>>> Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status 
>>>>> bit.  At this point the device is
>>>>>      ``live''.
>>>>>   \end{enumerate}
>>>>> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature 
>>>>> Bits}\label{sec:Reserved Feature Bits}
>>>>>     transport specific.
>>>>>     For more details about driver notifications over PCI see 
>>>>> \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / 
>>>>> PCI-specific Initialization And Device Operation / Available 
>>>>> Buffer Notifications}.
>>>>>
>>>>> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the 
>>>>> driver
>>>>> +  can set and get the device internal virtqueue state.
>>>>> +  See \ref{sec:Virtqueues / Virtqueue 
>>>>> State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>>>> +
>>>>>   \end{description}
>>>>>
>>>>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved 
>>>>> Feature Bits}
>>>>> -- 
>>>>> 2.25.1


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-07  2:50           ` Jason Wang
@ 2021-07-07 12:03             ` Max Gurtovoy
  2021-07-07 12:11               ` [virtio-comment] " Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-07 12:03 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin
  Cc: virtio-comment, Virtio-Dev, Stefan Hajnoczi, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 7/7/2021 5:50 AM, Jason Wang wrote:
>
> On 2021/7/7 7:49 AM, Max Gurtovoy wrote:
>>
>> On 7/6/2021 10:08 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
>>>> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin <mst@redhat.com> 
>>>> wrote:
>>>>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>>>>> This patch adds new device facility to save and restore virtqueue
>>>>>> state. The virtqueue state is split into two parts:
>>>>>>
>>>>>> - The available state: The state that is used for read the next
>>>>>>    available buffer.
>>>>>> - The used state: The state that is used for making buffer used.
>>>>>>
>>>>>> Note that, there could be devices that are required to set and get 
>>>>>> the
>>>>>> requests that are being processed by the device. I leave such API to
>>>>>> be device specific.
>>>>>>
>>>>>> This facility could be used by both migration and device diagnostic.
>>>>>>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> Hi Jason!
>>>>> I feel that for use-cases such as SRIOV,
>>>>> the facility to save/restore vq should be part of a PF
>>>>> that is there needs to be a way for one virtio device to
>>>>> address the state of another one.
>>>>>
>>>> Hi!
>>>>
>>>> In my opinion we should go the other way around: To make features as
>>>> orthogonal/independent as possible, and just make them work together
>>>> if we have to. In this particular case, I think it should be easier to
>>>> decide how to report status, its needs, etc for a VF, and then open
>>>> the possibility for the PF to query or set them, reusing format,
>>>> behavior, etc. as much as possible.
>>>>
>>>> I think that the most controversial point about doing it non-SR IOV
>>>> way is the exposing of these features/fields to the guest using
>>>> specific transport facilities, like PCI common config. However I think
>>>> it should not be hard for the hypervisor to intercept them and even to
>>>> expose them conditionally. Please correct me if this guessing was not
>>>> right and you had other concerns.
>>>
>>> Possibly. I'd like to see some guidance on how this all will work
>>> in practice then. Maybe make it all part of a non-normative section
>>> for now.
>>> I think that the feature itself is not very useful outside of
>>> migration so we don't really gain much by adding it as is
>>> without all the other missing pieces.
>>> I would say let's see more of the whole picture before we commit.
>>
>> I agree here. I also can't see the whole picture for SRIOV case.
>
>
> Again, it's not related to SR-IOV at all. It tries to introduce a basic 
> facility at the virtio level which can work for all types of virtio 
> devices.
>
> A transport such as PCI needs to implement its own way to access that 
> state. It's not hard to implement, simply via a capability.
>
> It works like other basic facilities such as device status, features, etc.
>
> For SR-IOV, it doesn't prevent you from implementing that via the 
> admin virtqueue.
>
>
>>
>> I'll try to combine the admin control queue suggested in previous 
>> patch set to my proposal of PF managing the VF migration.
>
>
> Note that the admin virtqueue should be transport independent when 
> it is introduced.
>
>
>>
>> Feature negotiation is part of virtio device-driver communication and 
>> not part of the migration software that should manage the migration 
>> process.
>>
>> For me, seems like queue state is something that should be internal 
>> and not be exposed to guest drivers that see this as a new feature.
>
>
> This is not true; we have the case of nested virtualization. As 
> mentioned in another thread, it's the hypervisor that needs to choose 
> between hiding or shadowing the internal virtqueue state.
>
> Thanks

In the nested environment, do you mean that Level 1 is the real PF with X 
VFs, and in Level 2 those X VFs are seen as PFs in the guests and expose another Y VFs?

If so, the guest PF will manage the migration for its Y VFs.


>
>
>>
>>>
>>>
>>>
>>>>> Thoughts?
>>>>>
>>>>>> ---
>>>>>>   content.tex | 117 
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>   1 file changed, 117 insertions(+)
>>>>>>
>>>>>> diff --git a/content.tex b/content.tex
>>>>>> index 620c0e2..8877b6f 100644
>>>>>> --- a/content.tex
>>>>>> +++ b/content.tex
>>>>>> @@ -385,6 +385,116 @@ \section{Exporting Objects}\label{sec:Basic 
>>>>>> Facilities of a Virtio Device / Expo
>>>>>>   types. It is RECOMMENDED that devices generate version 4
>>>>>>   UUIDs as specified by \hyperref[intro:rfc4122]{[RFC4122]}.
>>>>>>
>>>>>> +\section{Virtqueue State}\label{sec:Virtqueues / Virtqueue State}
>>>>>> +
>>>>>> +When VIRTIO_F_RING_STATE is negotiated, the driver can set and
>>>>>> +get the device internal virtqueue state through the following
>>>>>> +fields. The way to access those fields is transport specific.
>>>>>> +
>>>>>> +\subsection{\field{Available State} Field}
>>>>>> +
>>>>>> +The \field{Available State} field is two bytes for the driver to 
>>>>>> get
>>>>>> +or set the state that is used by the virtqueue to read for the next
>>>>>> +available buffer.
>>>>>> +
>>>>>> +When VIRTIO_F_RING_PACKED is not negotiated, it contains:
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +le16 {
>>>>>> +        last_avail_idx : 16;
>>>>>> +} avail_state;
>>>>>> +\end{lstlisting}
>>>>>> +
>>>>>> +The \field{last_avail_idx} field indicates where the device 
>>>>>> would read
>>>>>> +for the next index from the virtqueue available ring (modulo the 
>>>>>> queue
>>>>>> + size). This starts at the value set by the driver, and increases.
>>>>>> +
>>>>>> +When VIRTIO_F_RING_PACKED is negotiated, it contains:
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +le16 {
>>>>>> +        last_avail_idx : 15;
>>>>>> +        last_avail_wrap_counter : 1;
>>>>>> +} avail_state;
>>>>>> +\end{lstlisting}
>>>>>> +
>>>>>> +The \field{last_avail_idx} field indicates where the device 
>>>>>> would read for
>>>>>> +the next descriptor head from the descriptor ring. This starts 
>>>>>> at the
>>>>>> +value set by the driver and wraps around when reaching the end 
>>>>>> of the
>>>>>> +ring.
>>>>>> +
>>>>>> +The \field{last_avail_wrap_counter} field indicates the last 
>>>>>> Driver Ring
>>>>>> +Wrap Counter that is observed by the device. This starts at the 
>>>>>> value
>>>>>> +set by the driver, and is flipped when reaching the end of the 
>>>>>> ring.
>>>>>> +
>>>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring 
>>>>>> Wrap Counters}.
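When VIRTIO_F_RING_PACKED is not negotiated the whole le16 is the index, so no packing is needed; for the packed layout quoted above, a driver might pack and unpack the 2-byte field as below (helper names are ours, not from the spec; the field is little-endian on the wire, and these helpers work on the host-endian value):

```c
#include <stdint.h>

/* Hypothetical helpers for the packed-ring avail_state quoted above:
 * last_avail_idx lives in bits 0..14 and last_avail_wrap_counter in
 * bit 15 of the 2-byte field. */
static inline uint16_t avail_state_pack(uint16_t last_avail_idx,
                                        unsigned wrap_counter)
{
    return (uint16_t)((last_avail_idx & 0x7fff) | ((wrap_counter & 1u) << 15));
}

static inline uint16_t avail_state_idx(uint16_t state)
{
    return state & 0x7fff;   /* strip the wrap counter bit */
}

static inline unsigned avail_state_wrap(uint16_t state)
{
    return state >> 15;      /* the last observed Driver Ring Wrap Counter */
}
```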
>>>>>> +
>>>>>> +\subsection{\field{Used State} Field}
>>>>>> +
>>>>>> +The \field{Used State} field is two bytes for the driver to set and
>>>>>> +get the state used by the virtqueue to make buffer used.
>>>>>> +
>>>>>> +When VIRTIO_F_RING_PACKED is not negotiated, the used state 
>>>>>> contains:
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +le16 {
>>>>>> +        used_idx : 16;
>>>>>> +} used_state;
>>>>>> +\end{lstlisting}
>>>>>> +
>>>>>> +The \field{used_idx} field indicates where the device would write the next used
>>>>>> +descriptor head to the used ring (modulo the queue size). This 
>>>>>> starts
>>>>>> +at the value set by the driver, and increases. It is easy to see 
>>>>>> this
>>>>>> +is the initial value of the \field{idx} in the used ring.
>>>>>> +
>>>>>> +See also \ref{sec:Basic Facilities of a Virtio Device / 
>>>>>> Virtqueues / The Virtqueue Used Ring}
>>>>>> +
>>>>>> +When VIRTIO_F_RING_PACKED is negotiated, the used state contains:
>>>>>> +
>>>>>> +\begin{lstlisting}
>>>>>> +le16 {
>>>>>> +        used_idx : 15;
>>>>>> +        used_wrap_counter : 1;
>>>>>> +} used_state;
>>>>>> +\end{lstlisting}
>>>>>> +
>>>>>> +The \field{used_idx} indicates where the device would write the 
>>>>>> next used
>>>>>> +descriptor head to the descriptor ring. This starts at the value 
>>>>>> set
>>>>>> +by the driver, and wraps around when reaching the end of the ring.
>>>>>> +
>>>>>> +\field{used_wrap_counter} is the Device Ring Wrap Counter. This 
>>>>>> starts
>>>>>> +at the value set by the driver, and is flipped when reaching the 
>>>>>> end
>>>>>> +of the ring.
>>>>>> +
>>>>>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring 
>>>>>> Wrap Counters}.
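For the split virtqueue, the remark above that \field{used_idx} starts as the \field{idx} of the used ring can be made concrete (the struct layout follows the existing split-ring spec; the helper name is ours):

```c
#include <stdint.h>

/* Header of the split virtqueue used ring, as already defined by the
 * spec: a flags word followed by idx, where the device writes the next
 * used element. */
struct vring_used_hdr {
    uint16_t flags;
    uint16_t idx;
};

/* Illustrative only: the used_state a driver saves should match the idx
 * the device last published in the used ring (modulo the 16-bit wrap). */
static uint16_t used_state_from_ring(const struct vring_used_hdr *used)
{
    return used->idx;
}
```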
>>>>>
>>>>> Above only fully describes the vq state if descriptors
>>>>> are used in order or at least all out of order descriptors are 
>>>>> consumed
>>>>> at time of save.
>>>>>
>>>> I think that the most straightforward solution would be to add
>>>> something similar to VHOST_USER_GET_INFLIGHT_FD, but without the _FD
>>>> part.
>>>>
>>>> Thanks!
>>>>
>>>>> Adding later option to devices such as net will need extra spec work.
>>>>>
>>>>>
>>>>>> +\drivernormative{\subsection}{Virtqueue State}{Basic Facilities 
>>>>>> of a Virtio Device / Virtqueue State}
>>>>>> +
>>>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>>>> +\begin{itemize}
>>>>>> +\item A driver MUST NOT set the virtqueue state before setting the
>>>>>> +  FEATURE_OK status bit.
>>>>>> +\item A driver MUST NOT set the virtqueue state after setting the
>>>>>> +  DRIVER_OK status bit.
>>>>>> +\end{itemize}
>>>>>> +
>>>>>> +\devicenormative{\subsection}{Virtqueue State}{Basic Facilities 
>>>>>> of a Virtio Device / Virtqueue State}
>>>>>> +
>>>>>> +If VIRTIO_F_RING_STATE has not been negotiated, a device MUST 
>>>>>> ignore
>>>>>> +the read and write to the virtqueue state.
>>>>>> +
>>>>>> +If VIRTIO_F_RING_STATE has been negotiated:
>>>>>> +\begin{itemize}
>>>>>> +\item A device SHOULD ignore the write to the virtqueue state if 
>>>>>> the
>>>>>> +FEATURE_OK status bit is not set.
>>>>>> +\item A device SHOULD ignore the write to the virtqueue state if 
>>>>>> the
>>>>>> +DRIVER_OK status bit is set.
>>>>>> +\end{itemize}
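The two device-normative rules above reduce to a status-gated check; a minimal sketch, assuming the standard status bit values (function name is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_DRIVER_OK   4u

/* A device honors a virtqueue-state write only in the window after
 * FEATURES_OK and before DRIVER_OK; everything else SHOULD be ignored. */
static bool vq_state_write_allowed(uint8_t device_status)
{
    if (!(device_status & VIRTIO_STATUS_FEATURES_OK))
        return false;   /* too early: features not yet accepted */
    if (device_status & VIRTIO_STATUS_DRIVER_OK)
        return false;   /* too late: the device is already live */
    return true;
}
```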
>>>>>> +
>>>>>> +If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
>>>>>
>>>>> may have?
>>>>> should also go into a normative section
>>>>>
>>>>>> +device-specific way for the driver to set and get extra virtqueue
>>>>>> +states such as in flight requests.
>>>>>> +
>>>>>>   \chapter{General Initialization And Device 
>>>>>> Operation}\label{sec:General Initialization And Device Operation}
>>>>>>
>>>>>>   We start with an overview of device initialization, then expand 
>>>>>> on the
>>>>>> @@ -420,6 +530,9 @@ \section{Device 
>>>>>> Initialization}\label{sec:General Initialization And Device Oper
>>>>>>      device, optional per-bus setup, reading and possibly writing 
>>>>>> the
>>>>>>      device's virtio configuration space, and population of 
>>>>>> virtqueues.
>>>>>>
>>>>>> +\item\label{itm:General Initialization And Device Operation / 
>>>>>> Device
>>>>>> +  Initialization / Virtqueue State Setup} When 
>>>>>> VIRTIO_F_RING_STATE has been negotiated, perform virtqueue state 
>>>>>> setup, including the initialization of the per virtqueue 
>>>>>> available state, used state and the possible device specific 
>>>>>> virtqueue state.
>>>>>> +
>>>>>>   \item\label{itm:General Initialization And Device Operation / 
>>>>>> Device Initialization / Set DRIVER-OK} Set the DRIVER_OK status 
>>>>>> bit.  At this point the device is
>>>>>>      ``live''.
>>>>>>   \end{enumerate}
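When VIRTIO_F_RING_STATE is negotiated, the extra step above slots in between FEATURES_OK and DRIVER_OK; a driver-side sketch using a toy device model (structure and names are illustrative, not from the spec):

```c
#include <stdint.h>

#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_DRIVER_OK   4u

/* Toy device exposing the per-queue state fields directly; a real
 * transport would access them through transport-specific registers. */
struct toy_dev { uint8_t status; uint16_t avail_state, used_state; };

/* Virtqueue state is programmed after FEATURES_OK and before DRIVER_OK. */
static void restore_and_start(struct toy_dev *d,
                              uint16_t avail, uint16_t used)
{
    d->status |= VIRTIO_STATUS_FEATURES_OK;  /* features accepted        */
    d->avail_state = avail;                  /* virtqueue state setup... */
    d->used_state  = used;                   /* ...before going live     */
    d->status |= VIRTIO_STATUS_DRIVER_OK;    /* device is now "live"     */
}
```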
>>>>>> @@ -6596,6 +6709,10 @@ \chapter{Reserved Feature 
>>>>>> Bits}\label{sec:Reserved Feature Bits}
>>>>>>     transport specific.
>>>>>>     For more details about driver notifications over PCI see 
>>>>>> \ref{sec:Virtio Transport Options / Virtio Over PCI Bus / 
>>>>>> PCI-specific Initialization And Device Operation / Available 
>>>>>> Buffer Notifications}.
>>>>>>
>>>>>> +  \item[VIRTIO_F_RING_STATE(40)] This feature indicates that the 
>>>>>> driver
>>>>>> +  can set and get the device internal virtqueue state.
>>>>>> +  See \ref{sec:Virtqueues / Virtqueue 
>>>>>> State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>>>>> +
>>>>>>   \end{description}
>>>>>>
>>>>>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved 
>>>>>> Feature Bits}
>>>>>> -- 
>>>>>> 2.25.1
>>
>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility
  2021-07-07 12:03             ` Max Gurtovoy
@ 2021-07-07 12:11               ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-07 12:11 UTC (permalink / raw)
  To: virtio-comment


On 2021/7/7 8:03 PM, Max Gurtovoy wrote:
>
> On 7/7/2021 5:50 AM, Jason Wang wrote:
>>
>> On 2021/7/7 7:49 AM, Max Gurtovoy wrote:
>>>
>>> On 7/6/2021 10:08 PM, Michael S. Tsirkin wrote:
>>>> On Tue, Jul 06, 2021 at 07:09:10PM +0200, Eugenio Perez Martin wrote:
>>>>> On Tue, Jul 6, 2021 at 11:32 AM Michael S. Tsirkin 
>>>>> <mst@redhat.com> wrote:
>>>>>> On Tue, Jul 06, 2021 at 12:33:33PM +0800, Jason Wang wrote:
>>>>>>> This patch adds new device facility to save and restore virtqueue
>>>>>>> state. The virtqueue state is split into two parts:
>>>>>>>
>>>>>>> - The available state: The state that is used for read the next
>>>>>>>    available buffer.
>>>>>>> - The used state: The state that is used for making buffer used.
>>>>>>>
>>>>>>> Note that, there could be devices that is required to set and 
>>>>>>> get the
>>>>>>> requests that are being processed by the device. I leave such 
>>>>>>> API to
>>>>>>> be device specific.
>>>>>>>
>>>>>>> This facility could be used by both migration and device 
>>>>>>> diagnostic.
>>>>>>>
>>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>>> Hi Jason!
>>>>>> I feel that for use-cases such as SRIOV,
>>>>>> the facility to save/restore vq should be part of a PF
>>>>>> that is there needs to be a way for one virtio device to
>>>>>> address the state of another one.
>>>>>>
>>>>> Hi!
>>>>>
>>>>> In my opinion we should go the other way around: To make features as
>>>>> orthogonal/independent as possible, and just make them work together
>>>>> if we have to. In this particular case, I think it should be 
>>>>> easier to
>>>>> decide how to report status, its needs, etc for a VF, and then open
>>>>> the possibility for the PF to query or set them, reusing format,
>>>>> behavior, etc. as much as possible.
>>>>>
>>>>> I think that the most controversial point about doing it non-SR IOV
>>>>> way is the exposing of these features/fields to the guest using
>>>>> specific transport facilities, like PCI common config. However I 
>>>>> think
>>>>> it should not be hard for the hypervisor to intercept them and 
>>>>> even to
>>>>> expose them conditionally. Please correct me if this guessing was not
>>>>> right and you had other concerns.
>>>>
>>>> Possibly. I'd like to see some guidance on how this all will work
>>>> in practice then. Maybe make it all part of a non-normative section
>>>> for now.
>>>> I think that the feature itself is not very useful outside of
>>>> migration so we don't really gain much by adding it as is
>>>> without all the other missing pieces.
>>>> I would say let's see more of the whole picture before we commit.
>>>
>>> I agree here. I also can't see the whole picture for SRIOV case.
>>
>>
>> Again, it's not related to SR-IOV at all. It tries to introduce basic 
>> facility in the virtio level which can work for all types of virtio 
>> device.
>>
>> Transport such as PCI need to implement its own way to access those 
>> state. It's not hard to implement them simply via capability.
>>
>> It works like other basic facility like device status, features etc.
>>
>> For SR-IOV, it doesn't prevent you from implementing that via the 
>> admin virtqueue.
>>
>>
>>>
>>> I'll try to combine the admin control queue suggested in previous 
>>> patch set to my proposal of PF managing the VF migration.
>>
>>
>> Note that the admin virtqueue should be transport independent when 
>> trying to introduce them.
>>
>>
>>>
>>> Feature negotiation is part of virtio device-driver communication 
>>> and not part of the migration software that should manage the 
>>> migration process.
>>>
>>> For me, seems like queue state is something that should be internal 
>>> and not be exposed to guest drivers that see this as a new feature.
>>
>>
>> This is not true, we have the case of nested virtualization. As 
>> mentioned in another thread, it's the hypervisor that need to choose 
>> between hiding or shadowing the internal virtqueue state.
>>
>> Thanks
>
> In the nested environment, do you mean that Level 1 is the real PF 
> with X VFs, and in Level 2 those X VFs are seen as PFs in the guests 
> and expose another Y VFs?


I meant the PF is managed in L0, and the VF is assigned to the L2 guest. 
In this case, we can expose the virtqueue state feature to the L1 guest 
for migrating the L2 guest.


>
> If so, the guest PF will manage the migration for its Y VFs. 


Does this mean you want to pass the PF to the L1 guest?

Thanks


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* [virtio-comment] Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-07  2:56         ` Jason Wang
@ 2021-07-07 16:45           ` Cornelia Huck
  2021-07-08  4:06             ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-07 16:45 UTC (permalink / raw)
  To: Jason Wang, mst, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic

On Wed, Jul 07 2021, Jason Wang <jasowang@redhat.com> wrote:

> On 2021/7/6 10:27 PM, Cornelia Huck wrote:
>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>
>>> On 2021/7/6 8:50 PM, Cornelia Huck wrote:
>>>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>>>>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>>>>> +is set to synchronize with the device.
>>>> Is this more that the driver needs to re-read the status until STOP is
>>>> set to make sure that the stop process has finished?
>>>
>>> Yes.
>>>
>>>
>>>>    If the device has
>>>> offered the feature and the driver accepted it, I'd assume that the
>>>> device will eventually finish with the procedure, or sets NEEDS_RESET if
>>>> something goes wrong?
>>>
>>> As stated below, the device must either:
>>>
>>> 1) finish all pending requests
>>>
>>> or
>>>
>>> 2) provide a device specific way for the driver to save and restore
>>> pending requests
>>>
>>> before setting STOP.
>>>
>>> Otherwise the device can't offer this feature.
>>>
>>> Using NEEDS_RESET seems more complicated than this.
>> Hm, what happens on an internal error?
>
>
> A question, can reset fail? If yes, do we need to define how to proceed 
> in the driver side?
>
> If not, I don't see the reason we need to deal with that in STOP.

When you put it that way, it makes sense. Let's just keep it simple,
then.
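The "re-read until STOP is set" handshake discussed at the top of this message can be sketched with a toy device model (the toy acknowledges STOP immediately; a real device may take arbitrarily long while it drains in-flight requests, and the spec text defines no timeout):

```c
#include <stdint.h>

#define VIRTIO_STATUS_STOP 32u

/* Toy stand-in for a device status register; real drivers would go
 * through a transport-specific accessor instead. */
struct toy_dev { uint8_t status; };

static uint8_t read_status(struct toy_dev *d)             { return d->status; }
static void    write_status(struct toy_dev *d, uint8_t s) { d->status = s; }

/* Write STOP, then poll the status until the device reports the bit;
 * only then is it safe to read the virtqueue state. */
static void stop_device(struct toy_dev *d)
{
    write_status(d, read_status(d) | VIRTIO_STATUS_STOP);
    while (!(read_status(d) & VIRTIO_STATUS_STOP))
        ;   /* device still finishing or parking in-flight requests */
}
```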





* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-07 16:45           ` [virtio-comment] " Cornelia Huck
@ 2021-07-08  4:06             ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-08  4:06 UTC (permalink / raw)
  To: Cornelia Huck, mst, virtio-comment, virtio-dev
  Cc: stefanha, mgurtovoy, eperezma, oren, shahafs, parav, bodong,
	amikheev, pasic


On 2021/7/8 12:45 AM, Cornelia Huck wrote:
> On Wed, Jul 07 2021, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2021/7/6 10:27 PM, Cornelia Huck wrote:
>>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On 2021/7/6 8:50 PM, Cornelia Huck wrote:
>>>>> On Tue, Jul 06 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>>>>>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>>>>>> +is set to synchronize with the device.
>>>>> Is this more that the driver needs to re-read the status until STOP is
>>>>> set to make sure that the stop process has finished?
>>>> Yes.
>>>>
>>>>
>>>>>     If the device has
>>>>> offered the feature and the driver accepted it, I'd assume that the
>>>>> device will eventually finish with the procedure, or sets NEEDS_RESET if
>>>>> something goes wrong?
>>>> As stated below, the device must either:
>>>>
>>>> 1) finish all pending requests
>>>>
>>>> or
>>>>
>>>> 2) provide a device specific way for the driver to save and restore
>>>> pending requests
>>>>
>>>> before setting STOP.
>>>>
>>>> Otherwise the device can't offer this feature.
>>>>
>>>> Using NEEDS_RESET seems more complicated than this.
>>> Hm, what happens on an internal error?
>>
>> A question, can reset fail? If yes, do we need to define how to proceed
>> in the driver side?
>>
>> If not, I don't see the reason we need to deal with that in STOP.
> When you put it that way, it makes sense. Let's just keep it simple,
> then.


I agree, and according to the spec, reset should succeed, since it is 
what is used to recover from NEEDS_RESET:

"""

The driver SHOULD NOT rely on completion of operations of a device if 
DEVICE_NEEDS_RESET is set. Note: For example, the driver can’t assume 
requests in flight will be completed if DEVICE_NEEDS_RESET is set, nor 
can it assume that they have not been completed. A good implementation 
will try to recover by issuing a reset.

"""

Thanks


>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-07  3:20     ` Jason Wang
@ 2021-07-09 17:23       ` Eugenio Perez Martin
  2021-07-10 20:36         ` Michael S. Tsirkin
  2021-07-12  3:53         ` Jason Wang
  0 siblings, 2 replies; 115+ messages in thread
From: Eugenio Perez Martin @ 2021-07-09 17:23 UTC (permalink / raw)
  To: Jason Wang, Dr. David Alan Gilbert
  Cc: Michael Tsirkin, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Max Gurtovoy, Cornelia Huck, Oren Duer, Shahaf Shuler,
	Parav Pandit, Bodong Wang, Alexander Mikheev, Halil Pasic

> >   If I understand correctly, this is all
> > driven from the driver inside the guest, so for this to work
> > the guest must be running and already have initialised the driver.
>
>
> Yes.
>

As I see it, the feature can be driven entirely by the VMM as long as
it intercepts the relevant configuration space (PCI, MMIO, etc.) reads
and writes from the guest, and presents it as coherent and transparent
to the guest. Some use cases I can imagine with a physical device (or
vp_vdpa device) with VIRTIO_F_STOP:

1) The VMM chooses not to pass the feature flag. The guest cannot stop
the device, so any write to this flag is an error/undefined.
2) The VMM passes the flag to the guest. The guest can stop the device.
2.1) The VMM stops the device to perform a live migration, and the
guest does not write to STOP at any moment of the LM. It resets the
destination device with the state, and then initializes the device.
2.2) The guest stops the device and, when STOP(32) is set, the source
VMM migrates the device status. The destination VMM realizes the bit
is set, so it sets the bit in the destination too after device
initialization.
2.3) The device is not initialized by the guest, so it doesn't matter
what bit the HW has, but the VM can be migrated.
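Case 1 above, where the VMM chooses not to pass the flag, amounts to masking the bit out of the features the guest sees; a minimal sketch, assuming VIRTIO_F_STOP is bit 41 as in the patch (function name is illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_STOP 41

/* The VMM filters the device's feature bits before offering them to the
 * guest: hiding STOP keeps the guest from ever stopping the device. */
static uint64_t filter_guest_features(uint64_t device_features,
                                      bool expose_stop)
{
    if (!expose_stop)
        device_features &= ~(1ULL << VIRTIO_F_STOP);
    return device_features;
}
```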

Am I missing something?

Thanks!



* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
  2021-07-06  9:24   ` [virtio-comment] " Dr. David Alan Gilbert
  2021-07-06 12:50   ` [virtio-comment] " Cornelia Huck
@ 2021-07-09 17:35   ` Eugenio Perez Martin
  2021-07-12  4:06     ` Jason Wang
  2021-07-10 20:40   ` Michael S. Tsirkin
  3 siblings, 1 reply; 115+ messages in thread
From: Eugenio Perez Martin @ 2021-07-09 17:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael Tsirkin, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Max Gurtovoy, Cornelia Huck, Oren Duer, Shahaf Shuler,
	Parav Pandit, Bodong Wang, Alexander Mikheev, Halil Pasic

On Tue, Jul 6, 2021 at 6:34 AM Jason Wang <jasowang@redhat.com> wrote:
>
> This patch introduces a new status bit STOP. This will be
> used by the driver to stop the device in order to safely fetch the
> device state (virtqueue state) from the device.
>
> This is a must for supporting migration.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 66 insertions(+), 3 deletions(-)
>
> diff --git a/content.tex b/content.tex
> index 8877b6f..284ead0 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>    drive the device.
>
> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device has been stopped by the driver. This status bit is different
> +  from the reset since the device state is preserved.
> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
>  \end{description}
> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>
> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
> +DRIVER_OK is not set.
> +
> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
> +STOP, the driver MUST re-read the device status to ensure the STOP bit
> +is set to synchronize with the device.
> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  The device MUST initialize \field{device status} to 0 upon reset.
>
>  The device MUST NOT consume buffers or send any used buffer
>  notifications to the driver before DRIVER_OK.
>
> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
> +write of STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
> +the device MUST finish any pending operations like in flight requests
> +or have its device specific way for driver to save the pending
> +operations like in flight requests before setting the STOP status bit.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
> +buffers or send any used buffer notifications to the driver after
> +STOP. The device MUST keep the configuration space unchanged and MUST
> +NOT send configuration space change notification to the driver after
> +STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
> +preserve all the necessary state (the virtqueue states with the
> +possible device specific states) that is required for restoring in the
> +future.
> +
>  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>  \begin{itemize}
>  \item A driver MUST NOT set the virtqueue state before setting the
>    FEATURE_OK status bit.
> -\item A driver MUST NOT set the virtqueue state after setting the
> -  DRIVER_OK status bit.
> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
> +  bit is set without STOP status bit.
>  \end{itemize}
>
>  \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>  \item A device SHOULD ignore the write to the virtqueue state if the
>  FEATURE_OK status bit is not set.
>  \item A device SHOULD ignore the write to the virtqueue state if the
> -DRIVER_OK status bit is set.
> +DRIVER_OK status bit is set without STOP status bit.
>  \end{itemize}
>
>  If VIRTIO_F_RING_STATE has been negotiated, a device MAY has its
> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>
>  Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>
> +\section{Virtqueue State Saving}
> +
> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated. A
> +driver MAY save the internal virtqueue state.
> +
> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
> +
> +Assuming the device is 'live'. The driver MUST follow this sequence to
> +stop the device and save the virtqueue state:
> +
> +\begin{enumerate}
> +\item Set the STOP status bit.
> +
> +\item Re-read \field{device status} until the STOP bit is set to
> +  synchronize with the device.
> +
> +\item Read \field{available state} and save it.
> +
> +\item Read \field{used state} and save it.
> +
> +\item Read device specific virtqueue states if needed.
> +
> +\item Reset the device.

Maybe I'm being too nitpicky here, but the next user of the device
must reset anyway to start using it (as "driver initialization"
states). Maybe it's better to remove this "reset the device" from the
enumeration and ... [1]

> +\end{enumerate}
> +
> +The driver MAY perform device specific steps to save device specific sate.

s/sate/state/

> +
> +The driver MAY 'resume' the device by redoing the device initialization
> +with the saved virtqueue state. See ref\{sec:General Initialization and Device Operation / Device Initialization}

[1] just leave this clarification? If we leave the reset in the
enumeration, it seems that the reset step is a MUST to save the
virtqueue state.

> +
>  \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>
>  Virtio can use various different buses, thus the standard is split
> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    can set and get the device internal virtqueue state.
>    See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>
> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> +  stop the device.
> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>  \end{description}
>
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> --
> 2.25.1
>
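The six-step save sequence enumerated in the patch can be modeled end to end with a toy device (the toy acknowledges STOP synchronously and exposes the per-queue fields directly; real transports access them through transport-specific registers, and step 5 depends on the device type):

```c
#include <stdint.h>

#define VIRTIO_STATUS_STOP 32u

struct toy_vq  { uint16_t avail_state, used_state; };
struct toy_dev { uint8_t status; struct toy_vq vq; };
struct vq_save { uint16_t avail_state, used_state; };

/* Illustrative driver-side save flow following the enumerated steps. */
static void save_vq_state(struct toy_dev *d, struct vq_save *out)
{
    d->status |= VIRTIO_STATUS_STOP;            /* 1. set STOP             */
    while (!(d->status & VIRTIO_STATUS_STOP))   /* 2. re-read until set    */
        ;
    out->avail_state = d->vq.avail_state;       /* 3. save available state */
    out->used_state  = d->vq.used_state;        /* 4. save used state      */
    /* 5. device-specific virtqueue state would be read here, if any */
    d->status = 0;                              /* 6. reset the device     */
}
```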



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-09 17:23       ` Eugenio Perez Martin
@ 2021-07-10 20:36         ` Michael S. Tsirkin
  2021-07-12  4:00           ` Jason Wang
  2021-07-12  3:53         ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Michael S. Tsirkin @ 2021-07-10 20:36 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Jason Wang, Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Stefan Hajnoczi, Max Gurtovoy, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > >   If I understand correctly, this is all
> > > driven from the driver inside the guest, so for this to work
> > > the guest must be running and already have initialised the driver.
> >
> >
> > Yes.
> >
> 
> As I see it, the feature can be driven entirely by the VMM as long as
> it intercept the relevant configuration space (PCI, MMIO, etc) from
> guest's reads and writes, and present it as coherent and transparent
> for the guest. Some use cases I can imagine with a physical device (or
> vp_vpda device) with VIRTIO_F_STOP:
> 
> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> the device, so any write to this flag is an error/undefined.
> 2) The VMM passes the flag to the guest. The guest can stop the device.
> 2.1) The VMM stops the device to perform a live migration, and the
> guest does not write to STOP in any moment of the LM. It resets the
> destination device with the state, and then initializes the device.
> 2.2) The guest stops the device and, when STOP(32) is set, the source
> VMM migrates the device status. The destination VMM realizes the bit,
> so it sets the bit in the destination too after device initialization.
> 2.3) The device is not initialized by the guest so it doesn't matter
> what bit has the HW, but the VM can be migrated.
> 
> Am I missing something?
> 
> Thanks!

It's doable like this. It's all a lot of hoops to jump through though.
It's also not easy for devices to implement.
Why don't we design the feature in a way that is usable by VMMs
and implementable by devices in a simple way?

-- 
MST



* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
                     ` (2 preceding siblings ...)
  2021-07-09 17:35   ` Eugenio Perez Martin
@ 2021-07-10 20:40   ` Michael S. Tsirkin
  2021-07-12  4:04     ` Jason Wang
  3 siblings, 1 reply; 115+ messages in thread
From: Michael S. Tsirkin @ 2021-07-10 20:40 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, virtio-dev, stefanha, mgurtovoy, cohuck,
	eperezma, oren, shahafs, parav, bodong, amikheev, pasic

On Tue, Jul 06, 2021 at 12:33:34PM +0800, Jason Wang wrote:
> This patch introduces a new status bit STOP. This will be
> used by the driver to stop the device in order to safely fetch the
> device state (virtqueue state) from the device.
> 
> This is a must for supporting migration.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

So I feel that if the point is to build a feature that is useful
beyond live migration, then this kind of misses the mark. For example,
the serial device has wanted a way to take back buffers added to a
given VQ for a while now. If you are working on a way for the *driver*
to stop the VQs, then let us make it useful for this as well ...


> ---
>  content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 66 insertions(+), 3 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 8877b6f..284ead0 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>    drive the device.
>  
> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
> +  device has been stopped by the driver. This status bit is different
> +  from the reset since the device state is preserved.
> +
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
>  \end{description}
> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
> +DRIVER_OK is not set.
> +
> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
> +STOP, the driver MUST re-read the device status to ensure the STOP bit
> +is set to synchronize with the device.
> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  The device MUST initialize \field{device status} to 0 upon reset.
>  
>  The device MUST NOT consume buffers or send any used buffer
>  notifications to the driver before DRIVER_OK.
>  
> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
> +write of STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
> +the device MUST finish any pending operations like in flight requests
> +or have its device specific way for driver to save the pending
> +operations like in flight requests before setting the STOP status bit.
> +
> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
> +buffers or send any used buffer notifications to the driver after
> +STOP. The device MUST keep the configuration space unchanged and MUST
> +NOT send configuration space change notification to the driver after
> +STOP.
> +
> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
> +preserve all the necessary state (the virtqueue states with the
> +possible device specific states) that is required for restoring in the
> +future.
> +
>  \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>  \begin{itemize}
>  \item A driver MUST NOT set the virtqueue state before setting the
>    FEATURE_OK status bit.
> -\item A driver MUST NOT set the virtqueue state after setting the
> -  DRIVER_OK status bit.
> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
> +  bit is set without STOP status bit.
>  \end{itemize}
>  
>  \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>  \item A device SHOULD ignore the write to the virtqueue state if the
>  FEATURE_OK status bit is not set.
>  \item A device SHOULD ignore the write to the virtqueue state if the
> -DRIVER_OK status bit is set.
> +DRIVER_OK status bit is set without STOP status bit.
>  \end{itemize}
>  
>  If VIRTIO_F_RING_STATE has been negotiated, a device MAY have its
> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>  
>  Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>  
> +\section{Virtqueue State Saving}
> +
> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated, a
> +driver MAY save the internal virtqueue state.
> +
> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
> +
> +Assuming the device is 'live', the driver MUST follow this sequence to
> +stop the device and save the virtqueue state:
> +
> +\begin{enumerate}
> +\item Set the STOP status bit.
> +
> +\item Re-read \field{device status} until the STOP bit is set to
> +  synchronize with the device.
> +
> +\item Read \field{available state} and save it.
> +
> +\item Read \field{used state} and save it.
> +
> +\item Read device specific virtqueue states if needed.
> +
> +\item Reset the device.
> +\end{enumerate}
> +
> +The driver MAY perform device specific steps to save device specific sate.
> +
> +The driver MAY 'resume' the device by redoing the device initialization
> +with the saved virtqueue state. See \ref{sec:General Initialization and Device Operation / Device Initialization}
> +
>  \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>  
>  Virtio can use various different buses, thus the standard is split
> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    can set and get the device internal virtqueue state.
>    See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>  
> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
> +  stop the device.
> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.25.1



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-09 17:23       ` Eugenio Perez Martin
  2021-07-10 20:36         ` Michael S. Tsirkin
@ 2021-07-12  3:53         ` Jason Wang
  1 sibling, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-12  3:53 UTC (permalink / raw)
  To: Eugenio Perez Martin, Dr. David Alan Gilbert
  Cc: Michael Tsirkin, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Max Gurtovoy, Cornelia Huck, Oren Duer, Shahaf Shuler,
	Parav Pandit, Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/10 1:23 AM, Eugenio Perez Martin wrote:
>>>    If I understand correctly, this is all
>>> driven from the driver inside the guest, so for this to work
>>> the guest must be running and already have initialised the driver.
>>
>> Yes.
>>
> As I see it, the feature can be driven entirely by the VMM as long as
> it intercept the relevant configuration space (PCI, MMIO, etc) from
> guest's reads and writes, and present it as coherent and transparent
> for the guest. Some use cases I can imagine with a physical device (or
> vp_vpda device) with VIRTIO_F_STOP:


It's basically for nested live migration (e.g. presenting VIRTIO_F_STOP
to the L1 guest and above).


>
> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> the device, so any write to this flag is an error/undefined.
> 2) The VMM passes the flag to the guest. The guest can stop the device.
> 2.1) The VMM stops the device to perform a live migration, and the
> guest does not write to STOP in any moment of the LM. It resets the
> destination device with the state, and then initializes the device.
> 2.2) The guest stops the device and, when STOP(32) is set, the source
> VMM migrates the device status. The destination VMM realizes the bit,
> so it sets the bit in the destination too after device initialization.
> 2.3) The device is not initialized by the guest so it doesn't matter
> what bit has the HW, but the VM can be migrated.
>
> Am I missing something?


Something like this. Note that in any case we should not let the L(x)
guest access the device status bits of L(x-1) directly. The VMM in
L(x-1) is in charge of shadowing the device status (DRIVER_OK and STOP)
correctly. If we want to migrate the L(x) guest, the VMM in L(x-1) needs
to migrate the device status correctly and restore it on the destination:

1) The device status bit in L1 is implemented by L0 VMM

2) The device status bit in L2 is implemented by L1 VMM

3) If we need to migrate L1, we just migrate the L1 device status and 
restore it

4) If we need to migrate L2, we just migrate the L2 device status and 
restore it

E.g. the logic of the L0 VMM (source) is:

if (L1_status & DRIVER_OK) && !(L1_status & STOP):
    set STOP in L0_status;
    wait until STOP is set in L0_status;
save L1 state (status) and send it to the destination

The logic of the L0 VMM (dest) is:

get L1 state (status)
if L1_status & DRIVER_OK:
    perform the L0 device initialization (e.g. features based on L1 state)
    propagate L1_status (DRIVER_OK|STOP) to the L0 device

So:

For 2.1, STOP is set by the source VMM but not restored by the
destination since it's not a state of the L1 device.

For 2.2, STOP is 'set' by L1 and shadowed by the VMM, and it is restored
by the destination since it is a state of the L1 device.

For 2.3, STOP is not set by the source and not restored by the
destination since it's not a state of the L1 device.
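The source/destination logic above can be modeled as a toy sketch (not QEMU code; only the DRIVER_OK and STOP bit values come from the spec, the device model and function names are made up for illustration):

```python
# Toy model of the status shadowing described above. DRIVER_OK and STOP
# values come from the spec; everything else here is hypothetical.
DRIVER_OK = 4
STOP = 32

class FakeDevice:
    """Stand-in for the L0 device; it stops synchronously."""
    def __init__(self):
        self.status = 0
        self.initialized = False
    def read_status(self):
        return self.status
    def write_status(self, s):
        self.status = s
    def initialize(self):
        self.initialized = True

def source_save(l1_status, l0_dev):
    """L0 VMM (source): stop the L0 device if the L1 driver is live,
    then send the L1-visible status (not the L0 STOP bit) over."""
    if (l1_status & DRIVER_OK) and not (l1_status & STOP):
        l0_dev.write_status(l0_dev.read_status() | STOP)
        while not (l0_dev.read_status() & STOP):
            pass  # synchronize with the device
    return l1_status

def dest_restore(l1_status, l0_dev):
    """L0 VMM (destination): re-initialize, then propagate L1 status."""
    if l1_status & DRIVER_OK:
        l0_dev.initialize()  # features, vq state, etc. from saved state
        l0_dev.write_status(l1_status & (DRIVER_OK | STOP))

# Case 2.1: the guest never set STOP itself.
src = FakeDevice()
src.status = DRIVER_OK
saved = source_save(DRIVER_OK, src)
assert src.status & STOP           # the L0 device was stopped
assert not (saved & STOP)          # but STOP is not part of L1 state
dst = FakeDevice()
dest_restore(saved, dst)
assert dst.initialized and dst.status == DRIVER_OK
```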

Thanks



>
> Thanks!
>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-10 20:36         ` Michael S. Tsirkin
@ 2021-07-12  4:00           ` Jason Wang
  2021-07-12  9:57             ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-12  4:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eugenio Perez Martin
  Cc: Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Stefan Hajnoczi, Max Gurtovoy, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>    If I understand correctly, this is all
>>>> driven from the driver inside the guest, so for this to work
>>>> the guest must be running and already have initialised the driver.
>>>
>>> Yes.
>>>
>> As I see it, the feature can be driven entirely by the VMM as long as
>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>> guest's reads and writes, and present it as coherent and transparent
>> for the guest. Some use cases I can imagine with a physical device (or
>> vp_vpda device) with VIRTIO_F_STOP:
>>
>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>> the device, so any write to this flag is an error/undefined.
>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>> 2.1) The VMM stops the device to perform a live migration, and the
>> guest does not write to STOP in any moment of the LM. It resets the
>> destination device with the state, and then initializes the device.
>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>> VMM migrates the device status. The destination VMM realizes the bit,
>> so it sets the bit in the destination too after device initialization.
>> 2.3) The device is not initialized by the guest so it doesn't matter
>> what bit has the HW, but the VM can be migrated.
>>
>> Am I missing something?
>>
>> Thanks!
> It's doable like this. It's all a lot of hoops to jump through though.
> It's also not easy for devices to implement.


It just requires a new status bit. Is there anything that makes you
think it's hard to implement?

E.g. for a networking device, it should be sufficient to use this bit
plus the virtqueue state.
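For a network device the whole save path is then short. A sketch against a hypothetical register interface (the accessor names and device model are made up; the step order follows the Virtqueue State Saving sequence, with the device-specific step omitted):

```python
# Sketch of the stop-and-save sequence for a device exposing per-queue
# state registers. Register names are hypothetical.
DRIVER_OK = 4
STOP = 32

def stop_and_save(dev):
    dev.write_status(dev.read_status() | STOP)   # 1. set STOP
    while not (dev.read_status() & STOP):        # 2. re-read to sync
        pass
    state = []
    for vq in range(dev.num_queues):
        state.append({
            "avail": dev.read_avail_state(vq),   # 3. last_avail_idx etc.
            "used": dev.read_used_state(vq),     # 4. used_idx (split vq)
        })
    # (step 5, device-specific virtqueue state, skipped in this sketch)
    dev.reset()                                  # 6. reset the device
    return state

class FakeNetDevice:
    num_queues = 2
    def __init__(self):
        self.status = DRIVER_OK
        self.avail = [5, 7]
        self.used = [5, 6]
        self.was_reset = False
    def read_status(self): return self.status
    def write_status(self, s): self.status = s
    def read_avail_state(self, vq): return self.avail[vq]
    def read_used_state(self, vq): return self.used[vq]
    def reset(self): self.status = 0; self.was_reset = True

dev = FakeNetDevice()
state = stop_and_save(dev)
assert state == [{"avail": 5, "used": 5}, {"avail": 7, "used": 6}]
assert dev.was_reset
```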


> Why don't we design the feature in a way that is useable by VMMs
> and implementable by devices in a simple way?


It uses common techniques like register shadowing, without any further
machinery.

Or do you have any other ideas?

(I think we all know migration will be very hard if we simply pass 
through those state registers).

Thanks


>



* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-10 20:40   ` Michael S. Tsirkin
@ 2021-07-12  4:04     ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-12  4:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, virtio-dev, stefanha, mgurtovoy, cohuck,
	eperezma, oren, shahafs, parav, bodong, amikheev, pasic


On 2021/7/11 4:40 AM, Michael S. Tsirkin wrote:
> On Tue, Jul 06, 2021 at 12:33:34PM +0800, Jason Wang wrote:
>> This patch introduces a new status bit STOP. This will be
>> used by the driver to stop the device in order to safely fetch the
>> device state (virtqueue state) from the device.
>>
>> This is a must for supporting migration.
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> So I feel that if the point is to be able to build a
> feature that is useful beyond live migration then this
> kind of misses the mark. For example the serial device
> for a while now wanted a way to take back buffers
> added to a given VQ. If you are working on a way for*driver*
> to stop the VQs then let us make it useful for this ...


I'm a little bit confused.

If it's not for migration and not for the VMM to track the buffers, the
driver itself should have sufficient knowledge of the buffers that were
consumed but not marked used.

What is missing here is just a way to "stop" or "freeze" the device.
With this status bit, the driver can use its own knowledge to take back
buffers, since we know it is safe (the device won't use those buffers
any more).
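That reclaim path could be sketched as follows (a toy model, assuming a hypothetical register interface and driver-side bookkeeping; only the STOP bit value comes from the spec):

```python
# Sketch of driver-side buffer reclaim after STOP: once STOP reads back,
# the device may no longer use any buffer, so everything the driver still
# tracks as in flight can safely be taken back.
STOP = 32

def reclaim_after_stop(dev, inflight):
    """`inflight` maps descriptor id -> buffer (driver bookkeeping)."""
    dev.write_status(dev.read_status() | STOP)
    while not (dev.read_status() & STOP):
        pass  # synchronize: device has quiesced once STOP reads back
    reclaimed = list(inflight.values())
    inflight.clear()
    return reclaimed

class FakeDevice:
    def __init__(self): self.status = 0
    def read_status(self): return self.status
    def write_status(self, s): self.status = s

dev = FakeDevice()
inflight = {1: b"req-a", 3: b"req-b"}
bufs = reclaim_after_stop(dev, inflight)
assert bufs == [b"req-a", b"req-b"] and not inflight
```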

Thanks


>
>



* Re: [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-09 17:35   ` Eugenio Perez Martin
@ 2021-07-12  4:06     ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-12  4:06 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael Tsirkin, virtio-comment, Virtio-Dev, Stefan Hajnoczi,
	Max Gurtovoy, Cornelia Huck, Oren Duer, Shahaf Shuler,
	Parav Pandit, Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/10 1:35 AM, Eugenio Perez Martin wrote:
> On Tue, Jul 6, 2021 at 6:34 AM Jason Wang <jasowang@redhat.com> wrote:
>> This patch introduces a new status bit STOP. This will be
>> used by the driver to stop the device in order to safely fetch the
>> device state (virtqueue state) from the device.
>>
>> This is a must for supporting migration.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>   content.tex | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 66 insertions(+), 3 deletions(-)
>>
>> diff --git a/content.tex b/content.tex
>> index 8877b6f..284ead0 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -47,6 +47,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   \item[DRIVER_OK (4)] Indicates that the driver is set up and ready to
>>     drive the device.
>>
>> +\item[STOP (32)] When VIRTIO_F_STOP is negotiated, indicates that the
>> +  device has been stopped by the driver. This status bit is different
>> +  from the reset since the device state is preserved.
>> +
>>   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>>     an error from which it can't recover.
>>   \end{description}
>> @@ -70,12 +74,38 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   recover by issuing a reset.
>>   \end{note}
>>
>> +If VIRTIO_F_STOP has been negotiated, the driver MUST NOT set STOP if
>> +DRIVER_OK is not set.
>> +
>> +If VIRTIO_F_STOP has been negotiated, to stop a device, after setting
>> +STOP, the driver MUST re-read the device status to ensure the STOP bit
>> +is set to synchronize with the device.
>> +
>>   \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>>   The device MUST initialize \field{device status} to 0 upon reset.
>>
>>   The device MUST NOT consume buffers or send any used buffer
>>   notifications to the driver before DRIVER_OK.
>>
>> +If VIRTIO_F_STOP has not been negotiated, the device MUST ignore the
>> +write of STOP.
>> +
>> +If VIRTIO_F_STOP has been negotiated, after the driver writes STOP,
>> +the device MUST finish any pending operations like in flight requests
>> +or have its device specific way for driver to save the pending
>> +operations like in flight requests before setting the STOP status bit.
>> +
>> +If VIRTIO_F_STOP has been negotiated, the device MUST NOT consume
>> +buffers or send any used buffer notifications to the driver after
>> +STOP. The device MUST keep the configuration space unchanged and MUST
>> +NOT send configuration space change notification to the driver after
>> +STOP.
>> +
>> +If VIRTIO_F_STOP has been negotiated, after STOP, the device MUST
>> +preserve all the necessary state (the virtqueue states with the
>> +possible device specific states) that is required for restoring in the
>> +future.
>> +
>>   \label{sec:Basic Facilities of a Virtio Device / Device Status Field / DEVICENEEDSRESET}The device SHOULD set DEVICE_NEEDS_RESET when it enters an error state
>>   that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>>   MUST send a device configuration change notification to the driver.
>> @@ -474,8 +504,8 @@ \subsection{\field{Used State} Field}
>>   \begin{itemize}
>>   \item A driver MUST NOT set the virtqueue state before setting the
>>     FEATURE_OK status bit.
>> -\item A driver MUST NOT set the virtqueue state after setting the
>> -  DRIVER_OK status bit.
>> +\item A driver MUST NOT set the virtqueue state if DRIVER_OK status
>> +  bit is set without STOP status bit.
>>   \end{itemize}
>>
>>   \devicenormative{\subsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Virtqueue State}
>> @@ -488,7 +518,7 @@ \subsection{\field{Used State} Field}
>>   \item A device SHOULD ignore the write to the virtqueue state if the
>>   FEATURE_OK status bit is not set.
>>   \item A device SHOULD ignore the write to the virtqueue state if the
>> -DRIVER_OK status bit is set.
>> +DRIVER_OK status bit is set without STOP status bit.
>>   \end{itemize}
>>
>>   If VIRTIO_F_RING_STATE has been negotiated, a device MAY have its
>> @@ -623,6 +653,36 @@ \section{Device Cleanup}\label{sec:General Initialization And Device Operation /
>>
>>   Thus a driver MUST ensure a virtqueue isn't live (by device reset) before removing exposed buffers.
>>
>> +\section{Virtqueue State Saving}
>> +
>> +If both VIRTIO_F_RING_STATE and VIRTIO_F_STOP have been negotiated, a
>> +driver MAY save the internal virtqueue state.
>> +
>> +\drivernormative{\subsection}{Virtqueue State Saving}{General Initialization And Device Operation / Virtqueue State Saving}
>> +
>> +Assuming the device is 'live', the driver MUST follow this sequence to
>> +stop the device and save the virtqueue state:
>> +
>> +\begin{enumerate}
>> +\item Set the STOP status bit.
>> +
>> +\item Re-read \field{device status} until the STOP bit is set to
>> +  synchronize with the device.
>> +
>> +\item Read \field{available state} and save it.
>> +
>> +\item Read \field{used state} and save it.
>> +
>> +\item Read device specific virtqueue states if needed.
>> +
>> +\item Reset the device.
> Maybe I'm being too nitpicky here,


Nope, any comments are welcomed.


> but the next user of the device
> must reset anyway to start using it (as "driver initialization"
> states). Maybe it's better to remove this "reset the device" from the
> enumeration and ... [1]


That's fine since the first step for "driver initialization" is also a 
reset.

Let me think more about this.


>
>> +\end{enumerate}
>> +
>> +The driver MAY perform device specific steps to save device specific sate.
> s/sate/state/


Will fix.


>
>> +
>> +The driver MAY 'resume' the device by redoing the device initialization
>> +with the saved virtqueue state. See \ref{sec:General Initialization and Device Operation / Device Initialization}
> [1] just leave this clarification? If we keep the reset in the enumeration,
> it seems that the reset step is a MUST to save the virtqueue state.


Yes.

Thanks


>
>> +
>>   \chapter{Virtio Transport Options}\label{sec:Virtio Transport Options}
>>
>>   Virtio can use various different buses, thus the standard is split
>> @@ -6713,6 +6773,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>     can set and get the device internal virtqueue state.
>>     See \ref{sec:Virtqueues / Virtqueue State}~\nameref{sec:Virtqueues / Virtqueue State}.
>>
>> +  \item[VIRTIO_F_STOP(41)] This feature indicates that the driver can
>> +  stop the device.
>> +  See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}
>>   \end{description}
>>
>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>> --
>> 2.25.1
>>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-12  4:00           ` Jason Wang
@ 2021-07-12  9:57             ` Stefan Hajnoczi
  2021-07-13  3:27               ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-12  9:57 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> 
> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > >    If I understand correctly, this is all
> > > > > driven from the driver inside the guest, so for this to work
> > > > > the guest must be running and already have initialised the driver.
> > > > 
> > > > Yes.
> > > > 
> > > As I see it, the feature can be driven entirely by the VMM as long as
> > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > guest's reads and writes, and present it as coherent and transparent
> > > for the guest. Some use cases I can imagine with a physical device (or
> > > vp_vpda device) with VIRTIO_F_STOP:
> > > 
> > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > the device, so any write to this flag is an error/undefined.
> > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > 2.1) The VMM stops the device to perform a live migration, and the
> > > guest does not write to STOP in any moment of the LM. It resets the
> > > destination device with the state, and then initializes the device.
> > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > VMM migrates the device status. The destination VMM realizes the bit,
> > > so it sets the bit in the destination too after device initialization.
> > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > what bit has the HW, but the VM can be migrated.
> > > 
> > > Am I missing something?
> > > 
> > > Thanks!
> > It's doable like this. It's all a lot of hoops to jump through though.
> > It's also not easy for devices to implement.
> 
> 
> It just requires a new status bit. Is there anything that makes you
> think it's hard to implement?
> 
> E.g. for a networking device, it should be sufficient to use this bit
> plus the virtqueue state.
> 
> 
> > Why don't we design the feature in a way that is useable by VMMs
> > and implementable by devices in a simple way?
> 
> 
> It uses common techniques like register shadowing, without any further
> machinery.
> 
> Or do you have any other ideas?
> 
> (I think we all know migration will be very hard if we simply pass through
> those state registers).

If an admin virtqueue is used instead of the STOP Device Status field
bit then there's no need to re-read the Device Status field in a loop
until the device has stopped.

When migrating a guest with many VIRTIO devices a busy waiting approach
extends downtime if implemented sequentially (stopping one device at a
time). It can be implemented concurrently (setting the STOP bit on all
devices and then looping until all their Device Status fields have the
bit set), but this becomes more complex to implement.
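The concurrent variant could be sketched like this (a toy model, not VMM code; the fake devices here simply latch STOP after a few status reads to stand in for real stop latency):

```python
# Toy model of the concurrent approach: write STOP to every device
# first, then poll them all, so per-device stop latencies overlap
# instead of adding up.
STOP = 32

def stop_all(devices):
    for dev in devices:
        dev.write_status(dev.read_status() | STOP)
    pending = set(devices)
    while pending:
        pending = {d for d in pending if not (d.read_status() & STOP)}

class SlowDevice:
    def __init__(self, delay):
        self.status = 0
        self._delay = delay          # status reads before STOP latches
        self._stop_requested = False
    def write_status(self, s):
        self._stop_requested = bool(s & STOP)
        self.status = s & ~STOP      # STOP only latches once quiesced
    def read_status(self):
        if self._stop_requested and self._delay > 0:
            self._delay -= 1
        elif self._stop_requested:
            self.status |= STOP
        return self.status

devs = [SlowDevice(d) for d in (0, 2, 5)]
stop_all(devs)
assert all(d.status & STOP for d in devs)
```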

I'm a little worried about adding a new bit that requires busy
waiting...

Stefan



* Re: [PATCH V2 0/2] Vitqueue State Synchronization
  2021-07-06  4:33 [PATCH V2 0/2] Vitqueue State Synchronization Jason Wang
  2021-07-06  4:33 ` [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility Jason Wang
  2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
@ 2021-07-12 10:12 ` Stefan Hajnoczi
  2021-07-13  3:08   ` Jason Wang
  2 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-12 10:12 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, virtio-comment, virtio-dev, mgurtovoy, cohuck, eperezma,
	oren, shahafs, parav, bodong, amikheev, pasic


On Tue, Jul 06, 2021 at 12:33:32PM +0800, Jason Wang wrote:
> Hi All:
> 
> This is an updated version to implement virtqueue state
> synchronization which is a must for the migration support.
> 
> The first patch introduces virtqueue states as a new basic facility of
> the virtio device. This is used by the driver to save and restore
> virtqueue state. The states were split into available state and used
> state to ease the transport specific implementation. It is also
> allowed for the device to have its own device specific way to save and
> resotre extra virtqueue states like in flight request.
> 
> The second patch introduce a new status bit STOP. This bit is used for
> the driver to stop the device. The major difference from reset is that
> STOP must preserve all the virtqueue state plus the device state.
> 
> A driver can then:
> 
> - Get the virtqueue state if STOP status bit is set
> - Set the virtqueue state after FEATURE_OK but before DRIVER_OK
> 
> Device specific state synchronization could be built on top.

Will you send a proof-of-concept implementation to demonstrate how it
works in practice?

You mentioned being able to migrate virtio-net devices using this
interface, but what about state like VIRTIO_NET_S_LINK_UP that is either
per-device or associated with a non-rx/tx virtqueue?

Basically I'm not sure if the scope of this is just to migrate state
associated with offloaded virtqueues (vDPA, VFIO/mdev, etc) or if it's
really supposed to migrate the entire device?

Do you have an approach in mind for saving/loading device-specific
state? Here are devices and their state:
- virtio-blk: a list of requests that the destination device can
  re-submit
- virtio-scsi: a list of requests that the destination device can
  re-submit
- virtio-serial: active ports, including the current buffer being
  transferred
- virtio-net: MAC address, status, etc

Stefan



* Re: [PATCH V2 0/2] Vitqueue State Synchronization
  2021-07-12 10:12 ` [PATCH V2 0/2] Vitqueue State Synchronization Stefan Hajnoczi
@ 2021-07-13  3:08   ` Jason Wang
  2021-07-13 10:30     ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-13  3:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: mst, virtio-comment, virtio-dev, mgurtovoy, cohuck, eperezma,
	oren, shahafs, parav, bodong, amikheev, pasic


On 2021/7/12 6:12 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 06, 2021 at 12:33:32PM +0800, Jason Wang wrote:
>> Hi All:
>>
>> This is an updated version to implement virtqueue state
>> synchronization which is a must for the migration support.
>>
>> The first patch introduces virtqueue states as a new basic facility of
>> the virtio device. This is used by the driver to save and restore
>> virtqueue state. The states were split into available state and used
>> state to ease the transport specific implementation. It is also
>> allowed for the device to have its own device specific way to save and
>> resotre extra virtqueue states like in flight request.
>>
>> The second patch introduce a new status bit STOP. This bit is used for
>> the driver to stop the device. The major difference from reset is that
>> STOP must preserve all the virtqueue state plus the device state.
>>
>> A driver can then:
>>
>> - Get the virtqueue state if STOP status bit is set
>> - Set the virtqueue state after FEATURE_OK but before DRIVER_OK
>>
>> Device specific state synchronization could be built on top.
> Will you send a proof-of-concept implementation to demonstrate how it
> works in practice?


Eugenio has implemented a prototype for this. (Note that the code was
for a previous version of the proposal, but it's sufficient to
demonstrate how it works.)

https://www.mail-archive.com/qemu-devel@nongnu.org/msg809332.html

https://www.mail-archive.com/qemu-devel@nongnu.org/msg809335.html


>
> You mentioned being able to migrate virtio-net devices using this
> interface, but what about state like VIRTIO_NET_S_LINK_UP that is either
> per-device or associated with a non-rx/tx virtqueue?


Note that the config space will be maintained by Qemu, so Qemu can
choose to emulate link down by simply not setting DRIVER_OK on the device.


>
> Basically I'm not sure if the scope of this is just to migrate state
> associated with offloaded virtqueues (vDPA, VFIO/mdev, etc) or if it's
> really supposed to migrate the entire device?


As the subject says, it's the virtqueue state, not the device state. The
series tries to introduce the minimal set of facilities that could be
used to migrate a network device.


>
> Do you have an approach in mind for saving/loading device-specific
> state? Here are devices and their state:
> - virtio-blk: a list of requests that the destination device can
>    re-submit
> - virtio-scsi: a list of requests that the destination device can
>    re-submit
> - virtio-serial: active ports, including the current buffer being
>    transferred


Actually, there are two types of additional state:

- pending (or inflight) buffers: we can introduce a transport specific
way to specify an auxiliary page which is used to store the inflight
descriptors (as vhost-user did)
- other device state: this needs to be done in a device specific way,
and it would be hard to generalize
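A toy model of the first type, the inflight buffer tracking (the interface here is made up; only the record-and-resubmit semantics mirror vhost-user's inflight I/O tracking):

```python
# Toy model of the auxiliary inflight region: the device records each
# descriptor when it starts processing it and clears the record when the
# buffer is marked used, so whatever survives a stop is exactly the set
# of requests the destination must re-submit.
class InflightLog:
    def __init__(self):
        self._inflight = set()
    def fetched(self, desc_id):      # device starts processing a request
        self._inflight.add(desc_id)
    def used(self, desc_id):         # request completed and marked used
        self._inflight.discard(desc_id)
    def snapshot(self):              # what the auxiliary page would carry
        return sorted(self._inflight)

log = InflightLog()
for d in (1, 2, 3):
    log.fetched(d)
log.used(2)
assert log.snapshot() == [1, 3]  # re-submit descriptors 1 and 3 on resume
```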


> - virtio-net: MAC address, status, etc


So the VMM will intercept all the control commands, which means we don't
need to query any state that is changed via those commands.

E.g. Qemu is in charge of shadowing the control virtqueue, so we don't
even need an interface to query any of the states that are set via the
control virtqueue.
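A toy model of that control-virtqueue shadowing (the command encoding and interface are hypothetical, not virtio-net's actual CVQ format):

```python
# Toy model of control-virtqueue shadowing: the VMM records the latest
# command per (class, cmd) pair before forwarding it, so state set via
# the CVQ (MAC, RX mode, ...) can be replayed on the destination without
# any query interface.
class ShadowControlVQ:
    def __init__(self, forward):
        self._forward = forward      # callable into the real device
        self._log = []
    def submit(self, cls, cmd, data):
        self._forward(cls, cmd, data)
        # keep only the most recent setting per command
        self._log = [e for e in self._log if (e[0], e[1]) != (cls, cmd)]
        self._log.append((cls, cmd, data))
    def replay(self, forward):
        for cls, cmd, data in self._log:
            forward(cls, cmd, data)

sent, replayed = [], []
cvq = ShadowControlVQ(lambda *a: sent.append(a))
cvq.submit("MAC", "SET", "aa:bb:cc:dd:ee:01")
cvq.submit("MAC", "SET", "aa:bb:cc:dd:ee:02")  # supersedes the first
cvq.submit("RX", "PROMISC", True)
cvq.replay(lambda *a: replayed.append(a))
assert replayed == [("MAC", "SET", "aa:bb:cc:dd:ee:02"),
                    ("RX", "PROMISC", True)]
```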

But all that device state handling is out of the scope of this proposal.

I can see one possible gap: people may think the migration facility is
designed for the simple passthrough that Linux provides, i.e. that the
device is assigned 'entirely' to the guest. This is not the case for
live migration; some kind of mediation must be done in the middle.

And that's the work of the VMM through vDPA + Qemu: intercepting control
commands but not the datapath.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-12  9:57             ` Stefan Hajnoczi
@ 2021-07-13  3:27               ` Jason Wang
  2021-07-13  8:19                 ` Cornelia Huck
  2021-07-13 10:00                 ` Stefan Hajnoczi
  0 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-13  3:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>     If I understand correctly, this is all
>>>>>> driven from the driver inside the guest, so for this to work
>>>>>> the guest must be running and already have initialised the driver.
>>>>> Yes.
>>>>>
>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>>>> guest's reads and writes, and present it as coherent and transparent
>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>
>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>> the device, so any write to this flag is an error/undefined.
>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>> destination device with the state, and then initializes the device.
>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>> so it sets the bit in the destination too after device initialization.
>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>> what bit has the HW, but the VM can be migrated.
>>>>
>>>> Am I missing something?
>>>>
>>>> Thanks!
>>> It's doable like this. It's all a lot of hoops to jump through though.
>>> It's also not easy for devices to implement.
>>
>> It just requires a new status bit. Anything that makes you think it's hard
>> to implement?
>>
>> E.g for networking device, it should be sufficient to use this bit + the
>> virtqueue state.
>>
>>
>>> Why don't we design the feature in a way that is useable by VMMs
>>> and implementable by devices in a simple way?
>>
>> It use the common technology like register shadowing without any further
>> stuffs.
>>
>> Or do you have any other ideas?
>>
>> (I think we all know migration will be very hard if we simply pass through
>> those state registers).
> If an admin virtqueue is used instead of the STOP Device Status field
> bit then there's no need to re-read the Device Status field in a loop
> until the device has stopped.


Probably not. Let me clarify several points:

- This proposal has nothing to do with the admin virtqueue. Actually, the 
admin virtqueue could be used for carrying any basic device facility like 
the status bits. E.g. I'm going to post patches that use the admin 
virtqueue as a "transport" for device slicing at the virtio level.
- Even if we had introduced an admin virtqueue, we would still need a per 
function interface for this. This is a must for nested virtualization; 
we can't always expect things like the PF to be assignable to the L1 guest.
- According to the proposal, there's no need for the device to complete 
all the consumed buffers; the device can choose to expose the inflight 
descriptors in a device specific way and then set the STOP bit. This means 
that, given a device specific in-flight descriptor reporting facility, 
the device can set the STOP bit almost immediately.
- If we don't go with a basic device facility but use an admin virtqueue 
specific method instead, we still need to clarify how it works with the 
device status state machine; it would be some kind of sub-state, which 
looks much more complicated than the current proposal.
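The driver-visible flow being proposed can be sketched as below. This is 
only an illustration: a mock register stands in for the transport's real 
device_status access, the bit value 32 for STOP follows what was mentioned 
earlier in this thread but is not final, and the drain countdown merely 
simulates a device quiescing its DMA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Proposed STOP status bit; the value 32 is taken from this thread
 * and is not yet fixed by the spec. */
#define VIRTIO_CONFIG_S_STOP 32u

/* Mock transport state: a real driver would read/write the transport's
 * device_status register instead of these variables. */
static bool    stop_requested;
static uint8_t mock_status;
static int     inflight = 3;   /* pretend the device is draining work */

static void write_device_status(uint8_t val)
{
    if (val & VIRTIO_CONFIG_S_STOP)
        stop_requested = true;
    /* The device reflects STOP on reads only once it has quiesced. */
    mock_status = val & (uint8_t)~VIRTIO_CONFIG_S_STOP;
}

static uint8_t read_device_status(void)
{
    if (stop_requested) {
        if (inflight > 0)
            inflight--;        /* one unit of work drains per poll */
        else
            mock_status |= VIRTIO_CONFIG_S_STOP;
    }
    return mock_status;
}

/* Driver-side stop: request STOP, then re-read device_status until the
 * device has set it; only then is the virtqueue state safe to read. */
static bool stop_device(int max_polls)
{
    write_device_status(read_device_status() | VIRTIO_CONFIG_S_STOP);
    while (max_polls-- > 0) {
        if (read_device_status() & VIRTIO_CONFIG_S_STOP)
            return true;
    }
    return false;
}
```

With a device that exposes its in-flight descriptors separately, the 
drain phase above collapses and the bit can be reported almost at once, 
which is the point being made.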


>
> When migrating a guest with many VIRTIO devices a busy waiting approach
> extends downtime if implemented sequentially (stopping one device at a
> time).


Well, you need some kind of waiting for sure; the device/DMA needs some 
time to be stopped. The downtime is determined by the specific virtio 
implementation, which is hard to restrict at the spec level. We could 
clarify that the device must set the STOP bit within e.g. 100ms.


>   It can be implemented concurrently (setting the STOP bit on all
> devices and then looping until all their Device Status fields have the
> bit set), but this becomes more complex to implement.


I still don't see what kind of complexity you are worried about here.


>
> I'm a little worried about adding a new bit that requires busy
> waiting...


Busy waiting is not something introduced by this patch:

4.1.4.3.2 Driver Requirements: Common configuration structure layout

After writing 0 to device_status, the driver MUST wait for a read of 
device_status to return 0 before reinitializing the device.

Since it was already required for at least one transport, we need to do 
something similar when introducing this basic facility.
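The existing requirement quoted above already amounts to a wait loop of 
the following shape. A sketch only: the mock register and the two-cycle 
countdown stand in for a real transport register and for however long the 
device actually needs to complete reset.

```c
#include <assert.h>
#include <stdint.h>

/* Mock register: a real driver accesses the transport's device_status.
 * The countdown models the time the device needs to finish the reset. */
static int     busy        = 2;
static uint8_t mock_status = 0x0f;   /* some pre-reset status value */

static void write_device_status(uint8_t val)
{
    (void)val;                       /* only 0 (reset) is modelled here */
}

static uint8_t read_device_status(void)
{
    if (busy > 0) {
        busy--;
        return mock_status;          /* reset not complete yet */
    }
    return 0;                        /* device reports reset done */
}

/* Per 4.1.4.3.2: after writing 0, wait until a read returns 0.
 * Returns the number of non-zero reads observed while waiting. */
static int reset_device(void)
{
    int polls = 0;
    write_device_status(0);
    while (read_device_status() != 0)
        polls++;
    return polls;
}
```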

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13  3:27               ` Jason Wang
@ 2021-07-13  8:19                 ` Cornelia Huck
  2021-07-13  9:13                   ` Jason Wang
  2021-07-13 10:00                 ` Stefan Hajnoczi
  1 sibling, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-13  8:19 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:

> 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
>> When migrating a guest with many VIRTIO devices a busy waiting approach
>> extends downtime if implemented sequentially (stopping one device at a
>> time).
>
>
> Well. You need some kinds of waiting for sure, the device/DMA needs 
> sometime to be stopped. The downtime is determined by a specific virtio 
> implementation which is hard to be restricted at the spec level. We can 
> clarify that the device must set the STOP bit in e.g 100ms.

I don't think we can introduce arbitrary upper bounds here. At most, we
can say that the device SHOULD try to set the STOP bit as early as
possible (and make use of the mechanism to expose in-flight buffers.)

If we want to avoid polling for the STOP bit, we need some kind of
notification mechanism, I guess. For ccw, I'd just use a channel
command to stop the device; completion of that channel program would
indicate that the device is done with the stop procedure. Not sure how
well that translates to other transports.


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13  8:19                 ` Cornelia Huck
@ 2021-07-13  9:13                   ` Jason Wang
  2021-07-13 11:31                     ` Cornelia Huck
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-13  9:13 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


在 2021/7/13 下午4:19, Cornelia Huck 写道:
> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>> extends downtime if implemented sequentially (stopping one device at a
>>> time).
>>
>> Well. You need some kinds of waiting for sure, the device/DMA needs
>> sometime to be stopped. The downtime is determined by a specific virtio
>> implementation which is hard to be restricted at the spec level. We can
>> clarify that the device must set the STOP bit in e.g 100ms.
> I don't think we can introduce arbitrary upper bounds here. At most, we
> can say that the device SHOULD try to set the STOP bit as early as
> possible (and make use of the mechanism to expose in-flight buffers.)


Yes, that's my understanding.


>
> If we want to avoid polling for the STOP bit, we need some kind of
> notification mechanism, I guess. For ccw, I'd just use a channel
> command to stop the device; completion of that channel program would
> indicate that the device is done with the stop procedure.


A question: is an interrupt used for such a notification, or can the VMM 
choose to poll for the completion?


> Not sure how
> well that translates to other transports.


Actually, it's not necessarily busy polling. The VMM can schedule another 
process in and recheck the bit periodically.

Or, as you mentioned before, we could use some kind of interrupt, but that 
would be more complicated than a simple status bit. It's better to 
introduce an interrupt only if the status bit doesn't fit.
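The periodic-recheck approach could look roughly like this on the VMM 
side. Again a sketch: the mock register and the fixed drain count are 
stand-ins, nanosleep stands in for "schedule another process in", and the 
bit value 32 for STOP is the one floated in this thread, not a final 
assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define VIRTIO_CONFIG_S_STOP 32u   /* value from this thread, not final */

static int drain = 3;              /* mock: device still quiescing */
static int other_work_done;        /* counts work done between checks */

/* Mock read: the device reports STOP only once it has quiesced. */
static uint8_t read_device_status(void)
{
    if (drain > 0) {
        drain--;
        return 0;
    }
    return VIRTIO_CONFIG_S_STOP;
}

/* Recheck STOP periodically, yielding between checks instead of
 * spinning; a real VMM would run its event loop where the counter
 * increment is. */
static bool wait_for_stop(int max_checks)
{
    struct timespec ts = { 0, 1000000 };   /* 1 ms between checks */
    while (max_checks-- > 0) {
        if (read_device_status() & VIRTIO_CONFIG_S_STOP)
            return true;
        other_work_done++;                 /* placeholder for real work */
        nanosleep(&ts, NULL);
    }
    return false;
}
```

Stopping many devices concurrently then reduces to setting STOP on each 
device first and running one such loop over the whole set.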

Thanks


>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13  3:27               ` Jason Wang
  2021-07-13  8:19                 ` Cornelia Huck
@ 2021-07-13 10:00                 ` Stefan Hajnoczi
  2021-07-13 12:16                   ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-13 10:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 6657 bytes --]

On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> 
> 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
> > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > 在 2021/7/11 上午4:36, Michael S. Tsirkin 写道:
> > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > >     If I understand correctly, this is all
> > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > the guest must be running and already have initialised the driver.
> > > > > > Yes.
> > > > > > 
> > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > 
> > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > the device, so any write to this flag is an error/undefined.
> > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > destination device with the state, and then initializes the device.
> > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > so it sets the bit in the destination too after device initialization.
> > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > what bit has the HW, but the VM can be migrated.
> > > > > 
> > > > > Am I missing something?
> > > > > 
> > > > > Thanks!
> > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > It's also not easy for devices to implement.
> > > 
> > > It just requires a new status bit. Anything that makes you think it's hard
> > > to implement?
> > > 
> > > E.g for networking device, it should be sufficient to use this bit + the
> > > virtqueue state.
> > > 
> > > 
> > > > Why don't we design the feature in a way that is useable by VMMs
> > > > and implementable by devices in a simple way?
> > > 
> > > It use the common technology like register shadowing without any further
> > > stuffs.
> > > 
> > > Or do you have any other ideas?
> > > 
> > > (I think we all know migration will be very hard if we simply pass through
> > > those state registers).
> > If an admin virtqueue is used instead of the STOP Device Status field
> > bit then there's no need to re-read the Device Status field in a loop
> > until the device has stopped.
> 
> 
> Probably not. Let me to clarify several points:
> 
> - This proposal has nothing to do with admin virtqueue. Actually, admin
> virtqueue could be used for carrying any basic device facility like status
> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> for device slicing at virtio level.
> - Even if we had introduced admin virtqueue, we still need a per function
> interface for this. This is a must for nested virtualization, we can't
> always expect things like PF can be assigned to L1 guest.
> - According to the proposal, there's no need for the device to complete all
> the consumed buffers, device can choose to expose those inflight descriptors
> in a device specific way and set the STOP bit. This means, if we have the
> device specific in-flight descriptor reporting facility, the device can
> almost set the STOP bit immediately.
> - If we don't go with the basic device facility but using the admin
> virtqueue specific method, we still need to clarify how it works with the
> device status state machine, it will be some kind of sub-states which looks
> much more complicated than the current proposal.
> 
> 
> > 
> > When migrating a guest with many VIRTIO devices a busy waiting approach
> > extends downtime if implemented sequentially (stopping one device at a
> > time).
> 
> 
> Well. You need some kinds of waiting for sure, the device/DMA needs sometime
> to be stopped. The downtime is determined by a specific virtio
> implementation which is hard to be restricted at the spec level. We can
> clarify that the device must set the STOP bit in e.g 100ms.
> 
> 
> >   It can be implemented concurrently (setting the STOP bit on all
> > devices and then looping until all their Device Status fields have the
> > bit set), but this becomes more complex to implement.
> 
> 
> I still don't get what kind of complexity did you worry here.
> 
> 
> > 
> > I'm a little worried about adding a new bit that requires busy
> > waiting...
> 
> 
> Busy wait is not something that is introduced in this patch:
> 
> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> 
> After writing 0 to device_status, the driver MUST wait for a read of
> device_status to return 0 before reinitializing the device.
> 
> Since it was required for at least one transport. We need do something
> similar to when introducing basic facility.

Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
spec change. I like that.

On the other hand, devices need time to stop and that time can be
unbounded. For example, software virtio-blk/scsi implementations
cannot immediately cancel in-flight I/O requests on Linux hosts.

The natural interface for long-running operations is virtqueue requests.
That's why I mentioned the alternative of using an admin virtqueue
instead of a Device Status bit.

Although you mentioned that the stopped state needs to be reflected in
the Device Status field somehow, I'm not sure about that since the
driver typically doesn't need to know whether the device is being
migrated. In fact, the VMM would need to hide this bit and it's safer to
keep it out-of-band instead of risking exposing it by accident.

In addition, stateful devices need to load/save non-trivial amounts of
data. They need DMA to do this efficiently, so an admin virtqueue is a
good fit again. This isn't addressed in this patch series, but it's the
next step and I think it's worth planning for it.

If all devices could stop very quickly and were stateless then I would
agree that the STOP bit is an ideal solution. I think it will be
necessary to support devices that don't behave like that, so the admin
virtqueue approach seems worth exploring.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 0/2] Vitqueue State Synchronization
  2021-07-13  3:08   ` Jason Wang
@ 2021-07-13 10:30     ` Stefan Hajnoczi
  2021-07-13 11:56       ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-13 10:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, virtio-comment, virtio-dev, mgurtovoy, cohuck, eperezma,
	oren, shahafs, parav, bodong, amikheev, pasic

[-- Attachment #1: Type: text/plain, Size: 4842 bytes --]

On Tue, Jul 13, 2021 at 11:08:28AM +0800, Jason Wang wrote:
> 
> 在 2021/7/12 下午6:12, Stefan Hajnoczi 写道:
> > On Tue, Jul 06, 2021 at 12:33:32PM +0800, Jason Wang wrote:
> > > Hi All:
> > > 
> > > This is an updated version to implement virtqueue state
> > > synchronization which is a must for the migration support.
> > > 
> > > The first patch introduces virtqueue states as a new basic facility of
> > > the virtio device. This is used by the driver to save and restore
> > > virtqueue state. The states were split into available state and used
> > > state to ease the transport specific implementation. It is also
> > > allowed for the device to have its own device specific way to save and
> > > resotre extra virtqueue states like in flight request.
> > > 
> > > The second patch introduce a new status bit STOP. This bit is used for
> > > the driver to stop the device. The major difference from reset is that
> > > STOP must preserve all the virtqueue state plus the device state.
> > > 
> > > A driver can then:
> > > 
> > > - Get the virtqueue state if STOP status bit is set
> > > - Set the virtqueue state after FEATURE_OK but before DRIVER_OK
> > > 
> > > Device specific state synchronization could be built on top.
> > Will you send a proof-of-concept implementation to demonstrate how it
> > works in practice?
> 
> 
> Eugenio has implemented a prototype for this. (Note that the codes was for
> previous version of the proposal, but it's sufficient to demonstrate how it
> works).
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg809332.html
> 
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg809335.html
> 
> 
> > 
> > You mentioned being able to migrate virtio-net devices using this
> > interface, but what about state like VIRTIO_NET_S_LINK_UP that is either
> > per-device or associated with a non-rx/tx virtqueue?
> 
> 
> Note that the config space will be maintained by Qemu. So Qemu can choose to
> emulate link down by simply don't set DRIVER_OK to the device.
> 
> 
> > 
> > Basically I'm not sure if the scope of this is just to migrate state
> > associated with offloaded virtqueues (vDPA, VFIO/mdev, etc) or if it's
> > really supposed to migrate the entire device?
> 
> 
> As the subject, it's the virtqueue state not the device state. The series
> tries to introduce the minimal sets of functions that could be used to
> migrate the network device.
>
> 
> 
> > 
> > Do you have an approach in mind for saving/loading device-specific
> > state? Here are devices and their state:
> > - virtio-blk: a list of requests that the destination device can
> >    re-submit
> > - virtio-scsi: a list of requests that the destination device can
> >    re-submit
> > - virtio-serial: active ports, including the current buffer being
> >    transferred
> 
> 
> Actually, we had two types of additional states:
> 
> - pending (or inflight) buffers, we can introduce a transport specific way
> to specify the auxiliary page which is used to stored the inflight
> descriptors (as what vhost-user did)
> - other device states, this needs to be done via a device specific way, and
> it would be hard to generalize them
> 
> 
> > - virtio-net: MAC address, status, etc
> 
> 
> So VMM will intercept all the control commands, that means we don't need to
> query any states that is changed via those control commands.
> 
> E.g The Qemu is in charge of shadowing control virtqueue, so we don't even
> need to interface to query any of those states that is set via control
> virtqueue.
> 
> But all those device state stuffs is out of the scope of this proposal.
> 
> I can see one of the possible gap is that people may think the migration
> facility is designed for the simple passthrough that Linux provides, that
> means the device is assigend 'entirely' to the guest. This is not case for
> the case of live migration, some kind of mediation must be done in the
> middle.
> 
> And that's the work of VMM through vDPA + Qemu: intercepting control command
> but not datapath.

I thought this was a more general migration mechanism that passthrough
devices could use. Thanks for explaining. Maybe this can be made clearer
in the spec - it's not a full save/load mechanism, it can only be used
in conjunction with another component that is aware of the device's
state.

There is a gap between this approach and VFIO's migration interface. It
appears to be impossible to write a VFIO/mdev or vfio-user device that
passes a physical virtio-pci device through to the guest with migration
support. The reason is because VIRTIO lacks an interface to save/load
device (not virtqueue) state. I guess it will be added sooner or later,
it's similar to what Max Gurtovoy recently proposed.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13  9:13                   ` Jason Wang
@ 2021-07-13 11:31                     ` Cornelia Huck
  2021-07-13 12:23                       ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-13 11:31 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:

> 在 2021/7/13 下午4:19, Cornelia Huck 写道:
>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>
>>> 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>> extends downtime if implemented sequentially (stopping one device at a
>>>> time).
>>>
>>> Well. You need some kinds of waiting for sure, the device/DMA needs
>>> sometime to be stopped. The downtime is determined by a specific virtio
>>> implementation which is hard to be restricted at the spec level. We can
>>> clarify that the device must set the STOP bit in e.g 100ms.
>> I don't think we can introduce arbitrary upper bounds here. At most, we
>> can say that the device SHOULD try to set the STOP bit as early as
>> possible (and make use of the mechanism to expose in-flight buffers.)
>
>
> Yes, that's my understanding.
>
>
>>
>> If we want to avoid polling for the STOP bit, we need some kind of
>> notification mechanism, I guess. For ccw, I'd just use a channel
>> command to stop the device; completion of that channel program would
>> indicate that the device is done with the stop procedure.
>
>
> A question, is interrupt used for such notification, or the VMM can 
> choose to poll for the completion?

You can poll for the subchannel to become status pending.

>
>
>> Not sure how
>> well that translates to other transports.
>
>
> Actually, it's not necessarily a busy polling. VMM can schedule other 
> process in and recheck the bit periodically.
>
> Or as you mentioned before, we can use some kind of interrupt but it 
> would be more complicated than the simple status bit. It's better to 
> introduce the interrupt only if the status bit doesn't fit.

At least for ccw, waiting for the status bit to be set also involves an
interrupt or polling (we use another channel program to retrieve the
status.) A dedicated channel command would actually be better, as the
interrupt/status pending would already inform us of success.




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [PATCH V2 0/2] Vitqueue State Synchronization
  2021-07-13 10:30     ` Stefan Hajnoczi
@ 2021-07-13 11:56       ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-13 11:56 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: mst, virtio-comment, virtio-dev, mgurtovoy, cohuck, eperezma,
	oren, shahafs, parav, bodong, amikheev, pasic


在 2021/7/13 下午6:30, Stefan Hajnoczi 写道:
> On Tue, Jul 13, 2021 at 11:08:28AM +0800, Jason Wang wrote:
>> 在 2021/7/12 下午6:12, Stefan Hajnoczi 写道:
>>> On Tue, Jul 06, 2021 at 12:33:32PM +0800, Jason Wang wrote:
>>>> Hi All:
>>>>
>>>> This is an updated version to implement virtqueue state
>>>> synchronization which is a must for the migration support.
>>>>
>>>> The first patch introduces virtqueue states as a new basic facility of
>>>> the virtio device. This is used by the driver to save and restore
>>>> virtqueue state. The states were split into available state and used
>>>> state to ease the transport specific implementation. It is also
>>>> allowed for the device to have its own device specific way to save and
>>>> resotre extra virtqueue states like in flight request.
>>>>
>>>> The second patch introduce a new status bit STOP. This bit is used for
>>>> the driver to stop the device. The major difference from reset is that
>>>> STOP must preserve all the virtqueue state plus the device state.
>>>>
>>>> A driver can then:
>>>>
>>>> - Get the virtqueue state if STOP status bit is set
>>>> - Set the virtqueue state after FEATURE_OK but before DRIVER_OK
>>>>
>>>> Device specific state synchronization could be built on top.
>>> Will you send a proof-of-concept implementation to demonstrate how it
>>> works in practice?
>>
>> Eugenio has implemented a prototype for this. (Note that the codes was for
>> previous version of the proposal, but it's sufficient to demonstrate how it
>> works).
>>
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg809332.html
>>
>> https://www.mail-archive.com/qemu-devel@nongnu.org/msg809335.html
>>
>>
>>> You mentioned being able to migrate virtio-net devices using this
>>> interface, but what about state like VIRTIO_NET_S_LINK_UP that is either
>>> per-device or associated with a non-rx/tx virtqueue?
>>
>> Note that the config space will be maintained by Qemu. So Qemu can choose to
>> emulate link down by simply don't set DRIVER_OK to the device.
>>
>>
>>> Basically I'm not sure if the scope of this is just to migrate state
>>> associated with offloaded virtqueues (vDPA, VFIO/mdev, etc) or if it's
>>> really supposed to migrate the entire device?
>>
>> As the subject, it's the virtqueue state not the device state. The series
>> tries to introduce the minimal sets of functions that could be used to
>> migrate the network device.
>>
>>
>>
>>> Do you have an approach in mind for saving/loading device-specific
>>> state? Here are devices and their state:
>>> - virtio-blk: a list of requests that the destination device can
>>>     re-submit
>>> - virtio-scsi: a list of requests that the destination device can
>>>     re-submit
>>> - virtio-serial: active ports, including the current buffer being
>>>     transferred
>>
>> Actually, we had two types of additional states:
>>
>> - pending (or inflight) buffers, we can introduce a transport specific way
>> to specify the auxiliary page which is used to stored the inflight
>> descriptors (as what vhost-user did)
>> - other device states, this needs to be done via a device specific way, and
>> it would be hard to generalize them
>>
>>
>>> - virtio-net: MAC address, status, etc
>>
>> So VMM will intercept all the control commands, that means we don't need to
>> query any states that is changed via those control commands.
>>
>> E.g The Qemu is in charge of shadowing control virtqueue, so we don't even
>> need to interface to query any of those states that is set via control
>> virtqueue.
>>
>> But all those device state stuffs is out of the scope of this proposal.
>>
>> I can see one of the possible gap is that people may think the migration
>> facility is designed for the simple passthrough that Linux provides, that
>> means the device is assigend 'entirely' to the guest. This is not case for
>> the case of live migration, some kind of mediation must be done in the
>> middle.
>>
>> And that's the work of VMM through vDPA + Qemu: intercepting control command
>> but not datapath.
> I thought this was a more general migration mechanism that passthrough
> devices could use. Thanks for explaining. Maybe this can be made clearer
> in the spec - it's not a full save/load mechanism, it can only be used
> in conjunction with another component that is aware of the device's
> state.


Yes, and actually this should be the suggested way of migrating virtio 
devices.

The advantage is obvious: by leveraging the mature virtio/vhost software 
stack, we don't need to care much about things like migration 
compatibility.


>
> There is a gap between this approach and VFIO's migration interface. It
> appears to be impossible to write a VFIO/mdev or vfio-user device that
> passes a physical virtio-pci device through to the guest with migration
> support.


I think mediation (mdev) is a must for supporting live migration in this 
case, even for VFIO. If you simply assign the device to the guest, the 
VMM loses all control over the device.

And what's more important, virtio is not PCI specific, so it can work 
where VFIO cannot:

1) A physical device that doesn't use PCI as its transport
2) A guest that doesn't use PCI or doesn't even have PCI

That's the rationale for introducing all of these as basic facilities 
first. Then we can let each transport implement them in whatever way is 
comfortable for that transport (admin virtqueue or capabilities).


> The reason is because VIRTIO lacks an interface to save/load
> device (not virtqueue) state. I guess it will be added sooner or later,
> it's similar to what Max Gurtovoy recently proposed.


So my understanding is:

1) Each device should define its own state that needs to be migrated

Then, we can define

2) How to design the device interface

The admin virtqueue is a solution for 2) but not 1). And an obvious 
drawback of the admin virtqueue is that it's not easy to use in a nested 
environment, where you still need a per function interface.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13 10:00                 ` Stefan Hajnoczi
@ 2021-07-13 12:16                   ` Jason Wang
  2021-07-14  9:53                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-13 12:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>      If I understand correctly, this is all
>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>> Yes.
>>>>>>>
>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>>>>>> guest's reads and writes, and present it as coherent and transparent
>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>
>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>> destination device with the state, and then initializes the device.
>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>
>>>>>> Am I missing something?
>>>>>>
>>>>>> Thanks!
>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>> It's also not easy for devices to implement.
>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>> to implement?
>>>>
>>>> E.g for networking device, it should be sufficient to use this bit + the
>>>> virtqueue state.
>>>>
>>>>
>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>> and implementable by devices in a simple way?
>>>> It use the common technology like register shadowing without any further
>>>> stuffs.
>>>>
>>>> Or do you have any other ideas?
>>>>
>>>> (I think we all know migration will be very hard if we simply pass through
>>>> those state registers).
>>> If an admin virtqueue is used instead of the STOP Device Status field
>>> bit then there's no need to re-read the Device Status field in a loop
>>> until the device has stopped.
>>
>> Probably not. Let me to clarify several points:
>>
>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>> virtqueue could be used for carrying any basic device facility like status
>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>> for device slicing at virtio level.
>> - Even if we had introduced admin virtqueue, we still need a per function
>> interface for this. This is a must for nested virtualization, we can't
>> always expect things like PF can be assigned to L1 guest.
>> - According to the proposal, there's no need for the device to complete all
>> the consumed buffers, device can choose to expose those inflight descriptors
>> in a device specific way and set the STOP bit. This means, if we have the
>> device specific in-flight descriptor reporting facility, the device can
>> almost set the STOP bit immediately.
>> - If we don't go with the basic device facility but using the admin
>> virtqueue specific method, we still need to clarify how it works with the
>> device status state machine, it will be some kind of sub-states which looks
>> much more complicated than the current proposal.
>>
>>
>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>> extends downtime if implemented sequentially (stopping one device at a
>>> time).
>>
>> Well. You need some kinds of waiting for sure, the device/DMA needs sometime
>> to be stopped. The downtime is determined by a specific virtio
>> implementation which is hard to be restricted at the spec level. We can
>> clarify that the device must set the STOP bit in e.g 100ms.
>>
>>
>>>    It can be implemented concurrently (setting the STOP bit on all
>>> devices and then looping until all their Device Status fields have the
>>> bit set), but this becomes more complex to implement.
>>
>> I still don't get what kind of complexity did you worry here.
>>
>>
>>> I'm a little worried about adding a new bit that requires busy
>>> waiting...
>>
>> Busy wait is not something that is introduced in this patch:
>>
>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>
>> After writing 0 to device_status, the driver MUST wait for a read of
>> device_status to return 0 before reinitializing the device.
>>
>> Since it was required for at least one transport. We need do something
>> similar to when introducing basic facility.
> Adding the STOP but as a Device Status bit is a small and clean VIRTIO
> spec change. I like that.
>
> On the other hand, devices need time to stop and that time can be
> unbounded. For example, software virtio-blk/scsi implementations since
> cannot immediately cancel in-flight I/O requests on Linux hosts.
>
> The natural interface for long-running operations is virtqueue requests.
> That's why I mentioned the alternative of using an admin virtqueue
> instead of a Device Status bit.


So I'm not against the admin virtqueue. As said before, the admin 
virtqueue could be used to carry the device status bit:

Send a command that sets the STOP status bit over the admin virtqueue. 
The device marks the command buffer used only after it has successfully 
stopped.

AFAIK, the two are not mutually exclusive, since they solve different 
problems:

Device status - a basic device facility

Admin virtqueue - a transport/device specific way to implement (part of) 
that facility
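As a concrete sketch of that command flow (every name and value below is hypothetical; the spec defines no admin virtqueue command set yet), the device side could look like:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Hypothetical admin virtqueue command -- nothing here is in the spec. */
#define VIRTIO_ADMIN_CMD_STOP  1

struct virtio_admin_cmd {
    uint16_t opcode;   /* e.g. VIRTIO_ADMIN_CMD_STOP */
    uint16_t vdev_id;  /* target function, for a per-function interface */
    uint8_t  status;   /* written by the device on completion */
};

static bool device_stopped;

/* Device side: the command buffer is marked used (status written)
 * only after the device has actually stopped, so completion of the
 * command itself signals that the stop took effect. */
static void device_handle_admin_cmd(struct virtio_admin_cmd *cmd)
{
    if (cmd->opcode == VIRTIO_ADMIN_CMD_STOP) {
        device_stopped = true;  /* quiesce DMA, expose in-flight descriptors */
        cmd->status = 0;        /* success */
    }
}
```

The point of the sketch is only that "command completes" and "device is stopped" become the same event, which is what makes the admin virtqueue a natural carrier for a long-running stop.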

>
> Although you mentioned that the stopped state needs to be reflected in
> the Device Status field somehow, I'm not sure about that since the
> driver typically doesn't need to know whether the device is being
> migrated.


The guest won't see the real device status bit; the VMM shadows the 
device status in this case.

E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost 
device, and the guest is unaware of the migration.

The STOP status bit is set by QEMU on the real virtio hardware, but the 
guest only ever sees DRIVER_OK without STOP.

It's not hard to implement nesting on top of this; see the discussion 
initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for 
nested live migration.
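A minimal sketch of that shadowing (assuming STOP is status bit value 32, as proposed in this series; the accessor name is made up):

```c
#include <stdint.h>
#include <assert.h>

#define VIRTIO_CONFIG_S_DRIVER_OK  4
#define VIRTIO_CONFIG_S_STOP       32  /* value proposed in this series */

/* Real device status, as seen by the VMM (e.g. via vhost-vDPA). */
static uint8_t hw_status;

/* Guest-visible read: the VMM masks out STOP, so even while QEMU has
 * stopped the real hardware for migration, the guest keeps seeing
 * DRIVER_OK without STOP. */
static uint8_t vmm_shadow_status_read(void)
{
    return hw_status & ~VIRTIO_CONFIG_S_STOP;
}
```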


>   In fact, the VMM would need to hide this bit and it's safer to
> keep it out-of-band instead of risking exposing it by accident.


See above: the VMM may choose to hide or expose the capability. Exposing 
it is useful for migrating a nested guest.

If we design an interface that can't be used in a nested environment, 
it's not an ideal interface.


>
> In addition, stateful devices need to load/save non-trivial amounts of
> data. They need DMA to do this efficiently, so an admin virtqueue is a
> good fit again.


I don't get the point here. You would still need to address exactly the 
same issues for the admin virtqueue: the unbounded time to freeze the 
device, and the interaction with the virtio device status state machine.

And with the admin virtqueue it's actually far more complicated: e.g. 
you need to define how to synchronize concurrent access to the basic 
facilities.


>   This isn't addressed in this patch series, but it's the
> next step and I think it's worth planning for it.


I agree, but for the admin virtqueue, it's better to use it as a full 
transport instead of just using it to carry part of the basic device 
facilities. Actually, as I said, I have patches that do that, although 
the motivation is device slicing rather than live migration. I will post 
an RFC before the KVM Forum this year (since I'm going to talk about 
device slicing at the virtio level). It does not conflict with Max's 
proposal, since the migration part is not there.


>
> If all devices could stop very quickly and were stateless then I would
> agree that the STOP bit is an ideal solution.


Note that Max's proposal also has something similar: the "quiescence" 
and "freezed" states. It doesn't differ fundamentally from the STOP bit. 
As Max suggested, we could introduce more status bits if necessary, or 
even consider unifying Max's proposal with mine.


> I think it will be
> necessary to support devices that don't behave like that, so the admin
> virtqueue approach seems worth exploring.


Yes, and as mentioned in another thread, I think the best way is to 
define the device-specific state first and then consider how to 
implement the interface.

The admin virtqueue is worth exploring, but it should not be the only 
method. Devices/transports are free to implement this in many ways based 
on the actual hardware.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13 11:31                     ` Cornelia Huck
@ 2021-07-13 12:23                       ` Jason Wang
  2021-07-13 12:28                         ` Cornelia Huck
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-13 12:23 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/13 7:31 PM, Cornelia Huck wrote:
> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2021/7/13 4:19 PM, Cornelia Huck wrote:
>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>> time).
>>>> Well. You need some kinds of waiting for sure, the device/DMA needs
>>>> sometime to be stopped. The downtime is determined by a specific virtio
>>>> implementation which is hard to be restricted at the spec level. We can
>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>> I don't think we can introduce arbitrary upper bounds here. At most, we
>>> can say that the device SHOULD try to set the STOP bit as early as
>>> possible (and make use of the mechanism to expose in-flight buffers.)
>>
>> Yes, that's my understanding.
>>
>>
>>> If we want to avoid polling for the STOP bit, we need some kind of
>>> notification mechanism, I guess. For ccw, I'd just use a channel
>>> command to stop the device; completion of that channel program would
>>> indicate that the device is done with the stop procedure.
>>
>> A question, is interrupt used for such notification, or the VMM can
>> choose to poll for the completion?
> You can poll for the subchannel to become status pending.
>
>>
>>> Not sure how
>>> well that translates to other transports.
>>
>> Actually, it's not necessarily a busy polling. VMM can schedule other
>> process in and recheck the bit periodically.
>>
>> Or as you mentioned before, we can use some kind of interrupt but it
>> would be more complicated than the simple status bit. It's better to
>> introduce the interrupt only if the status bit doesn't fit.
> At least for ccw, waiting for the status bit to be set also involves an
> interrupt or polling (we use another channel program to retrieve the
> status.) A dedicated channel command would actually be better, as the
> interrupt/status pending would already inform us of success.


So it looks to me that this doesn't conflict with the design: the device 
must wait until it has actually stopped before signaling success of the 
ccw command?

Thanks


>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13 12:23                       ` Jason Wang
@ 2021-07-13 12:28                         ` Cornelia Huck
  2021-07-14  2:47                           ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-13 12:28 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:

> On 2021/7/13 7:31 PM, Cornelia Huck wrote:
>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>
>>> On 2021/7/13 4:19 PM, Cornelia Huck wrote:
>>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>> time).
>>>>> Well. You need some kinds of waiting for sure, the device/DMA needs
>>>>> sometime to be stopped. The downtime is determined by a specific virtio
>>>>> implementation which is hard to be restricted at the spec level. We can
>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>> I don't think we can introduce arbitrary upper bounds here. At most, we
>>>> can say that the device SHOULD try to set the STOP bit as early as
>>>> possible (and make use of the mechanism to expose in-flight buffers.)
>>>
>>> Yes, that's my understanding.
>>>
>>>
>>>> If we want to avoid polling for the STOP bit, we need some kind of
>>>> notification mechanism, I guess. For ccw, I'd just use a channel
>>>> command to stop the device; completion of that channel program would
>>>> indicate that the device is done with the stop procedure.
>>>
>>> A question, is interrupt used for such notification, or the VMM can
>>> choose to poll for the completion?
>> You can poll for the subchannel to become status pending.
>>
>>>
>>>> Not sure how
>>>> well that translates to other transports.
>>>
>>> Actually, it's not necessarily a busy polling. VMM can schedule other
>>> process in and recheck the bit periodically.
>>>
>>> Or as you mentioned before, we can use some kind of interrupt but it
>>> would be more complicated than the simple status bit. It's better to
>>> introduce the interrupt only if the status bit doesn't fit.
>> At least for ccw, waiting for the status bit to be set also involves an
>> interrupt or polling (we use another channel program to retrieve the
>> status.) A dedicated channel command would actually be better, as the
>> interrupt/status pending would already inform us of success.
>
>
> So it looks to me it doesn't conflict with this design: the device must 
> wait for the device to be stopped to signal the success of the ccw command?

Yes, the difference is mainly how that information can be extracted.





* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13 12:28                         ` Cornelia Huck
@ 2021-07-14  2:47                           ` Jason Wang
  2021-07-14  6:20                             ` Cornelia Huck
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-14  2:47 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/13 8:28 PM, Cornelia Huck wrote:
> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2021/7/13 7:31 PM, Cornelia Huck wrote:
>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> On 2021/7/13 4:19 PM, Cornelia Huck wrote:
>>>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>> time).
>>>>>> Well. You need some kinds of waiting for sure, the device/DMA needs
>>>>>> sometime to be stopped. The downtime is determined by a specific virtio
>>>>>> implementation which is hard to be restricted at the spec level. We can
>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>> I don't think we can introduce arbitrary upper bounds here. At most, we
>>>>> can say that the device SHOULD try to set the STOP bit as early as
>>>>> possible (and make use of the mechanism to expose in-flight buffers.)
>>>> Yes, that's my understanding.
>>>>
>>>>
>>>>> If we want to avoid polling for the STOP bit, we need some kind of
>>>>> notification mechanism, I guess. For ccw, I'd just use a channel
>>>>> command to stop the device; completion of that channel program would
>>>>> indicate that the device is done with the stop procedure.
>>>> A question, is interrupt used for such notification, or the VMM can
>>>> choose to poll for the completion?
>>> You can poll for the subchannel to become status pending.
>>>
>>>>> Not sure how
>>>>> well that translates to other transports.
>>>> Actually, it's not necessarily a busy polling. VMM can schedule other
>>>> process in and recheck the bit periodically.
>>>>
>>>> Or as you mentioned before, we can use some kind of interrupt but it
>>>> would be more complicated than the simple status bit. It's better to
>>>> introduce the interrupt only if the status bit doesn't fit.
>>> At least for ccw, waiting for the status bit to be set also involves an
>>> interrupt or polling (we use another channel program to retrieve the
>>> status.) A dedicated channel command would actually be better, as the
>>> interrupt/status pending would already inform us of success.
>>
>> So it looks to me it doesn't conflict with this design: the device must
>> wait for the device to be stopped to signal the success of the ccw command?
> Yes, the difference is mainly how that information can be extracted.


So I had a look at how reset is described for ccw:

"In order to reset a device, a driver sends the CCW_CMD_VDEV_RESET 
command."

This implies something similar: the success of the command means the 
success of the reset.

I wonder whether I can remove the "re-read" from the basic facility and 
let each transport decide what to do:

- for PCI, if a register is used, we need the re-read
- for CCW, following the current implication, the re-read is not needed 
and we can wait/poll for the success of the ccw command
- for an admin virtqueue, it would be something similar to ccw: 
wait/poll for the success of the admin virtqueue command
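For the PCI case, the re-read amounts to a loop like the following (a sketch against a simulated status register; `read_status`/`write_status` stand in for the transport-specific accessors, and the STOP value of 32 is the one proposed in this series):

```c
#include <stdint.h>
#include <assert.h>

#define VIRTIO_CONFIG_S_STOP 32  /* value proposed in this series */

/* Simulated device status register standing in for the PCI common
 * configuration structure; here the device latches STOP immediately,
 * while real hardware presents it only once it has actually stopped. */
static uint8_t device_status;

static void write_status(uint8_t v) { device_status = v; }
static uint8_t read_status(void)    { return device_status; }

/* PCI flavour: request the stop, then re-read until the device
 * presents the bit.  A real VMM would yield/reschedule here rather
 * than busy-spin. */
static void stop_device_pci(void)
{
    write_status(read_status() | VIRTIO_CONFIG_S_STOP);
    while (!(read_status() & VIRTIO_CONFIG_S_STOP))
        ;
}
```

For ccw or an admin virtqueue the loop disappears: the completion of the stop command itself plays the role of the successful re-read.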

Thanks

>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14  2:47                           ` Jason Wang
@ 2021-07-14  6:20                             ` Cornelia Huck
  2021-07-14  8:53                               ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-14  6:20 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Wed, Jul 14 2021, Jason Wang <jasowang@redhat.com> wrote:

> On 2021/7/13 8:28 PM, Cornelia Huck wrote:
>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>
>>> On 2021/7/13 7:31 PM, Cornelia Huck wrote:
>>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>>> On 2021/7/13 4:19 PM, Cornelia Huck wrote:
>>>>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>
>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>>> time).
>>>>>>> Well. You need some kinds of waiting for sure, the device/DMA needs
>>>>>>> sometime to be stopped. The downtime is determined by a specific virtio
>>>>>>> implementation which is hard to be restricted at the spec level. We can
>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>> I don't think we can introduce arbitrary upper bounds here. At most, we
>>>>>> can say that the device SHOULD try to set the STOP bit as early as
>>>>>> possible (and make use of the mechanism to expose in-flight buffers.)
>>>>> Yes, that's my understanding.
>>>>>
>>>>>
>>>>>> If we want to avoid polling for the STOP bit, we need some kind of
>>>>>> notification mechanism, I guess. For ccw, I'd just use a channel
>>>>>> command to stop the device; completion of that channel program would
>>>>>> indicate that the device is done with the stop procedure.
>>>>> A question, is interrupt used for such notification, or the VMM can
>>>>> choose to poll for the completion?
>>>> You can poll for the subchannel to become status pending.
>>>>
>>>>>> Not sure how
>>>>>> well that translates to other transports.
>>>>> Actually, it's not necessarily a busy polling. VMM can schedule other
>>>>> process in and recheck the bit periodically.
>>>>>
>>>>> Or as you mentioned before, we can use some kind of interrupt but it
>>>>> would be more complicated than the simple status bit. It's better to
>>>>> introduce the interrupt only if the status bit doesn't fit.
>>>> At least for ccw, waiting for the status bit to be set also involves an
>>>> interrupt or polling (we use another channel program to retrieve the
>>>> status.) A dedicated channel command would actually be better, as the
>>>> interrupt/status pending would already inform us of success.
>>>
>>> So it looks to me it doesn't conflict with this design: the device must
>>> wait for the device to be stopped to signal the success of the ccw command?
>> Yes, the difference is mainly how that information can be extracted.
>
>
> So I had a look at how reset is described for ccw:
>
> "
>
> In order to reset a device, a driver sends the CCW_CMD_VDEV_RESET command.
>
> "
>
> This implies something similar, that is the success of the command means 
> the success of the reset.

Yes, indeed.

>
> I wonder maybe I can remove the "re-read" from the basic facility and 
> let the transport to decide what to do.
>
> - for PCI, if a registers is used, we need re-read
> - for CCW, follow the current implication, re-read is not needed and we 
> can wait/poll for the success of the ccw command

If we are going with a status bit, it would be the same as for pci (we
have WRITE_STATUS and READ_STATUS commands.) If we are going with a
distinct command, we can skip the re-read. (I'd probably go with a more
generic 'trigger an action' meta-command, but that would work just the
same.)

> - for admin virtqueue, it should be something similar to ccw, wait/poll 
> for the success of the admin virtqueue command

Or maybe we should standardize on the admin virtqueue? That seems less
confusing to me.





* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14  6:20                             ` Cornelia Huck
@ 2021-07-14  8:53                               ` Jason Wang
  2021-07-14  9:24                                 ` [virtio-dev] " Cornelia Huck
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-14  8:53 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/14 2:20 PM, Cornelia Huck wrote:
> On Wed, Jul 14 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2021/7/13 8:28 PM, Cornelia Huck wrote:
>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> On 2021/7/13 7:31 PM, Cornelia Huck wrote:
>>>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>>> On 2021/7/13 4:19 PM, Cornelia Huck wrote:
>>>>>>> On Tue, Jul 13 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>>>>>
>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>>>> time).
>>>>>>>> Well. You need some kinds of waiting for sure, the device/DMA needs
>>>>>>>> sometime to be stopped. The downtime is determined by a specific virtio
>>>>>>>> implementation which is hard to be restricted at the spec level. We can
>>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>> I don't think we can introduce arbitrary upper bounds here. At most, we
>>>>>>> can say that the device SHOULD try to set the STOP bit as early as
>>>>>>> possible (and make use of the mechanism to expose in-flight buffers.)
>>>>>> Yes, that's my understanding.
>>>>>>
>>>>>>
>>>>>>> If we want to avoid polling for the STOP bit, we need some kind of
>>>>>>> notification mechanism, I guess. For ccw, I'd just use a channel
>>>>>>> command to stop the device; completion of that channel program would
>>>>>>> indicate that the device is done with the stop procedure.
>>>>>> A question, is interrupt used for such notification, or the VMM can
>>>>>> choose to poll for the completion?
>>>>> You can poll for the subchannel to become status pending.
>>>>>
>>>>>>> Not sure how
>>>>>>> well that translates to other transports.
>>>>>> Actually, it's not necessarily a busy polling. VMM can schedule other
>>>>>> process in and recheck the bit periodically.
>>>>>>
>>>>>> Or as you mentioned before, we can use some kind of interrupt but it
>>>>>> would be more complicated than the simple status bit. It's better to
>>>>>> introduce the interrupt only if the status bit doesn't fit.
>>>>> At least for ccw, waiting for the status bit to be set also involves an
>>>>> interrupt or polling (we use another channel program to retrieve the
>>>>> status.) A dedicated channel command would actually be better, as the
>>>>> interrupt/status pending would already inform us of success.
>>>> So it looks to me it doesn't conflict with this design: the device must
>>>> wait for the device to be stopped to signal the success of the ccw command?
>>> Yes, the difference is mainly how that information can be extracted.
>>
>> So I had a look at how reset is described for ccw:
>>
>> "
>>
>> In order to reset a device, a driver sends the CCW_CMD_VDEV_RESET command.
>>
>> "
>>
>> This implies something similar, that is the success of the command means
>> the success of the reset.
> Yes, indeed.
>
>> I wonder maybe I can remove the "re-read" from the basic facility and
>> let the transport to decide what to do.
>>
>> - for PCI, if a registers is used, we need re-read
>> - for CCW, follow the current implication, re-read is not needed and we
>> can wait/poll for the success of the ccw command
> If we are going with a status bit, it would be the same as for pci (we
> have WRITE_STATUS and READ_STATUS commands.)


So the spec is unclear about the implications of a command's success:

E.g. for RESET (CCW_CMD_VDEV_RESET), the success of the command implies 
the success of the reset.

But for set_status (CCW_CMD_WRITE_STATUS), the success of the command 
does not imply that the bit has been set by the device.

If I understand this correctly, we still need the re-read here.


>   If we are going with a
> distinct command, we can skip the re-read.


Then it would be better to introduce STOP as a dedicated device 
facility (like reset):

The device MUST present the STOP bit after it has been stopped.

And for PCI:

- it is set by setting the bit in the registers

for ccw:

- a distinct command (as for reset) is introduced, and setting STOP via 
the device status is forbidden?


> (I'd probably go with a more
> generic 'trigger an action' meta-command, but that would work just the
> same.)
>
>> - for admin virtqueue, it should be something similar to ccw, wait/poll
>> for the success of the admin virtqueue command
> Or we should maybe standardize on the admin virtqueue?


That's one way to go.


>   That seems less
> confusing to me.


But that's just one possible interface for carrying the commands. We 
still need to define the semantics of the "stop" facility first.

And we still need to clarify the implication of the success of each 
specific command, as for ccw (e.g. whether or not a re-read (get after 
set) is needed).

The only difference is the transport: ccw command vs virtqueue.

Thanks


>



* [virtio-dev] Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14  8:53                               ` Jason Wang
@ 2021-07-14  9:24                                 ` Cornelia Huck
  2021-07-15  2:01                                   ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-14  9:24 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Wed, Jul 14 2021, Jason Wang <jasowang@redhat.com> wrote:

> On 2021/7/14 2:20 PM, Cornelia Huck wrote:
>> On Wed, Jul 14 2021, Jason Wang <jasowang@redhat.com> wrote:
>>> So I had a look at how reset is described for ccw:
>>>
>>> "
>>>
>>> In order to reset a device, a driver sends the CCW_CMD_VDEV_RESET command.
>>>
>>> "
>>>
>>> This implies something similar, that is the success of the command means
>>> the success of the reset.
>> Yes, indeed.
>>
>>> I wonder maybe I can remove the "re-read" from the basic facility and
>>> let the transport to decide what to do.
>>>
>>> - for PCI, if a registers is used, we need re-read
>>> - for CCW, follow the current implication, re-read is not needed and we
>>> can wait/poll for the success of the ccw command
>> If we are going with a status bit, it would be the same as for pci (we
>> have WRITE_STATUS and READ_STATUS commands.)
>
>
> So spec is unclear of the implications of the success of a command:
>
> E.g for RESET (CCW_CMD_VDEV_RESET), the success of the command implies 
> the success of the reset.

Yes, sending RESET is basically the ccw equivalent of "write 0 to the
status", and getting a status/interrupt that the command finished
successfully is the equivalent of "get 0 when reading the status back".
[We did not have a "read back status" command originally.]

>
> But for set_status (CCW_CMD_WRITE_STATUS), the success of the command 
> does not imply the bit is set by the device.

Yes, the success only indicates that the device has received the command
successfully. It can still refuse to set some values, or only set them
later.

>
> If I understand this correctly, we still need re-read here.

Yes.

[Let me know if we can make this more clear in the spec!]

>
>
>>   If we are going with a
>> distinct command, we can skip the re-read.
>
>
> Then it would be better to introduce the STOP as a dedicated device 
> facility (as reset):
>
> The device MUST present STOP bit after it has been stopped.
>
> And for PCI:
>
> - it was set via set the bit in the registers
>
> for ccw:
>
> - a distinct command (as reset) is introduced, and STOP is forbidden to 
> set via device status?

I think the situation for reset is different: a zero status is a natural
way to express that a device is in its initial state. It does not really
matter whether it is a freshly initialized device, or whether it has
been reset by the driver, or which mechanism the driver is using.

For STOP, we'd end up indicating a certain status, with one way to
actually write the status, and the other a dedicated command. I'd expect
that the driver will still read the status to check whether the STOP bit
is present, as it may take some time, regardless of the transport used
(going by the Linux implementation, the various callbacks to interact
with the device state are assumed to be synchronous, and we have to make
the asynchronous ccw interactions synchronous beneath the covers; if we
stick with that model for STOP, the asynchronous nature of ccw commands
does not buy us anything.)

So, maybe using the same mechanism for every transport is better, if we
end up reading the status back anyway.

>
>
>> (I'd probably go with a more
>> generic 'trigger an action' meta-command, but that would work just the
>> same.)
>>
>>> - for admin virtqueue, it should be something similar to ccw, wait/poll
>>> for the success of the admin virtqueue command
>> Or we should maybe standardize on the admin virtqueue?
>
>
> That's one way to go.
>
>
>>   That seems less
>> confusing to me.
>
>
> But it's just one of the possible interface to carry the commands. We 
> still need to define the semantic or facility of "stop" first.
>
> And we still need to clarify the implication for the success of each 
> specific command as ccw. (E.g whether or not a re-read(get after set) is 
> need)
>
> The only difference is the transport: ccw command vs virtqueue.

Historically, too many differences in how the transports implement
device/driver interactions have led to some awkwardness (see e.g. the
might_sleep annotations, which are surprising to someone working with
the pci transport.) So I think there's benefit in making the
interactions either very similar, or so different that the transports
can do their own things. (As said above, having an extra ccw command for
STOP is probably only useful if generic code isn't polling the status
for the bit to be set.)

So, maybe either/or

- write STOP to status, read it back (via already existing methods)
- use a virtqueue

The extra ccw command would only make sense if other transports
implemented STOP via e.g. a new register (would that also be an option?)


---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-13 12:16                   ` Jason Wang
@ 2021-07-14  9:53                     ` Stefan Hajnoczi
  2021-07-14 10:29                       ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-14  9:53 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> 
> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > >      If I understand correctly, this is all
> > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > Yes.
> > > > > > > > 
> > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > 
> > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > destination device with the state, and then initializes the device.
> > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > 
> > > > > > > Am I missing something?
> > > > > > > 
> > > > > > > Thanks!
> > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > It's also not easy for devices to implement.
> > > > > It just requires a new status bit. Anything that makes you think it's hard
> > > > > to implement?
> > > > > 
> > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > virtqueue state.
> > > > > 
> > > > > 
> > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > and implementable by devices in a simple way?
> > > > > It use the common technology like register shadowing without any further
> > > > > stuffs.
> > > > > 
> > > > > Or do you have any other ideas?
> > > > > 
> > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > those state registers).
> > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > bit then there's no need to re-read the Device Status field in a loop
> > > > until the device has stopped.
> > > 
> > > Probably not. Let me to clarify several points:
> > > 
> > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > virtqueue could be used for carrying any basic device facility like status
> > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > for device slicing at virtio level.
> > > - Even if we had introduced admin virtqueue, we still need a per function
> > > interface for this. This is a must for nested virtualization, we can't
> > > always expect things like PF can be assigned to L1 guest.
> > > - According to the proposal, there's no need for the device to complete all
> > > the consumed buffers, device can choose to expose those inflight descriptors
> > > in a device specific way and set the STOP bit. This means, if we have the
> > > device specific in-flight descriptor reporting facility, the device can
> > > almost set the STOP bit immediately.
> > > - If we don't go with the basic device facility but using the admin
> > > virtqueue specific method, we still need to clarify how it works with the
> > > device status state machine, it will be some kind of sub-states which looks
> > > much more complicated than the current proposal.
> > > 
> > > 
> > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > extends downtime if implemented sequentially (stopping one device at a
> > > > time).
> > > 
> > > Well. You need some kinds of waiting for sure, the device/DMA needs sometime
> > > to be stopped. The downtime is determined by a specific virtio
> > > implementation which is hard to be restricted at the spec level. We can
> > > clarify that the device must set the STOP bit in e.g 100ms.
> > > 
> > > 
> > > >    It can be implemented concurrently (setting the STOP bit on all
> > > > devices and then looping until all their Device Status fields have the
> > > > bit set), but this becomes more complex to implement.
> > > 
> > > I still don't get what kind of complexity did you worry here.
> > > 
> > > 
> > > > I'm a little worried about adding a new bit that requires busy
> > > > waiting...
> > > 
> > > Busy wait is not something that is introduced in this patch:
> > > 
> > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > 
> > > After writing 0 to device_status, the driver MUST wait for a read of
> > > device_status to return 0 before reinitializing the device.
> > > 
> > > Since it was required for at least one transport. We need do something
> > > similar to when introducing basic facility.
> > Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
> > spec change. I like that.
> > 
> > On the other hand, devices need time to stop and that time can be
> > unbounded. For example, software virtio-blk/scsi implementations since
> > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > 
> > The natural interface for long-running operations is virtqueue requests.
> > That's why I mentioned the alternative of using an admin virtqueue
> > instead of a Device Status bit.
> 
> 
> So I'm not against the admin virtqueue. As said before, admin virtqueue
> could be used for carrying the device status bit.
> 
> Send a command to set STOP status bit to admin virtqueue. Device will make
> the command buffer used after it has successfully stopped the device.
> 
> AFAIK, they are not mutually exclusive, since they are trying to solve
> different problems.
> 
> Device status - basic device facility
> 
> Admin virtqueue - transport/device specific way to implement (part of) the
> device facility
> 
> > 
> > Although you mentioned that the stopped state needs to be reflected in
> > the Device Status field somehow, I'm not sure about that since the
> > driver typically doesn't need to know whether the device is being
> > migrated.
> 
> 
> The guest won't see the real device status bit. VMM will shadow the device
> status bit in this case.
> 
> E.g with the current vhost-vDPA, vDPA behave like a vhost device, guest is
> unaware of the migration.
> 
> STOP status bit is set by Qemu to real virtio hardware. But guest will only
> see the DRIVER_OK without STOP.
> 
> It's not hard to implement the nested on top, see the discussion initiated
> by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
> migration.
> 
> 
> >   In fact, the VMM would need to hide this bit and it's safer to
> > keep it out-of-band instead of risking exposing it by accident.
> 
> 
> See above, VMM may choose to hide or expose the capability. It's useful for
> migrating a nested guest.
> 
> If we design an interface that can be used in the nested environment, it's
> not an ideal interface.
> 
> 
> > 
> > In addition, stateful devices need to load/save non-trivial amounts of
> > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > good fit again.
> 
> 
> I don't get the point here. You still need to address the exact the similar
> issues for admin virtqueue: the unbound time in freezing the device, the
> interaction with the virtio device status state machine.

Device state can be large, so a register interface would be a
bottleneck. DMA is needed. I think a virtqueue is a good fit for
saving/loading device state.

If we're going to need it for saving/loading device state anyway, then
that's another reason to consider using a virtqueue for stopping the
device, saving/loading virtqueue state, etc.

> And with admin virtqueue, it's actually far more complicated e.g you need to
> define how to synchronize the concurrent access to the basic facilites.

I'm not sure I understand? Driver complexity? Device implementation
complexity?

> >   This isn't addressed in this patch series, but it's the
> > next step and I think it's worth planning for it.
> 
> 
> I agree, but for admin virtqueue, it's better to use it as a full transport
> instead of just use it for carrying part of the device basic facilities.
> Actually, as I said, I had patches to do that. But the motivation is not for
> live migration but for device slicing. I will post RFC before the KVM Forum
> this year (since I'm going to talk device slicing at virtio level). It does
> not conflict with Max's proposal, since migration part is not there.

Great, I'm looking forward to your device slicing idea.

> > If all devices could stop very quickly and were stateless then I would
> > agree that the STOP bit is an ideal solution.
> 
> 
> Note that in Max's proposal it also have something similar the "quiescence"
> and "freezed" state. It doesn't differ from STOP bit fundamentally. As Max
> suggested, we could introduce more status bit if necessary or even consider
> to unify Max's proposal with mine.
> 
> 
> > I think it will be
> > necessary to support devices that don't behave like that, so the admin
> > virtqueue approach seems worth exploring.
> 
> 
> Yes and as mentioned in another thread. I think the best way is to define
> the device specific state first and then consider how to implement the
> interface.
> 
> Admin virtqueue is worth to explore but should not be the only method.
> Device/transport are freed to implement it in many ways based on the actual
> hardware.

What's the advantage of this proposal compared to an admin virtqueue? I
see the admin virtqueue as a more general interface than this proposal
and it can cover this use case.

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14  9:53                     ` Stefan Hajnoczi
@ 2021-07-14 10:29                       ` Jason Wang
  2021-07-14 15:07                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-14 10:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>       If I understand correctly, this is all
>>>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>>>> Yes.
>>>>>>>>>
>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>>>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>>>>>>>> guest's reads and writes, and present it as coherent and transparent
>>>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>
>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>
>>>>>>>> Am I missing something?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>>>> It's also not easy for devices to implement.
>>>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>>>> to implement?
>>>>>>
>>>>>> E.g for networking device, it should be sufficient to use this bit + the
>>>>>> virtqueue state.
>>>>>>
>>>>>>
>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>> and implementable by devices in a simple way?
>>>>>> It use the common technology like register shadowing without any further
>>>>>> stuffs.
>>>>>>
>>>>>> Or do you have any other ideas?
>>>>>>
>>>>>> (I think we all know migration will be very hard if we simply pass through
>>>>>> those state registers).
>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>> until the device has stopped.
>>>> Probably not. Let me to clarify several points:
>>>>
>>>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>>>> virtqueue could be used for carrying any basic device facility like status
>>>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>>>> for device slicing at virtio level.
>>>> - Even if we had introduced admin virtqueue, we still need a per function
>>>> interface for this. This is a must for nested virtualization, we can't
>>>> always expect things like PF can be assigned to L1 guest.
>>>> - According to the proposal, there's no need for the device to complete all
>>>> the consumed buffers, device can choose to expose those inflight descriptors
>>>> in a device specific way and set the STOP bit. This means, if we have the
>>>> device specific in-flight descriptor reporting facility, the device can
>>>> almost set the STOP bit immediately.
>>>> - If we don't go with the basic device facility but using the admin
>>>> virtqueue specific method, we still need to clarify how it works with the
>>>> device status state machine, it will be some kind of sub-states which looks
>>>> much more complicated than the current proposal.
>>>>
>>>>
>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>> time).
>>>> Well. You need some kinds of waiting for sure, the device/DMA needs sometime
>>>> to be stopped. The downtime is determined by a specific virtio
>>>> implementation which is hard to be restricted at the spec level. We can
>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>
>>>>
>>>>>     It can be implemented concurrently (setting the STOP bit on all
>>>>> devices and then looping until all their Device Status fields have the
>>>>> bit set), but this becomes more complex to implement.
>>>> I still don't get what kind of complexity did you worry here.
>>>>
>>>>
>>>>> I'm a little worried about adding a new bit that requires busy
>>>>> waiting...
>>>> Busy wait is not something that is introduced in this patch:
>>>>
>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>
>>>> After writing 0 to device_status, the driver MUST wait for a read of
>>>> device_status to return 0 before reinitializing the device.
>>>>
>>>> Since it was required for at least one transport. We need do something
>>>> similar to when introducing basic facility.
>>> Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
>>> spec change. I like that.
>>>
>>> On the other hand, devices need time to stop and that time can be
>>> unbounded. For example, software virtio-blk/scsi implementations since
>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>
>>> The natural interface for long-running operations is virtqueue requests.
>>> That's why I mentioned the alternative of using an admin virtqueue
>>> instead of a Device Status bit.
>>
>> So I'm not against the admin virtqueue. As said before, admin virtqueue
>> could be used for carrying the device status bit.
>>
>> Send a command to set STOP status bit to admin virtqueue. Device will make
>> the command buffer used after it has successfully stopped the device.
>>
>> AFAIK, they are not mutually exclusive, since they are trying to solve
>> different problems.
>>
>> Device status - basic device facility
>>
>> Admin virtqueue - transport/device specific way to implement (part of) the
>> device facility
>>
>>> Although you mentioned that the stopped state needs to be reflected in
>>> the Device Status field somehow, I'm not sure about that since the
>>> driver typically doesn't need to know whether the device is being
>>> migrated.
>>
>> The guest won't see the real device status bit. VMM will shadow the device
>> status bit in this case.
>>
>> E.g with the current vhost-vDPA, vDPA behave like a vhost device, guest is
>> unaware of the migration.
>>
>> STOP status bit is set by Qemu to real virtio hardware. But guest will only
>> see the DRIVER_OK without STOP.
>>
>> It's not hard to implement the nested on top, see the discussion initiated
>> by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
>> migration.
>>
>>
>>>    In fact, the VMM would need to hide this bit and it's safer to
>>> keep it out-of-band instead of risking exposing it by accident.
>>
>> See above, VMM may choose to hide or expose the capability. It's useful for
>> migrating a nested guest.
>>
>> If we design an interface that can be used in the nested environment, it's
>> not an ideal interface.
>>
>>
>>> In addition, stateful devices need to load/save non-trivial amounts of
>>> data. They need DMA to do this efficiently, so an admin virtqueue is a
>>> good fit again.
>>
>> I don't get the point here. You still need to address the exact the similar
>> issues for admin virtqueue: the unbound time in freezing the device, the
>> interaction with the virtio device status state machine.
> Device state can be large so a register interface would be a
> bottleneck. DMA is needed. I think a virtqueue is a good fit for
> saving/loading device state.


So this patch doesn't mandate a register interface, does it? And DMA 
doesn't mean a virtqueue; it could be a transport-specific method.

I think we need to start by defining the state of one specific device 
and then see what the best interface is.

Note that software can choose to intercept all the control commands, and 
shadow them. This means the best interface could be device specific.


>
> If we're going to need it for saving/loading device state anyway, then
> that's another reason to consider using a virtqueue for stopping the
> device, saving/loading virtqueue state, etc.


It requires much more work than the simple virtqueue interface (the 
main issue is that the functionality is not self-contained in a single 
function):

1) how to interact with the existing device status state machine?
2) how to make it work in a nested environment?
3) how to migrate the PF?
4) do we need to allow more control than just stopping/freezing the 
device via the admin virtqueue? If yes, how do we handle concurrent 
access from the PF and the VF?
5) how is it expected to work with non-PCI virtio devices?

And as I've stated several times, a virtqueue is the interface or 
transport which carries the commands that implement specific 
semantics. It doesn't conflict with what is proposed in this patch.


>
>> And with admin virtqueue, it's actually far more complicated e.g you need to
>> define how to synchronize the concurrent access to the basic facilites.
> I'm not sure I understand? Driver complexity? Device implementation
> complexity?


See the above questions.


>
>>>    This isn't addressed in this patch series, but it's the
>>> next step and I think it's worth planning for it.
>>
>> I agree, but for admin virtqueue, it's better to use it as a full transport
>> instead of just use it for carrying part of the device basic facilities.
>> Actually, as I said, I had patches to do that. But the motivation is not for
>> live migration but for device slicing. I will post RFC before the KVM Forum
>> this year (since I'm going to talk device slicing at virtio level). It does
>> not conflict with Max's proposal, since migration part is not there.
> Great, I'm looking forward to your device slicing idea.


Thanks


>
>>> If all devices could stop very quickly and were stateless then I would
>>> agree that the STOP bit is an ideal solution.
>>
>> Note that in Max's proposal it also have something similar the "quiescence"
>> and "freezed" state. It doesn't differ from STOP bit fundamentally. As Max
>> suggested, we could introduce more status bit if necessary or even consider
>> to unify Max's proposal with mine.
>>
>>
>>> I think it will be
>>> necessary to support devices that don't behave like that, so the admin
>>> virtqueue approach seems worth exploring.
>>
>> Yes and as mentioned in another thread. I think the best way is to define
>> the device specific state first and then consider how to implement the
>> interface.
>>
>> Admin virtqueue is worth to explore but should not be the only method.
>> Device/transport are freed to implement it in many ways based on the actual
>> hardware.
> What's the advantage of this proposal compared to an admin virtqueue?


See above. They do not contradict each other. The admin virtqueue could 
be used to transport the basic facility that is introduced in this patch.


>   I
> see the admin virtqueue as a more general interface than this proposal
> and it can cover this use case.


Actually no: this proposal doesn't limit how it is actually 
implemented in the device or transport, so it is a more general proposal.

If you are talking about the patch implementing it via a PCI 
capability, that is just a demonstration of one possible implementation.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14 10:29                       ` Jason Wang
@ 2021-07-14 15:07                         ` Stefan Hajnoczi
  2021-07-14 16:22                           ` Max Gurtovoy
  2021-07-15  1:35                           ` Jason Wang
  0 siblings, 2 replies; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-14 15:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> 
> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > >       If I understand correctly, this is all
> > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > Yes.
> > > > > > > > > > 
> > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > 
> > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > 
> > > > > > > > > Am I missing something?
> > > > > > > > > 
> > > > > > > > > Thanks!
> > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > It's also not easy for devices to implement.
> > > > > > > It just requires a new status bit. Anything that makes you think it's hard
> > > > > > > to implement?
> > > > > > > 
> > > > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > > > virtqueue state.
> > > > > > > 
> > > > > > > 
> > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > and implementable by devices in a simple way?
> > > > > > > It uses a common technique, register shadowing, without any further
> > > > > > > machinery.
> > > > > > > 
> > > > > > > Or do you have any other ideas?
> > > > > > > 
> > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > those state registers).
> > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > until the device has stopped.
> > > > > Probably not. Let me clarify several points:
> > > > > 
> > > > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > > > virtqueue could be used for carrying any basic device facility like status
> > > > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > > > for device slicing at virtio level.
> > > > > - Even if we had introduced an admin virtqueue, we would still need a
> > > > > per-function interface for this. This is a must for nested virtualization;
> > > > > we can't always expect that things like the PF can be assigned to the L1
> > > > > guest.
> > > > > - According to the proposal, there's no need for the device to complete all
> > > > > the consumed buffers, device can choose to expose those inflight descriptors
> > > > > in a device specific way and set the STOP bit. This means, if we have the
> > > > > device specific in-flight descriptor reporting facility, the device can
> > > > > almost set the STOP bit immediately.
> > > > > - If we don't go with the basic device facility but using the admin
> > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > device status state machine, it will be some kind of sub-states which looks
> > > > > much more complicated than the current proposal.
> > > > > 
> > > > > 
> > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > time).
> > > > > Well, you need some kind of waiting for sure; the device/DMA needs some
> > > > > time to stop. The downtime is determined by the specific virtio
> > > > > implementation, which is hard to restrict at the spec level. We could
> > > > > clarify that the device must set the STOP bit within e.g. 100ms.
> > > > > 
> > > > > 
> > > > > >     It can be implemented concurrently (setting the STOP bit on all
> > > > > > devices and then looping until all their Device Status fields have the
> > > > > > bit set), but this becomes more complex to implement.
> > > > > I still don't get what kind of complexity you are worried about here.
> > > > > 
> > > > > 
> > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > waiting...
> > > > > Busy wait is not something that is introduced in this patch:
> > > > > 
> > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > 
> > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > device_status to return 0 before reinitializing the device.
> > > > > 
> > > > > Since this was already required for at least one transport, we need to do
> > > > > something similar when introducing a basic facility.
> > > > Adding the STOP but as a Device Status bit is a small and clean VIRTIO
> > > > spec change. I like that.
> > > > 
> > > > On the other hand, devices need time to stop and that time can be
> > > > unbounded. For example, software virtio-blk/scsi implementations
> > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > 
> > > > The natural interface for long-running operations is virtqueue requests.
> > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > instead of a Device Status bit.
> > > 
> > > So I'm not against the admin virtqueue. As said before, admin virtqueue
> > > could be used for carrying the device status bit.
> > > 
> > > Send a command to set STOP status bit to admin virtqueue. Device will make
> > > the command buffer used after it has successfully stopped the device.
> > > 
> > > AFAIK, they are not mutually exclusive, since they are trying to solve
> > > different problems.
> > > 
> > > Device status - basic device facility
> > > 
> > > Admin virtqueue - transport/device specific way to implement (part of) the
> > > device facility
> > > 
> > > > Although you mentioned that the stopped state needs to be reflected in
> > > > the Device Status field somehow, I'm not sure about that since the
> > > > driver typically doesn't need to know whether the device is being
> > > > migrated.
> > > 
> > > The guest won't see the real device status bit. VMM will shadow the device
> > > status bit in this case.
> > > 
> > > E.g. with the current vhost-vDPA, vDPA behaves like a vhost device and the
> > > guest is unaware of the migration.
> > > 
> > > The STOP status bit is set by QEMU on the real virtio hardware, but the
> > > guest will only see DRIVER_OK without STOP.
> > > 
> > > It's not hard to implement nesting on top; see the discussion initiated
> > > by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live
> > > migration.
> > > 
> > > 
> > > >    In fact, the VMM would need to hide this bit and it's safer to
> > > > keep it out-of-band instead of risking exposing it by accident.
> > > 
> > > See above, VMM may choose to hide or expose the capability. It's useful for
> > > migrating a nested guest.
> > > 
> > > If we design an interface that can't be used in a nested environment, it's
> > > not an ideal interface.
> > > 
> > > 
> > > > In addition, stateful devices need to load/save non-trivial amounts of
> > > > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > > > good fit again.
> > > 
> > > I don't get the point here. You still need to address exactly the same
> > > issues for the admin virtqueue: the unbounded time to freeze the device and
> > > the interaction with the virtio device status state machine.
> > Device state can be large so a register interface would be a
> > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > saving/loading device state.
> 
> 
> So this patch doesn't mandate a register interface, does it?

You're right, not this patch. I mentioned it because your other patch
series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
it as a register interface.

> And DMA
> doesn't mean a virtqueue; it could be a transport-specific method.

Yes, although virtqueues are a pretty good interface that works across
transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
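(For reference, a compact sketch of the split-ring layout in question; the struct spellings below are illustrative rather than the spec's exact identifiers, but the field order and sizes follow the VIRTIO specification:)

```c
#include <assert.h>
#include <stdint.h>

#define VQ_SIZE 8 /* example queue size; must be a power of two */

/* Descriptor table entry. */
struct vring_desc {
    uint64_t addr;  /* guest-physical buffer address */
    uint32_t len;
    uint16_t flags; /* NEXT / WRITE / INDIRECT */
    uint16_t next;
};

/* Available ring: written by the driver, read by the device. */
struct vring_avail {
    uint16_t flags;
    uint16_t idx;            /* driver-owned free-running index */
    uint16_t ring[VQ_SIZE];
};

struct vring_used_elem {
    uint32_t id;
    uint32_t len;
};

/* Used ring: written by the device, read by the driver. */
struct vring_used {
    uint16_t flags;
    uint16_t idx;            /* device-owned free-running index */
    struct vring_used_elem ring[VQ_SIZE];
};
```

Because this memory layout is the same regardless of how the queue is discovered and configured, a virtqueue-based control channel carries over to PCI, MMIO and CCW alike.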

> I think we need to start from defining the state of one specific device and
> see what is the best interface.

virtio-blk might be the simplest. I think virtio-net has more device
state and virtio-scsi is definitely more complex than virtio-blk.

First we need agreement on whether "device state" encompasses the full
state of the device or just state that is unknown to the VMM. That's
basically the difference between vhost/vDPA's selective passthrough
approach and VFIO's full passthrough approach. For example, some of the
virtio-net state is available to the VMM with vhost/vDPA because it
intercepts the virtio-net control virtqueue.

Also, we need to decide to what degree the device state representation
is standardized in the VIRTIO specification. I think it makes sense to
standardize it if it's possible to convey all necessary state and device
implementors can easily implement this device state representation. If
not, then device implementation-specific device state would be needed.

I think that's a larger discussion that deserves its own email thread.

> Note that software can choose to intercept all the control commands, and
> shadow them. This means the best interface could be device specific.
> 
> 
> > 
> > If we're going to need it for saving/loading device state anyway, then
> > that's another reason to consider using a virtqueue for stopping the
> > device, saving/loading virtqueue state, etc.
> 
> 
> It requires much more work than the simple virtqueue interface (the main
> issue is that the functionality is not self-contained in a single function):
> 
> 1) how to interact with the existing device status state machine?
> 2) how to make it work in a nested environment?
> 3) how to migrate the PF?
> 4) do we need to allow more control other than just stop/freeze the device
> in the admin virtqueue? If yes, how to handle the concurrent access from PF
> and VF?
> 5) how it is expected to work with non-PCI virtio device?

I guess your device splitting proposal addresses some of these things?

Max probably has the most to say about these points.

If you want more input I can try to answer too, but I personally am not
developing devices that need this right now, so I might not be the best
person to propose solutions.

> And as I've stated several times, virtqueue is the interface or transport
> which carries the commands for implementing specific semantics. It doesn't
> conflict with what is proposed in this patch.

The abstract operations for stopping the device and fetching virtqueue
state sound good to me, but I don't think a Device Status field STOP bit
should be added. An out-of-band stop operation would better support
devices that take a long time to stop.
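For concreteness, the busy-wait sequence under discussion (set the STOP bit, then poll the Device Status field until the device reports it) can be sketched against a toy device model as below. This is only an illustration: the STOP bit value and the register helpers are assumptions, not spec or kernel identifiers, since STOP is only a proposal.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK 0x04 /* value per the spec */
#define VIRTIO_STATUS_STOP      0x20 /* proposed bit; value is an assumption */

/* Toy device model: a real device stops DMA asynchronously, so it may
 * need several status reads before STOP is reflected. */
static uint8_t device_status = VIRTIO_STATUS_DRIVER_OK;
static int stop_latency = 3; /* reads until the device reports STOP */

static uint8_t read_status(void)
{
    if ((device_status & VIRTIO_STATUS_STOP) && stop_latency > 0) {
        stop_latency--;
        return device_status & ~VIRTIO_STATUS_STOP; /* not stopped yet */
    }
    return device_status;
}

static void write_status(uint8_t s) { device_status = s; }

/* The busy-wait sequence: set STOP, then poll until the device
 * acknowledges it (mirroring the existing reset rule in 4.1.4.3.2). */
static int stop_device(void)
{
    write_status(read_status() | VIRTIO_STATUS_STOP);
    for (int tries = 0; tries < 100; tries++) {
        if (read_status() & VIRTIO_STATUS_STOP)
            return 0; /* device stopped; vq state may now be read */
    }
    return -1; /* timeout */
}
```

An out-of-band operation would replace the polling loop with a request that completes only once the device has actually stopped.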

Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14 15:07                         ` Stefan Hajnoczi
@ 2021-07-14 16:22                           ` Max Gurtovoy
  2021-07-15  1:38                             ` Jason Wang
  2021-07-15  1:35                           ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-14 16:22 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
>> It requires much more works than the simple virtqueue interface: (the main
>> issues is that the function is not self-contained in a single function)
>>
>> 1) how to interact with the existing device status state machine?
>> 2) how to make it work in a nested environment?
>> 3) how to migrate the PF?
>> 4) do we need to allow more control other than just stop/freeze the device
>> in the admin virtqueue? If yes, how to handle the concurrent access from PF
>> and VF?
>> 5) how it is expected to work with non-PCI virtio device?
> I guess your device splitting proposal addresses some of these things?
>
> Max probably has the most to say about these points.
>
> If you want more input I can try to answer too, but I personally am not
> developing devices that need this right now, so I might not be the best
> person to propose solutions.

I think we mentioned this in the past and agreed that the only common
entity between my solution for virtio VF migration and this proposal is
the new admin control queue.

I can prepare some draft for this.

In our solution the PF will manage the migration process for its VFs using
the PF admin queue. The PF is not migratable.

I don't know who is using nested environments in production, so I don't
know if it's worth talking about that.

But if you would like to implement it for testing, no problem. The VF
at level n is probably seen as a PF at level n+1, so it can manage the
migration process for its nested VFs.

For question 5), for which non-PCI devices is live migration interesting?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14 15:07                         ` Stefan Hajnoczi
  2021-07-14 16:22                           ` Max Gurtovoy
@ 2021-07-15  1:35                           ` Jason Wang
  2021-07-15  9:16                             ` [virtio-dev] " Stefan Hajnoczi
  2021-07-15 10:01                             ` Stefan Hajnoczi
  1 sibling, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-15  1:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 7/14/2021 11:07 PM, Stefan Hajnoczi wrote:
> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>> On 7/14/2021 5:53 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>> On 7/13/2021 6:00 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>> On 7/12/2021 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>> On 7/11/2021 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>        If I understand correctly, this is all
>>>>>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>>>>>> Yes.
>>>>>>>>>>>
>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>>>>>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>>>>>>>>>> guest's reads and writes, and present it as coherent and transparent
>>>>>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>
>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>>>
>>>>>>>>>> Am I missing something?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>>>>>> to implement?
>>>>>>>>
>>>>>>>> E.g for networking device, it should be sufficient to use this bit + the
>>>>>>>> virtqueue state.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>> It use the common technology like register shadowing without any further
>>>>>>>> stuffs.
>>>>>>>>
>>>>>>>> Or do you have any other ideas?
>>>>>>>>
>>>>>>>> (I think we all know migration will be very hard if we simply pass through
>>>>>>>> those state registers).
>>>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>>>> until the device has stopped.
>>>>>> Probably not. Let me to clarify several points:
>>>>>>
>>>>>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>>>>>> virtqueue could be used for carrying any basic device facility like status
>>>>>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>>>>>> for device slicing at virtio level.
>>>>>> - Even if we had introduced admin virtqueue, we still need a per function
>>>>>> interface for this. This is a must for nested virtualization, we can't
>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>> - According to the proposal, there's no need for the device to complete all
>>>>>> the consumed buffers, device can choose to expose those inflight descriptors
>>>>>> in a device specific way and set the STOP bit. This means, if we have the
>>>>>> device specific in-flight descriptor reporting facility, the device can
>>>>>> almost set the STOP bit immediately.
>>>>>> - If we don't go with the basic device facility but using the admin
>>>>>> virtqueue specific method, we still need to clarify how it works with the
>>>>>> device status state machine, it will be some kind of sub-states which looks
>>>>>> much more complicated than the current proposal.
>>>>>>
>>>>>>
>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>> time).
>>>>>> Well. You need some kinds of waiting for sure, the device/DMA needs sometime
>>>>>> to be stopped. The downtime is determined by a specific virtio
>>>>>> implementation which is hard to be restricted at the spec level. We can
>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>
>>>>>>
>>>>>>>      It can be implemented concurrently (setting the STOP bit on all
>>>>>>> devices and then looping until all their Device Status fields have the
>>>>>>> bit set), but this becomes more complex to implement.
>>>>>> I still don't get what kind of complexity did you worry here.
>>>>>>
>>>>>>
>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>> waiting...
>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>
>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>>
>>>>>> After writing 0 to device_status, the driver MUST wait for a read of
>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>
>>>>>> Since it was required for at least one transport. We need do something
>>>>>> similar to when introducing basic facility.
>>>>> Adding the STOP but as a Device Status bit is a small and clean VIRTIO
>>>>> spec change. I like that.
>>>>>
>>>>> On the other hand, devices need time to stop and that time can be
>>>>> unbounded. For example, software virtio-blk/scsi implementations since
>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>
>>>>> The natural interface for long-running operations is virtqueue requests.
>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>> instead of a Device Status bit.
>>>> So I'm not against the admin virtqueue. As said before, admin virtqueue
>>>> could be used for carrying the device status bit.
>>>>
>>>> Send a command to set STOP status bit to admin virtqueue. Device will make
>>>> the command buffer used after it has successfully stopped the device.
>>>>
>>>> AFAIK, they are not mutually exclusive, since they are trying to solve
>>>> different problems.
>>>>
>>>> Device status - basic device facility
>>>>
>>>> Admin virtqueue - transport/device specific way to implement (part of) the
>>>> device facility
>>>>
>>>>> Although you mentioned that the stopped state needs to be reflected in
>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>> driver typically doesn't need to know whether the device is being
>>>>> migrated.
>>>> The guest won't see the real device status bit. VMM will shadow the device
>>>> status bit in this case.
>>>>
>>>> E.g with the current vhost-vDPA, vDPA behave like a vhost device, guest is
>>>> unaware of the migration.
>>>>
>>>> STOP status bit is set by Qemu to real virtio hardware. But guest will only
>>>> see the DRIVER_OK without STOP.
>>>>
>>>> It's not hard to implement the nested on top, see the discussion initiated
>>>> by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
>>>> migration.
>>>>
>>>>
>>>>>     In fact, the VMM would need to hide this bit and it's safer to
>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>> See above, VMM may choose to hide or expose the capability. It's useful for
>>>> migrating a nested guest.
>>>>
>>>> If we design an interface that can be used in the nested environment, it's
>>>> not an ideal interface.
>>>>
>>>>
>>>>> In addition, stateful devices need to load/save non-trivial amounts of
>>>>> data. They need DMA to do this efficiently, so an admin virtqueue is a
>>>>> good fit again.
>>>> I don't get the point here. You still need to address the exact the similar
>>>> issues for admin virtqueue: the unbound time in freezing the device, the
>>>> interaction with the virtio device status state machine.
>>> Device state state can be large so a register interface would be a
>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>> saving/loading device state.
>>
>> So this patch doesn't mandate a register interface, isn't it?
> You're right, not this patch. I mentioned it because your other patch
> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> it a register interface.
>
>> And DMA
>> doesn't means a virtqueue, it could be a transport specific method.
> Yes, although virtqueues are a pretty good interface that works across
> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>
>> I think we need to start from defining the state of one specific device and
>> see what is the best interface.
> virtio-blk might be the simplest. I think virtio-net has more device
> state and virtio-scsi is definitely more complext than virtio-blk.
>
> First we need agreement on whether "device state" encompasses the full
> state of the device or just state that is unknown to the VMM.


I think we've discussed this in the past. It can't work since:

1) The state and its format must be clearly defined in the spec
2) We need to maintain migration compatibility and debuggability
3) It is not a proper uAPI design


> That's
> basically the difference between the vhost/vDPA's selective passthrough
> approach and VFIO's full passthrough approach.


We can't do VFIO full passthrough for migration anyway; some kind of mdev
is required, but that duplicates the current vp_vdpa driver.


>   For example, some of the
> virtio-net state is available to the VMM with vhost/vDPA because it
> intercepts the virtio-net control virtqueue.
>
> Also, we need to decide to what degree the device state representation
> is standardized in the VIRTIO specification.


I think all the states must be defined in the spec, otherwise the device
can't claim it supports migration at the virtio level.


>   I think it makes sense to
> standardize it if it's possible to convey all necessary state and device
> implementors can easily implement this device state representation.


I suspect it's highly device-specific. E.g. can we standardize device (GPU)
memory?


> If
> not, then device implementation-specific device state would be needed.


Yes.


>
> I think that's a larger discussion that deserves its own email thread.


I agree, but it doesn't prevent us from starting with a simple device for
which virtqueue state is sufficient (e.g. virtio-net).
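As a starting point, the per-virtqueue state in this proposal is small: the cover letter splits it into available state and used state, with used_idx added for split virtqueues because the used ring is read-only to the driver. A rough sketch follows; the struct and macro spellings are illustrative (the status bit values match the spec), and the gating rule reflects the cover letter's "set the state after FEATURES_OK but before DRIVER_OK".

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-virtqueue state for a split virtqueue, following the
 * cover letter's split into available state and used state. */
struct vq_split_state {
    uint16_t avail_idx; /* available state: next index the driver uses */
    uint16_t used_idx;  /* used state: next index the device marks used */
};

/* Device status bits (values per the VIRTIO spec; spellings illustrative). */
#define VIRTIO_STATUS_DRIVER_OK   0x04
#define VIRTIO_STATUS_FEATURES_OK 0x08

/* Per the proposal, the driver may only set virtqueue state after
 * FEATURES_OK but before DRIVER_OK. */
static int vq_state_restore_allowed(uint8_t device_status)
{
    return (device_status & VIRTIO_STATUS_FEATURES_OK) &&
           !(device_status & VIRTIO_STATUS_DRIVER_OK);
}
```

The gap between avail_idx and used_idx is the set of in-flight descriptors, which is why the proposal leaves reporting them to a device-specific facility.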


>
>> Note that software can choose to intercept all the control commands, and
>> shadow them. This means the best interface could be device specific.
>>
>>
>>> If we're going to need it for saving/loading device state anyway, then
>>> that's another reason to consider using a virtqueue for stopping the
>>> device, saving/loading virtqueue state, etc.
>>
>> It requires much more works than the simple virtqueue interface: (the main
>> issues is that the function is not self-contained in a single function)
>>
>> 1) how to interact with the existing device status state machine?
>> 2) how to make it work in a nested environment?
>> 3) how to migrate the PF?
>> 4) do we need to allow more control other than just stop/freeze the device
>> in the admin virtqueue? If yes, how to handle the concurrent access from PF
>> and VF?
>> 5) how it is expected to work with non-PCI virtio device?
> I guess your device splitting proposal addresses some of these things?


Note that the device facility doesn't limit how it is used. So the
difference is a per-function (VF) interface vs. a PF interface.

A per-function interface is self-contained, so it can address all the above
issues:

1) STOP is part of the device status state machine
2) it is self-contained in the function, so it works in a nested
environment by simply assigning the function, without any other dependency
3) for the PF, those functions are still self-contained, so it can be migrated
4) the problem doesn't exist since we have a single control path
5) non-PCI devices are free to implement their own per-device interface

And actually, a per-function interface has already been implemented by some
vendors.


>
> Max probably has the most to say about these points.
>
> If you want more input I can try to answer too, but I personally am not
> developing devices that need this right now, so I might not be the best
> person to propose solutions.


I think we should make sure the architecture is good enough that it is
not limited to any specific use case.


>
>> And as I've stated several times, virtqueue is the interface or transport
>> which carries the commands for implementing specific semantics. It doesn't
>> conflict with what is proposed in this patch.
> The abstract operations for stopping the device and fetching virtqueue
> state sound good to me, but I don't think a Device Status field STOP bit
> should be added. An out-of-band stop operation would support devices
> that take a long time to stop better.


So long-running requests are not something introduced by the STOP
bit; the spec already uses that model for reset.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14 16:22                           ` Max Gurtovoy
@ 2021-07-15  1:38                             ` Jason Wang
  2021-07-15  9:26                               ` Stefan Hajnoczi
  2021-07-15 21:18                               ` Michael S. Tsirkin
  0 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-15  1:38 UTC (permalink / raw)
  To: Max Gurtovoy, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 7/15/2021 12:22 AM, Max Gurtovoy wrote:
>
> On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
>>> It requires much more works than the simple virtqueue interface: 
>>> (the main
>>> issues is that the function is not self-contained in a single function)
>>>
>>> 1) how to interact with the existing device status state machine?
>>> 2) how to make it work in a nested environment?
>>> 3) how to migrate the PF?
>>> 4) do we need to allow more control other than just stop/freeze the 
>>> device
>>> in the admin virtqueue? If yes, how to handle the concurrent access 
>>> from PF
>>> and VF?
>>> 5) how it is expected to work with non-PCI virtio device?
>> I guess your device splitting proposal addresses some of these things?
>>
>> Max probably has the most to say about these points.
>>
>> If you want more input I can try to answer too, but I personally am not
>> developing devices that need this right now, so I might not be the best
>> person to propose solutions.
>
> I think we mentioned this in the past and agreed that the only common 
> entity between my solution for virtio VF migration to this proposal is 
> the new admin control queue.
>
> I can prepare some draft for this.
>
> In our solution the PF will manage migration process for it's VFs 
> using the PF admin queue. PF is not migratable.


That limits the use cases.


>
> I don't know who is using nested environments in production so don't 
> know if it worth talking about that.


There should be plenty of users for the nested case.


>
> But, if you would like to implement it for testing, no problem. The VF 
> in level n, probably seen as PF in level n+1. So it can manage the 
> migration process for its nested VFs.


The PF dependency makes the design almost impossible to use in a
nested environment.


>
> For question 5) what non-PCI devices are interesting in live migration ?
>

Why not? Virtio supports transports other than PCI (CCW, MMIO).

Thanks




* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-14  9:24                                 ` [virtio-dev] " Cornelia Huck
@ 2021-07-15  2:01                                   ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-15  2:01 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 7/14/2021 5:24 PM, Cornelia Huck wrote:
> On Wed, Jul 14 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> On 7/14/2021 2:20 PM, Cornelia Huck wrote:
>>> On Wed, Jul 14 2021, Jason Wang <jasowang@redhat.com> wrote:
>>>> So I had a look at how reset is described for ccw:
>>>>
>>>> "
>>>>
>>>> In order to reset a device, a driver sends the CCW_CMD_VDEV_RESET command.
>>>>
>>>> "
>>>>
>>>> This implies something similar, that is the success of the command means
>>>> the success of the reset.
>>> Yes, indeed.
>>>
>>>> I wonder maybe I can remove the "re-read" from the basic facility and
>>>> let the transport to decide what to do.
>>>>
>>>> - for PCI, if a registers is used, we need re-read
>>>> - for CCW, follow the current implication, re-read is not needed and we
>>>> can wait/poll for the success of the ccw command
>>> If we are going with a status bit, it would be the same as for pci (we
>>> have WRITE_STATUS and READ_STATUS commands.)
>>
>> So the spec is unclear about the implications of the success of a command:
>>
>> E.g. for RESET (CCW_CMD_VDEV_RESET), the success of the command implies
>> the success of the reset.
> Yes, sending RESET is basically the ccw equivalent of "write 0 to the
> status", and getting a status/interrupt that the command finished
> successfully is the equivalent of "get 0 when reading the status back".
> [We did not have a "read back status" command originally.]
>
>> But for set_status (CCW_CMD_WRITE_STATUS), the success of the command
>> does not imply the bit is set by the device.
> Yes, the success only indicates that the device has received the command
> successfully. It can still refuse to set some values, or only set them
> later.
>
>> If I understand this correctly, we still need re-read here.
> Yes.
>
> [Let me know if we can make this more clear in the spec!]


We probably need to clarify them; I could only deduce those implications 
by reading the kernel driver code.

Basically, it's the subtle difference between VDEV_RESET and WRITE_STATUS 
that needs clarifying.
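As a toy sketch of that difference (the class, method names, and delays below are made up for illustration; they are not from the spec or the kernel driver): a VDEV_RESET-style command is complete when the command succeeds, while a WRITE_STATUS-style command only guarantees delivery, so the driver has to read the status back until the device reflects it:

```python
class MockCcwDevice:
    """Toy model of the two ccw command semantics discussed above."""

    def __init__(self):
        self.status = 0
        self._pending = 0
        self._delay = 0

    def vdev_reset(self):
        # CCW_CMD_VDEV_RESET: command success implies the reset completed.
        self.status = 0
        self._delay = 0
        return True

    def write_status(self, value):
        # CCW_CMD_WRITE_STATUS: success only means the command was received;
        # the device may apply the value later (or refuse some bits).
        self._pending = value
        self._delay = 3  # pretend the device takes three reads to catch up
        return True

    def read_status(self):
        if self._delay > 0:
            self._delay -= 1
            if self._delay == 0:
                self.status = self._pending
        return self.status


def set_status_synchronously(dev, value, max_polls=100):
    """Driver side: write the status, then re-read until it sticks."""
    assert dev.write_status(value)
    for _ in range(max_polls):
        if dev.read_status() == value:
            return True
    return False  # a real driver would treat this as a timeout


DRIVER_OK = 0x4
dev = MockCcwDevice()
assert set_status_synchronously(dev, DRIVER_OK)
assert dev.vdev_reset() and dev.read_status() == 0
print("ok")
```

The same shape would apply to the proposed STOP bit: with a status-register interface the re-read loop is unavoidable, whereas a dedicated command could fold the wait into command completion.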


>
>>
>>>    If we are going with a
>>> distinct command, we can skip the re-read.
>>
>> Then it would be better to introduce the STOP as a dedicated device
>> facility (as reset):
>>
>> The device MUST present the STOP bit after it has been stopped.
>>
>> And for PCI:
>>
>> - it is set by setting the bit in the registers
>>
>> for ccw:
>>
>> - a distinct command (as for reset) is introduced, and STOP is forbidden
>> to be set via the device status?
> I think the situation for reset is different: a zero status is a natural
> way to express that a device is in its initial state. It does not really
> matter whether it is a freshly initialized device, or whether it has
> been reset by the driver, or which mechanism the driver is using.
>
> For STOP, we'd end up indicating a certain status, with one way to
> actually write the status, and the other a dedicated command. I'd expect
> that the driver will still read the status to check whether the STOP bit
> is present, as it may take some time, regardless of the transport used
> (going by the Linux implementation, the various callbacks to interact
> with the device state are assumed to be synchronous, and we have to make
> the asynchronous ccw interactions synchronous beneath the covers; if we
> stick with that model for STOP, the asynchronous nature of ccw commands
> does not buy us anything.)
>
> So, maybe using the same mechanism for every transport is better, if we
> end up reading the status back anyway.


Ok.


>
>>
>>> (I'd probably go with a more
>>> generic 'trigger an action' meta-command, but that would work just the
>>> same.)
>>>
>>>> - for admin virtqueue, it should be something similar to ccw, wait/poll
>>>> for the success of the admin virtqueue command
>>> Or we should maybe standardize on the admin virtqueue?
>>
>> That's one way to go.
>>
>>
>>>    That seems less
>>> confusing to me.
>>
>> But it's just one of the possible interfaces to carry the commands. We
>> still need to define the semantics or facility of "stop" first.
>>
>> And we still need to clarify the implication of the success of each
>> specific command, as for ccw (e.g. whether or not a re-read (get after
>> set) is needed).
>>
>> The only difference is the transport: ccw command vs virtqueue.
> Historically, too many differences in how the transports implement
> device/driver interactions have lead to some awkwardness (see e.g. the
> might_sleep annotations, which are surprising to someone working with
> the pci transport.)


Yes, but it actually helps transports that depend on an extra layer to 
talk to the device. So it's not bad.


> So I think there's benefit in making the
> interactions either very similar, or so different that the transports
> can do their own things. (As said above, having an extra ccw command for
> STOP is probably only useful if generic code isn't polling the status
> for the bit to be set.)


I agree, but it would require the "general API" like the admin virtqueue 
to carry not only the STOP stuff but all the other configuration. 
Otherwise it's still a partial solution.


>
> So, maybe either/or
>
> - write STOP to status, read it back (via already existing methods)
> - use a virtqueue
>
> The extra ccw command would only make sense if other transports
> implemented STOP via e.g. a new register (would that also be an option?)


That's possible, but then it's better to introduce it as a basic 
facility and clarify how it interacts with the existing device status.

Thanks


>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* [virtio-dev] Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15  1:35                           ` Jason Wang
@ 2021-07-15  9:16                             ` Stefan Hajnoczi
  2021-07-16  1:44                               ` Jason Wang
  2021-07-15 10:01                             ` Stefan Hajnoczi
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-15  9:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 1301 bytes --]

On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > And as I've stated several times, virtqueue is the interface or transport
> > > which carries the commands for implementing specific semantics. It doesn't
> > > conflict with what is proposed in this patch.
> > The abstract operations for stopping the device and fetching virtqueue
> > state sound good to me, but I don't think a Device Status field STOP bit
> > should be added. An out-of-band stop operation would support devices
> > that take a long time to stop better.
> 
> 
> So the long time request is not something that is introduced by the STOP
> bit. The spec already uses that for reset.

Reset doesn't affect migration downtime. The register polling approach
is problematic during migration downtime because it's difficult to stop
multiple devices and do other downtime cleanup concurrently. Stopping
devices sequentially increases migration downtime, so I think the
interface should encourage concurrently stopping multiple devices.

I think you and Cornelia discussed that an interrupt could be added to
solve this problem. That would address my concerns about the STOP bit.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15  1:38                             ` Jason Wang
@ 2021-07-15  9:26                               ` Stefan Hajnoczi
  2021-07-16  1:48                                 ` Jason Wang
  2021-07-15 21:18                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-15  9:26 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 3023 bytes --]

On Thu, Jul 15, 2021 at 09:38:55AM +0800, Jason Wang wrote:
> 
> On 2021/7/15 12:22 AM, Max Gurtovoy wrote:
> > 
> > On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
> > > > It requires much more works than the simple virtqueue interface:
> > > > (the main
> > > > issues is that the function is not self-contained in a single function)
> > > > 
> > > > 1) how to interact with the existing device status state machine?
> > > > 2) how to make it work in a nested environment?
> > > > 3) how to migrate the PF?
> > > > 4) do we need to allow more control other than just stop/freeze
> > > > the device
> > > > in the admin virtqueue? If yes, how to handle the concurrent
> > > > access from PF
> > > > and VF?
> > > > 5) how it is expected to work with non-PCI virtio device?
> > > I guess your device splitting proposal addresses some of these things?
> > > 
> > > Max probably has the most to say about these points.
> > > 
> > > If you want more input I can try to answer too, but I personally am not
> > > developing devices that need this right now, so I might not be the best
> > > person to propose solutions.
> > 
> > I think we mentioned this in the past and agreed that the only common
> > entity between my solution for virtio VF migration to this proposal is
> > the new admin control queue.
> > 
> > I can prepare some draft for this.
> > 
> > In our solution the PF will manage the migration process for its VFs using
> > the PF admin queue. The PF is not migratable.
> 
> 
> That limits the use cases.
> 
> 
> > 
> > I don't know who is using nested environments in production so don't
> > know if it worth talking about that.
> 
> 
> There should be plenty of users for the nested case.

Yes, nested virtualization is becoming available in clouds, etc. I think
nested virtualization support should be part of the design.

> > 
> > But, if you would like to implement it for testing, no problem. The VF
> > in level n, probably seen as PF in level n+1. So it can manage the
> > migration process for its nested VFs.
> 
> 
> The PF dependency makes the design almost impossible to be used in a nested
> environment.

I'm not sure I understood Max's example, but first I want to check I
understand yours:

A physical PF is passed through to an L1 guest. L2 guests are assigned
VFs created by the L1 guest from the PF.

Now we want to live migrate the L1 guest to another host. We need to
migrate the PF and its VFs are automatically included since there is no
migration from the L2 perspective?

> > 
> > For question 5) what non-PCI devices are interesting in live migration ?
> > 
> 
> Why not? Virtio supports transports other than PCI (CCW, MMIO).

Yes, VIRTIO isn't tied to PCI and the migration functionality should be
mappable to other transports.

Luckily the admin virtqueue approach maps naturally to other transports.
What requires more thought is how the admin virtqueue is
enumerated/managed on those other transports.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15  1:35                           ` Jason Wang
  2021-07-15  9:16                             ` [virtio-dev] " Stefan Hajnoczi
@ 2021-07-15 10:01                             ` Stefan Hajnoczi
  2021-07-16  2:03                               ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-15 10:01 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 14472 bytes --]

On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> 
> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > >        If I understand correctly, this is all
> > > > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > > > Yes.
> > > > > > > > > > > > 
> > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > > > 
> > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > 
> > > > > > > > > > > Am I missing something?
> > > > > > > > > > > 
> > > > > > > > > > > Thanks!
> > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > It just requires a new status bit. Anything that makes you think it's hard
> > > > > > > > > to implement?
> > > > > > > > > 
> > > > > > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > > > > > virtqueue state.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > It uses common technology like register shadowing without any
> > > > > > > > > further stuff.
> > > > > > > > > 
> > > > > > > > > Or do you have any other ideas?
> > > > > > > > > 
> > > > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > > > those state registers).
> > > > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > > > until the device has stopped.
> > > > > > > Probably not. Let me to clarify several points:
> > > > > > > 
> > > > > > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > > > > > virtqueue could be used for carrying any basic device facility like status
> > > > > > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > > > > > for device slicing at virtio level.
> > > > > > > - Even if we had introduced admin virtqueue, we still need a per function
> > > > > > > interface for this. This is a must for nested virtualization, we can't
> > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > - According to the proposal, there's no need for the device to complete all
> > > > > > > the consumed buffers, device can choose to expose those inflight descriptors
> > > > > > > in a device specific way and set the STOP bit. This means, if we have the
> > > > > > > device specific in-flight descriptor reporting facility, the device can
> > > > > > > almost set the STOP bit immediately.
> > > > > > > - If we don't go with the basic device facility but using the admin
> > > > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > > > device status state machine, it will be some kind of sub-states which looks
> > > > > > > much more complicated than the current proposal.
> > > > > > > 
> > > > > > > 
> > > > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > > > time).
> > > > > > > Well. You need some kind of waiting for sure; the device/DMA needs some
> > > > > > > time to be stopped. The downtime is determined by the specific virtio
> > > > > > > implementation, which is hard to restrict at the spec level. We could
> > > > > > > clarify that the device must set the STOP bit within e.g. 100ms.
> > > > > > > 
> > > > > > > 
> > > > > > > >      It can be implemented concurrently (setting the STOP bit on all
> > > > > > > > devices and then looping until all their Device Status fields have the
> > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > I still don't get what kind of complexity did you worry here.
> > > > > > > 
> > > > > > > 
> > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > waiting...
> > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > 
> > > > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > > 
> > > > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > 
> > > > > > > Since it was required for at least one transport. We need do something
> > > > > > > similar to when introducing basic facility.
> > > > > > Adding the STOP but as a Device Status bit is a small and clean VIRTIO
> > > > > > spec change. I like that.
> > > > > > 
> > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > unbounded. For example, software virtio-blk/scsi implementations
> > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > 
> > > > > > The natural interface for long-running operations is virtqueue requests.
> > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > instead of a Device Status bit.
> > > > > So I'm not against the admin virtqueue. As said before, admin virtqueue
> > > > > could be used for carrying the device status bit.
> > > > > 
> > > > > Send a command to set STOP status bit to admin virtqueue. Device will make
> > > > > the command buffer used after it has successfully stopped the device.
> > > > > 
> > > > > AFAIK, they are not mutually exclusive, since they are trying to solve
> > > > > different problems.
> > > > > 
> > > > > Device status - basic device facility
> > > > > 
> > > > > Admin virtqueue - transport/device specific way to implement (part of) the
> > > > > device facility
> > > > > 
> > > > > > Although you mentioned that the stopped state needs to be reflected in
> > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > driver typically doesn't need to know whether the device is being
> > > > > > migrated.
> > > > > The guest won't see the real device status bit. VMM will shadow the device
> > > > > status bit in this case.
> > > > > 
> > > > > E.g with the current vhost-vDPA, vDPA behave like a vhost device, guest is
> > > > > unaware of the migration.
> > > > > 
> > > > > STOP status bit is set by Qemu to real virtio hardware. But guest will only
> > > > > see the DRIVER_OK without STOP.
> > > > > 
> > > > > It's not hard to implement the nested on top, see the discussion initiated
> > > > > by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
> > > > > migration.
> > > > > 
> > > > > 
> > > > > >     In fact, the VMM would need to hide this bit and it's safer to
> > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > See above, VMM may choose to hide or expose the capability. It's useful for
> > > > > migrating a nested guest.
> > > > > 
> > > > > If we design an interface that can't be used in the nested environment,
> > > > > it's not an ideal interface.
> > > > > 
> > > > > 
> > > > > > In addition, stateful devices need to load/save non-trivial amounts of
> > > > > > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > > > > > good fit again.
> > > > > I don't get the point here. You still need to address exactly the same
> > > > > issues for the admin virtqueue: the unbounded time to freeze the device,
> > > > > and the interaction with the virtio device status state machine.
> > > > Device state state can be large so a register interface would be a
> > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > saving/loading device state.
> > > 
> > > So this patch doesn't mandate a register interface, isn't it?
> > You're right, not this patch. I mentioned it because your other patch
> > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> > it as a register interface.
> > 
> > > And DMA
> > > doesn't means a virtqueue, it could be a transport specific method.
> > Yes, although virtqueues are a pretty good interface that works across
> > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > 
> > > I think we need to start from defining the state of one specific device and
> > > see what is the best interface.
> > virtio-blk might be the simplest. I think virtio-net has more device
> > state and virtio-scsi is definitely more complext than virtio-blk.
> > 
> > First we need agreement on whether "device state" encompasses the full
> > state of the device or just state that is unknown to the VMM.
> 
> 
> I think we've discussed this in the past. It can't work since:
> 
> 1) The state and its format must be clearly defined in the spec
> 2) We need to maintain migration compatibility and debug-ability

Some devices need implementation-specific state. They should still be
able to live migrate even if it means cross-implementation migration and
debug-ability is not possible.

> 3) Not a proper uAPI design

I never understood this argument. The Linux uAPI passes through lots of
opaque data from devices to userspace. Allowing an
implementation-specific device state representation is nothing new. VFIO
already does it.

> 
> 
> > That's
> > basically the difference between the vhost/vDPA's selective passthrough
> > approach and VFIO's full passthrough approach.
> 
> 
> We can't do VFIO full passthrough for migration anyway; some kind of mdev is
> required, but that duplicates the current vp_vdpa driver.

I'm not sure that's true. Generic VFIO PCI migration can probably be
achieved without mdev:
1. Define a migration PCI Capability that indicates support for
   VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
   the migration interface in hardware instead of an mdev driver.
2. The VMM either uses the migration PCI Capability directly from
   userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
   to userspace so migration can proceed in the same way as with
   VFIO/mdev drivers.
3. The PCI Capability is not passed through to the guest.

Changpeng Liu originally mentioned the idea of defining a migration PCI
Capability.

> >   For example, some of the
> > virtio-net state is available to the VMM with vhost/vDPA because it
> > intercepts the virtio-net control virtqueue.
> > 
> > Also, we need to decide to what degree the device state representation
> > is standardized in the VIRTIO specification.
> 
> 
> I think all the states must be defined in the spec otherwise the device
> can't claim it supports migration at virtio level.
> 
> 
> >   I think it makes sense to
> > standardize it if it's possible to convey all necessary state and device
> > implementors can easily implement this device state representation.
> 
> 
> I doubt it; this is highly device specific. E.g. can we standardize device
> (GPU) memory?

For devices that have little internal state it's possible to define a
standard device state representation.

For other devices, like virtio-crypto, virtio-fs, etc it becomes
difficult because the device implementation contains state that will be
needed but is very specific to the implementation. These devices *are*
migratable but they don't have standard state. Even here there is a
spectrum:
- Host OS-specific state (e.g. Linux struct file_handles)
- Library-specific state (e.g. crypto library state)
- Implementation-specific state (e.g. sshfs inode state for virtio-fs)

This is why I think it's necessary to support both standard device state
representations and implementation-specific device state
representations.

> > If
> > not, then device implementation-specific device state would be needed.
> 
> 
> Yes.
> 
> 
> > 
> > I think that's a larger discussion that deserves its own email thread.
> 
> 
> I agree, but it doesn't prevent us from starting from simple device that
> virtqueue state is sufficient (e.g virtio-net).

Right, stateless devices don't need to save/load device state, at least
when virtqueues are passed through selectively instead of full
passthrough.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15  1:38                             ` Jason Wang
  2021-07-15  9:26                               ` Stefan Hajnoczi
@ 2021-07-15 21:18                               ` Michael S. Tsirkin
  2021-07-16  2:19                                 ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Michael S. Tsirkin @ 2021-07-15 21:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, Stefan Hajnoczi, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic

On Thu, Jul 15, 2021 at 09:38:55AM +0800, Jason Wang wrote:
> 
> On 2021/7/15 12:22 AM, Max Gurtovoy wrote:
> > 
> > On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
> > > > It requires much more works than the simple virtqueue interface:
> > > > (the main
> > > > issues is that the function is not self-contained in a single function)
> > > > 
> > > > 1) how to interact with the existing device status state machine?
> > > > 2) how to make it work in a nested environment?
> > > > 3) how to migrate the PF?
> > > > 4) do we need to allow more control other than just stop/freeze
> > > > the device
> > > > in the admin virtqueue? If yes, how to handle the concurrent
> > > > access from PF
> > > > and VF?
> > > > 5) how it is expected to work with non-PCI virtio device?
> > > I guess your device splitting proposal addresses some of these things?
> > > 
> > > Max probably has the most to say about these points.
> > > 
> > > If you want more input I can try to answer too, but I personally am not
> > > developing devices that need this right now, so I might not be the best
> > > person to propose solutions.
> > 
> > I think we mentioned this in the past and agreed that the only common
> > entity between my solution for virtio VF migration to this proposal is
> > the new admin control queue.
> > 
> > I can prepare some draft for this.
> > 
> > In our solution the PF will manage the migration process for its VFs using
> > the PF admin queue. The PF is not migratable.
> 
> 
> That limits the use cases.
> 
> 
> > 
> > I don't know who is using nested environments in production so don't
> > know if it worth talking about that.
> 
> 
> There should be plenty of users for the nested case.
> 
> 
> > 
> > But, if you would like to implement it for testing, no problem. The VF
> > in level n, probably seen as PF in level n+1. So it can manage the
> > migration process for its nested VFs.
> 
> 
> The PF dependency makes the design almost impossible to be used in a nested
> environment.

Can't we have an emulated PF? This would be the most reasonable thing to
do.

> 
> > 
> > For question 5) what non-PCI devices are interesting in live migration ?
> > 
> 
> Why not? Virtio supports transports other than PCI (CCW, MMIO).
> 
> Thanks
> 


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15  9:16                             ` [virtio-dev] " Stefan Hajnoczi
@ 2021-07-16  1:44                               ` Jason Wang
  2021-07-19 12:18                                 ` [virtio-dev] " Stefan Hajnoczi
  2021-07-20 10:31                                 ` Cornelia Huck
  0 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-16  1:44 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/15 5:16 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>> And as I've stated several times, virtqueue is the interface or transport
>>>> which carries the commands for implementing specific semantics. It doesn't
>>>> conflict with what is proposed in this patch.
>>> The abstract operations for stopping the device and fetching virtqueue
>>> state sound good to me, but I don't think a Device Status field STOP bit
>>> should be added. An out-of-band stop operation would support devices
>>> that take a long time to stop better.
>>
>> So the long time request is not something that is introduced by the STOP
>> bit. The spec already uses that for reset.
> Reset doesn't affect migration downtime. The register polling approach
> is problematic during migration downtime because it's difficult to stop
> multiple devices and do other downtime cleanup concurrently.


This part I don't understand. We don't have a centralized control path 
that is used for each virtual function.

The VMM is free to stop multiple devices and then poll all of their 
status fields, no?


> Stopping
> devices sequentially increases migration downtime, so I think the
> interface should encourage concurrently stopping multiple devices.
>
> I think you and Cornelia discussed that an interrupt could be added to
> solve this problem. That would address my concerns about the STOP bit.


The problems are:

1) if we generate an interrupt after STOP, it breaks the STOP semantics, 
where the device should not generate any interrupt
2) if we generate an interrupt before STOP, we may end up with race 
conditions

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15  9:26                               ` Stefan Hajnoczi
@ 2021-07-16  1:48                                 ` Jason Wang
  2021-07-19 12:08                                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-16  1:48 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Max Gurtovoy, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/15 5:26 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 15, 2021 at 09:38:55AM +0800, Jason Wang wrote:
>> On 2021/7/15 12:22 AM, Max Gurtovoy wrote:
>>> On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
>>>>> It requires much more works than the simple virtqueue interface:
>>>>> (the main
>>>>> issues is that the function is not self-contained in a single function)
>>>>>
>>>>> 1) how to interact with the existing device status state machine?
>>>>> 2) how to make it work in a nested environment?
>>>>> 3) how to migrate the PF?
>>>>> 4) do we need to allow more control other than just stop/freeze
>>>>> the device
>>>>> in the admin virtqueue? If yes, how to handle the concurrent
>>>>> access from PF
>>>>> and VF?
>>>>> 5) how it is expected to work with non-PCI virtio device?
>>>> I guess your device splitting proposal addresses some of these things?
>>>>
>>>> Max probably has the most to say about these points.
>>>>
>>>> If you want more input I can try to answer too, but I personally am not
>>>> developing devices that need this right now, so I might not be the best
>>>> person to propose solutions.
>>> I think we mentioned this in the past and agreed that the only common
>>> entity between my solution for virtio VF migration to this proposal is
>>> the new admin control queue.
>>>
>>> I can prepare some draft for this.
>>>
>>> In our solution the PF will manage migration process for it's VFs using
>>> the PF admin queue. PF is not migratable.
>>
>> That limits the use cases.
>>
>>
>>> I don't know who is using nested environments in production so don't
>>> know if it worth talking about that.
>>
>> There should be plenty users for the nested case.
> Yes, nested virtualization is becoming available in clouds, etc. I think
> nested virtualization support should be part of the design.
>
>>> But, if you would like to implement it for testing, no problem. The VF
>>> in level n, probably seen as PF in level n+1. So it can manage the
>>> migration process for its nested VFs.
>>
>> The PF dependency makes the design almost impossible to be used in a nested
>> environment.
> I'm not sure I understood Max's example, but first I want to check I
> understand yours:
>
> A physical PF is passed through to an L1 guest. L2 guests are assigned
> VFs created by the L1 guest from the PF.
>
> Now we want to live migrate the L1 guest to another host. We need to
> migrate the PF and its VFs are automatically included since there is no
> migration from the L2 perspective?


Yes, and I believe the more common case is:

the PF belongs to L0, and we want to migrate an L2 guest.

This can hardly work in the current design.

The reason is that the function is not self-contained in the VF.


>
>>> For question 5) what non-PCI devices are interesting in live migration ?
>>>
>> Why not? Virtio support transport other than PCI (CCW, MMIO).
> Yes, VIRTIO isn't tied to PCI and the migration functionality should be
> mappable to other transports.
>
> Luckily the admin virtqueue approach maps naturally to other transports.
> What requires more thought is how the admin virtqueue is
> enumerated/managed on those other transports.


So an admin virtqueue is really one way to go, but we can't mandate it
in the spec. Sometimes it would be hard to define where the admin
virtqueue needs to be located, considering the transport may lack a
concept like the PF.

To me the most valuable part of the admin virtqueue is that it sits in 
the PF (or management device), where its DMA is naturally isolated.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15 10:01                             ` Stefan Hajnoczi
@ 2021-07-16  2:03                               ` Jason Wang
  2021-07-16  3:53                                 ` Jason Wang
  2021-07-19 12:43                                 ` Stefan Hajnoczi
  0 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-16  2:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>         If I understand correctly, this is all
>>>>>>>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>
>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>>>>>>>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>>>>>>>>>>>> guest's reads and writes, and present it as coherent and transparent
>>>>>>>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>>>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>>>>>
>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>>>>>>>> to implement?
>>>>>>>>>>
>>>>>>>>>> E.g for networking device, it should be sufficient to use this bit + the
>>>>>>>>>> virtqueue state.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>> It use the common technology like register shadowing without any further
>>>>>>>>>> stuffs.
>>>>>>>>>>
>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>
>>>>>>>>>> (I think we all know migration will be very hard if we simply pass through
>>>>>>>>>> those state registers).
>>>>>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>>>>>> until the device has stopped.
>>>>>>>> Probably not. Let me to clarify several points:
>>>>>>>>
>>>>>>>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>>>>>>>> virtqueue could be used for carrying any basic device facility like status
>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>>>>>>>> for device slicing at virtio level.
>>>>>>>> - Even if we had introduced admin virtqueue, we still need a per function
>>>>>>>> interface for this. This is a must for nested virtualization, we can't
>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>> - According to the proposal, there's no need for the device to complete all
>>>>>>>> the consumed buffers, device can choose to expose those inflight descriptors
>>>>>>>> in a device specific way and set the STOP bit. This means, if we have the
>>>>>>>> device specific in-flight descriptor reporting facility, the device can
>>>>>>>> almost set the STOP bit immediately.
>>>>>>>> - If we don't go with the basic device facility but using the admin
>>>>>>>> virtqueue specific method, we still need to clarify how it works with the
>>>>>>>> device status state machine, it will be some kind of sub-states which looks
>>>>>>>> much more complicated than the current proposal.
>>>>>>>>
>>>>>>>>
>>>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>>>> time).
>>>>>>>> Well. You need some kinds of waiting for sure, the device/DMA needs sometime
>>>>>>>> to be stopped. The downtime is determined by a specific virtio
>>>>>>>> implementation which is hard to be restricted at the spec level. We can
>>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>>>
>>>>>>>>
>>>>>>>>>       It can be implemented concurrently (setting the STOP bit on all
>>>>>>>>> devices and then looping until all their Device Status fields have the
>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>> I still don't get what kind of complexity did you worry here.
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>> waiting...
>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>
>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>>>>
>>>>>>>> After writing 0 to device_status, the driver MUST wait for a read of
>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>
>>>>>>>> Since it was required for at least one transport. We need do something
>>>>>>>> similar to when introducing basic facility.
>>>>>>> Adding the STOP but as a Device Status bit is a small and clean VIRTIO
>>>>>>> spec change. I like that.
>>>>>>>
>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>> unbounded. For example, software virtio-blk/scsi implementations since
>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>
>>>>>>> The natural interface for long-running operations is virtqueue requests.
>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>> instead of a Device Status bit.
>>>>>> So I'm not against the admin virtqueue. As said before, admin virtqueue
>>>>>> could be used for carrying the device status bit.
>>>>>>
>>>>>> Send a command to set STOP status bit to admin virtqueue. Device will make
>>>>>> the command buffer used after it has successfully stopped the device.
>>>>>>
>>>>>> AFAIK, they are not mutually exclusive, since they are trying to solve
>>>>>> different problems.
>>>>>>
>>>>>> Device status - basic device facility
>>>>>>
>>>>>> Admin virtqueue - transport/device specific way to implement (part of) the
>>>>>> device facility
>>>>>>
>>>>>>> Although you mentioned that the stopped state needs to be reflected in
>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>> migrated.
>>>>>> The guest won't see the real device status bit. VMM will shadow the device
>>>>>> status bit in this case.
>>>>>>
>>>>>> E.g with the current vhost-vDPA, vDPA behave like a vhost device, guest is
>>>>>> unaware of the migration.
>>>>>>
>>>>>> STOP status bit is set by Qemu to real virtio hardware. But guest will only
>>>>>> see the DRIVER_OK without STOP.
>>>>>>
>>>>>> It's not hard to implement the nested on top, see the discussion initiated
>>>>>> by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
>>>>>> migration.
>>>>>>
>>>>>>
>>>>>>>      In fact, the VMM would need to hide this bit and it's safer to
>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>> See above, VMM may choose to hide or expose the capability. It's useful for
>>>>>> migrating a nested guest.
>>>>>>
>>>>>> If we design an interface that can be used in the nested environment, it's
>>>>>> not an ideal interface.
>>>>>>
>>>>>>
>>>>>>> In addition, stateful devices need to load/save non-trivial amounts of
>>>>>>> data. They need DMA to do this efficiently, so an admin virtqueue is a
>>>>>>> good fit again.
>>>>>> I don't get the point here. You still need to address the exact the similar
>>>>>> issues for admin virtqueue: the unbound time in freezing the device, the
>>>>>> interaction with the virtio device status state machine.
>>>>> Device state state can be large so a register interface would be a
>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>> saving/loading device state.
>>>> So this patch doesn't mandate a register interface, isn't it?
>>> You're right, not this patch. I mentioned it because your other patch
>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
>>> it a register interface.
>>>
>>>> And DMA
>>>> doesn't means a virtqueue, it could be a transport specific method.
>>> Yes, although virtqueues are a pretty good interface that works across
>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>
>>>> I think we need to start from defining the state of one specific device and
>>>> see what is the best interface.
>>> virtio-blk might be the simplest. I think virtio-net has more device
>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>
>>> First we need agreement on whether "device state" encompasses the full
>>> state of the device or just state that is unknown to the VMM.
>>
>> I think we've discussed this in the past. It can't work since:
>>
>> 1) The state and its format must be clearly defined in the spec
>> 2) We need to maintain migration compatibility and debug-ability
> Some devices need implementation-specific state. They should still be
> able to live migrate even if it means cross-implementation migration and
> debug-ability is not possible.


I think we need to revisit this conclusion. Migration compatibility is 
pretty important, especially considering the software stack has spent a 
huge amount of effort maintaining it.

If a virtio hardware implementation broke this, we would lose all the 
advantages of being a standard device.

If we can't do live migration between:

1) different backends, e.g. migrating from virtio hardware to virtio software
2) different vendors

we fail as a standard device, and the customer is in fact implicitly 
locked in by the vendor.


>
>> 3) Not a proper uAPI desgin
> I never understood this argument. The Linux uAPI passes through lots of
> opaque data from devices to userspace. Allowing an
> implementation-specific device state representation is nothing new. VFIO
> already does it.


I think we've already had a lot of discussion about VFIO but without a 
conclusion. Maybe we need a verdict from Linus or Greg (not sure if 
it's too late). But that's not related to virtio or this thread.

What you propose here is in conflict with the efforts of virtio. I 
think we all agree that we should define the state in the spec. 
Assuming this is correct:

1) why would we still offer opaque migration state to userspace?
2) how can it be integrated into the current VMM (QEMU) virtio devices' 
migration byte stream?

We should standardize everything that is visible to the driver in order 
to be a standard device. That's the power of virtio.


>
>>
>>> That's
>>> basically the difference between the vhost/vDPA's selective passthrough
>>> approach and VFIO's full passthrough approach.
>>
>> We can't do VFIO full pasthrough for migration anyway, some kind of mdev is
>> required but it's duplicated with the current vp_vdpa driver.
> I'm not sure that's true. Generic VFIO PCI migration can probably be
> achieved without mdev:
> 1. Define a migration PCI Capability that indicates support for
>     VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>     the migration interface in hardware instead of an mdev driver.


So I think it still depends on the driver to implement the migration 
state, which is vendor specific.

Note that it's just a uAPI definition, not something defined in the PCI 
spec.

Incidentally, the patch was merged without any real users in Linux. 
This is very bad since we lose the chance to audit the whole design.


> 2. The VMM either uses the migration PCI Capability directly from
>     userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
>     to userspace so migration can proceed in the same way as with
>     VFIO/mdev drivers.
> 3. The PCI Capability is not passed through to the guest.


This brings trouble in a nested environment.

Thanks


>
> Changpeng Liu originally mentioned the idea of defining a migration PCI
> Capability.
>
>>>    For example, some of the
>>> virtio-net state is available to the VMM with vhost/vDPA because it
>>> intercepts the virtio-net control virtqueue.
>>>
>>> Also, we need to decide to what degree the device state representation
>>> is standardized in the VIRTIO specification.
>>
>> I think all the states must be defined in the spec otherwise the device
>> can't claim it supports migration at virtio level.
>>
>>
>>>    I think it makes sense to
>>> standardize it if it's possible to convey all necessary state and device
>>> implementors can easily implement this device state representation.
>>
>> I doubt it's high device specific. E.g can we standardize device(GPU)
>> memory?
> For devices that have little internal state it's possible to define a
> standard device state representation.
>
> For other devices, like virtio-crypto, virtio-fs, etc it becomes
> difficult because the device implementation contains state that will be
> needed but is very specific to the implementation. These devices *are*
> migratable but they don't have standard state. Even here there is a
> spectrum:
> - Host OS-specific state (e.g. Linux struct file_handles)
> - Library-specific state (e.g. crypto library state)
> - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
>
> This is why I think it's necessary to support both standard device state
> representations and implementation-specific device state
> representations.
>
>>> If
>>> not, then device implementation-specific device state would be needed.
>>
>> Yes.
>>
>>
>>> I think that's a larger discussion that deserves its own email thread.
>>
>> I agree, but it doesn't prevent us from starting from simple device that
>> virtqueue state is sufficient (e.g virtio-net).
> Right, stateless devices don't need to save/load device state, at least
> when virtqueues are passed through selectively instead of full
> passthrough.
>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-15 21:18                               ` Michael S. Tsirkin
@ 2021-07-16  2:19                                 ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-16  2:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Max Gurtovoy, Stefan Hajnoczi, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/16 5:18 AM, Michael S. Tsirkin wrote:
> On Thu, Jul 15, 2021 at 09:38:55AM +0800, Jason Wang wrote:
>> On 2021/7/15 12:22 AM, Max Gurtovoy wrote:
>>> On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
>>>>> It requires much more works than the simple virtqueue interface:
>>>>> (the main
>>>>> issues is that the function is not self-contained in a single function)
>>>>>
>>>>> 1) how to interact with the existing device status state machine?
>>>>> 2) how to make it work in a nested environment?
>>>>> 3) how to migrate the PF?
>>>>> 4) do we need to allow more control other than just stop/freeze
>>>>> the device
>>>>> in the admin virtqueue? If yes, how to handle the concurrent
>>>>> access from PF
>>>>> and VF?
>>>>> 5) how it is expected to work with non-PCI virtio device?
>>>> I guess your device splitting proposal addresses some of these things?
>>>>
>>>> Max probably has the most to say about these points.
>>>>
>>>> If you want more input I can try to answer too, but I personally am not
>>>> developing devices that need this right now, so I might not be the best
>>>> person to propose solutions.
>>> I think we mentioned this in the past and agreed that the only common
>>> entity between my solution for virtio VF migration to this proposal is
>>> the new admin control queue.
>>>
>>> I can prepare some draft for this.
>>>
>>> In our solution the PF will manage migration process for it's VFs using
>>> the PF admin queue. PF is not migratable.
>>
>> That limits the use cases.
>>
>>
>>> I don't know who is using nested environments in production so don't
>>> know if it worth talking about that.
>>
>> There should be plenty users for the nested case.
>>
>>
>>> But, if you would like to implement it for testing, no problem. The VF
>>> in level n, probably seen as PF in level n+1. So it can manage the
>>> migration process for its nested VFs.
>>
>> The PF dependency makes the design almost impossible to be used in a nested
>> environment.
> Can't we have an emulated PF? This would be the most reasonable thing to
> do.


Isn't this much more complicated?

- the VMM needs to emulate the PF
- that means we need support for PF live migration
- if we want to live migrate a VF in L(x), we need a PF (physical or 
emulated) at every layer from L0 to L(x-1)
- it complicates the management stack

If we really want to go with admin vq + PF, it should be usable through 
the per-VF function as well. That is, hide the admin vq + PF from all of 
the above layers:

L0 virtio device: admin vq + PF
L1 virtio device: the L0 VMM presents a function that is self-contained 
in a single function, implemented via (admin vq + PF)
...
Lx virtio device: the L(x-1) VMM presents a function that is 
self-contained in a single function

Put another way, if every function is self-contained, we get an even 
simpler model:

Lx virtio device: the L(x-1) VMM presents a function that is 
self-contained in a single function

But the OS should hide the implementation details of how the device is 
mediated and expose a unified uAPI to the userspace VMM (QEMU). That is 
what vDPA tries to achieve: providing a single device-centric uAPI. This 
is another argument for having a self-contained per-device (per-function) 
interface for all of this. It would be very complicated to introduce a 
uAPI for the PF (or whatever it is called), and we actually can't 
mandate one even if we invent it. We've already had vendors that 
implement the virtqueue state via a VF BAR (capability).

Thanks


>
>>> For question 5) what non-PCI devices are interesting in live migration ?
>>>
>> Why not? Virtio support transport other than PCI (CCW, MMIO).
>>
>> Thanks
>>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-16  2:03                               ` Jason Wang
@ 2021-07-16  3:53                                 ` Jason Wang
  2021-07-19 12:45                                   ` Stefan Hajnoczi
  2021-07-19 12:43                                 ` Stefan Hajnoczi
  1 sibling, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-16  3:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/16 10:03 AM, Jason Wang wrote:
>
>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez 
>>>>>>>>>>>> Martin wrote:
>>>>>>>>>>>>>>>         If I understand correctly, this is all
>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this to 
>>>>>>>>>>>>>>> work
>>>>>>>>>>>>>>> the guest must be running and already have initialised 
>>>>>>>>>>>>>>> the driver.
>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM 
>>>>>>>>>>>>> as long as
>>>>>>>>>>>>> it intercept the relevant configuration space (PCI, MMIO, 
>>>>>>>>>>>>> etc) from
>>>>>>>>>>>>> guest's reads and writes, and present it as coherent and 
>>>>>>>>>>>>> transparent
>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a 
>>>>>>>>>>>>> physical device (or
>>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest 
>>>>>>>>>>>>> cannot stop
>>>>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can 
>>>>>>>>>>>>> stop the device.
>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, 
>>>>>>>>>>>>> and the
>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. It 
>>>>>>>>>>>>> resets the
>>>>>>>>>>>>> destination device with the state, and then initializes 
>>>>>>>>>>>>> the device.
>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, 
>>>>>>>>>>>>> the source
>>>>>>>>>>>>> VMM migrates the device status. The destination VMM 
>>>>>>>>>>>>> realizes the bit,
>>>>>>>>>>>>> so it sets the bit in the destination too after device 
>>>>>>>>>>>>> initialization.
>>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it 
>>>>>>>>>>>>> doesn't matter
>>>>>>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump 
>>>>>>>>>>>> through though.
>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>> It just requires a new status bit. Anything that makes you 
>>>>>>>>>>> think it's hard
>>>>>>>>>>> to implement?
>>>>>>>>>>>
>>>>>>>>>>> E.g for networking device, it should be sufficient to use 
>>>>>>>>>>> this bit + the
>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Why don't we design the feature in a way that is useable by 
>>>>>>>>>>>> VMMs
>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>> It use the common technology like register shadowing without 
>>>>>>>>>>> any further
>>>>>>>>>>> stuffs.
>>>>>>>>>>>
>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>
>>>>>>>>>>> (I think we all know migration will be very hard if we 
>>>>>>>>>>> simply pass through
>>>>>>>>>>> those state registers).
>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device 
>>>>>>>>>> Status field
>>>>>>>>>> bit then there's no need to re-read the Device Status field 
>>>>>>>>>> in a loop
>>>>>>>>>> until the device has stopped.
>>>>>>>>> Probably not. Let me to clarify several points:
>>>>>>>>>
>>>>>>>>> - This proposal has nothing to do with admin virtqueue. 
>>>>>>>>> Actually, admin
>>>>>>>>> virtqueue could be used for carrying any basic device facility 
>>>>>>>>> like status
>>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue as 
>>>>>>>>> a "transport"
>>>>>>>>> for device slicing at virtio level.
>>>>>>>>> - Even if we had introduced admin virtqueue, we still need a 
>>>>>>>>> per function
>>>>>>>>> interface for this. This is a must for nested virtualization, 
>>>>>>>>> we can't
>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>> - According to the proposal, there's no need for the device to 
>>>>>>>>> complete all
>>>>>>>>> the consumed buffers, device can choose to expose those 
>>>>>>>>> inflight descriptors
>>>>>>>>> in a device specific way and set the STOP bit. This means, if 
>>>>>>>>> we have the
>>>>>>>>> device specific in-flight descriptor reporting facility, the 
>>>>>>>>> device can
>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>> - If we don't go with the basic device facility but using the 
>>>>>>>>> admin
>>>>>>>>> virtqueue specific method, we still need to clarify how it 
>>>>>>>>> works with the
>>>>>>>>> device status state machine, it will be some kind of 
>>>>>>>>> sub-states which looks
>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy 
>>>>>>>>>> waiting approach
>>>>>>>>>> extends downtime if implemented sequentially (stopping one 
>>>>>>>>>> device at a
>>>>>>>>>> time).
>>>>>>>>> Well. You need some kinds of waiting for sure, the device/DMA 
>>>>>>>>> needs sometime
>>>>>>>>> to be stopped. The downtime is determined by a specific virtio
>>>>>>>>> implementation which is hard to be restricted at the spec 
>>>>>>>>> level. We can
>>>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>       It can be implemented concurrently (setting the STOP 
>>>>>>>>>> bit on all
>>>>>>>>>> devices and then looping until all their Device Status fields 
>>>>>>>>>> have the
>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>> I still don't get what kind of complexity did you worry here.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>> waiting...
>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>
>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure 
>>>>>>>>> layout
>>>>>>>>>
>>>>>>>>> After writing 0 to device_status, the driver MUST wait for a 
>>>>>>>>> read of
>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>
>>>>>>>>> Since it was required for at least one transport. We need do 
>>>>>>>>> something
>>>>>>>>> similar to when introducing basic facility.
>>>>>>>> Adding the STOP but as a Device Status bit is a small and clean 
>>>>>>>> VIRTIO
>>>>>>>> spec change. I like that.
>>>>>>>>
>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>> unbounded. For example, software virtio-blk/scsi 
>>>>>>>> implementations since
>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>
>>>>>>>> The natural interface for long-running operations is virtqueue 
>>>>>>>> requests.
>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>> instead of a Device Status bit.
>>>>>>> So I'm not against the admin virtqueue. As said before, admin 
>>>>>>> virtqueue
>>>>>>> could be used for carrying the device status bit.
>>>>>>>
>>>>>>> Send a command to set STOP status bit to admin virtqueue. Device 
>>>>>>> will make
>>>>>>> the command buffer used after it has successfully stopped the 
>>>>>>> device.
>>>>>>>
>>>>>>> AFAIK, they are not mutually exclusive, since they are trying to 
>>>>>>> solve
>>>>>>> different problems.
>>>>>>>
>>>>>>> Device status - basic device facility
>>>>>>>
>>>>>>> Admin virtqueue - transport/device specific way to implement 
>>>>>>> (part of) the
>>>>>>> device facility
>>>>>>>
>>>>>>>> Although you mentioned that the stopped state needs to be 
>>>>>>>> reflected in
>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>> migrated.
>>>>>>> The guest won't see the real device status bit. VMM will shadow 
>>>>>>> the device
>>>>>>> status bit in this case.
>>>>>>>
>>>>>>> E.g with the current vhost-vDPA, vDPA behave like a vhost 
>>>>>>> device, guest is
>>>>>>> unaware of the migration.
>>>>>>>
>>>>>>> STOP status bit is set by Qemu to real virtio hardware. But 
>>>>>>> guest will only
>>>>>>> see the DRIVER_OK without STOP.
>>>>>>>
>>>>>>> It's not hard to implement the nested on top, see the discussion 
>>>>>>> initiated
>>>>>>> by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
>>>>>>> migration.
>>>>>>>
>>>>>>>
>>>>>>>>      In fact, the VMM would need to hide this bit and it's 
>>>>>>>> safer to
>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>> See above, VMM may choose to hide or expose the capability. It's 
>>>>>>> useful for
>>>>>>> migrating a nested guest.
>>>>>>>
>>>>>>> If we design an interface that can't be used in the nested
>>>>>>> environment, it's not an ideal interface.
>>>>>>>
>>>>>>>
>>>>>>>> In addition, stateful devices need to load/save non-trivial 
>>>>>>>> amounts of
>>>>>>>> data. They need DMA to do this efficiently, so an admin 
>>>>>>>> virtqueue is a
>>>>>>>> good fit again.
>>>>>>> I don't get the point here. You still need to address similar
>>>>>>> issues for the admin virtqueue: the unbounded time to freeze the
>>>>>>> device, and the interaction with the virtio device status state
>>>>>>> machine.
>>>>>> Device state can be large so a register interface would be a
>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>> saving/loading device state.
>>>>> So this patch doesn't mandate a register interface, does it?
>>>> You're right, not this patch. I mentioned it because your other patch
>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
>>>> implements it as a register interface.
>>>>
>>>>> And DMA
>>>>> doesn't mean a virtqueue, it could be a transport specific method.
>>>> Yes, although virtqueues are a pretty good interface that works across
>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>
>>>>> I think we need to start from defining the state of one specific 
>>>>> device and
>>>>> see what is the best interface.
>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>
>>>> First we need agreement on whether "device state" encompasses the full
>>>> state of the device or just state that is unknown to the VMM.
>>>
>>> I think we've discussed this in the past. It can't work since:
>>>
>>> 1) The state and its format must be clearly defined in the spec
>>> 2) We need to maintain migration compatibility and debug-ability
>> Some devices need implementation-specific state. They should still be
>> able to live migrate even if it means cross-implementation migration and
>> debug-ability is not possible.
>
>
> I think we need to re-visit this conclusion. Migration compatibility 
> is pretty important, especially considering the software stack has 
> spent a huge amount of effort in maintaining it.
>
> If virtio hardware were to break this, we would lose all the 
> advantages of being a standard device.
>
> If we can't do live migration among:
>
> 1) different backends, e.g. migrating from virtio hardware to virtio 
> software
> 2) different vendors
>
> We would fail as a standard device and the customer would in fact be 
> implicitly locked to the vendor.
>
>
>>
>>> 3) Not a proper uAPI design
>> I never understood this argument. The Linux uAPI passes through lots of
>> opaque data from devices to userspace. Allowing an
>> implementation-specific device state representation is nothing new. VFIO
>> already does it.
>
>
> I think we've already had a lot of discussion about VFIO but without a 
> conclusion. Maybe we need the verdict from Linus or Greg (not sure if 
> it's too late). But that's not related to virtio and this thread.
>
> What you propose here kind of conflicts with the efforts of virtio. 
> I think we all agree that we should define the state in the spec. 
> Assuming this is correct:
>
> 1) why do we still offer opaque migration state to userspace?
> 2) how can it be integrated into the current VMM (Qemu) virtio 
> devices' migration bytes stream?
>
> We should standardize everything that is visible to the driver in 
> order to be a standard device. That's the power of virtio.
>
>
>>
>>>
>>>> That's
>>>> basically the difference between the vhost/vDPA's selective 
>>>> passthrough
>>>> approach and VFIO's full passthrough approach.
>>>
>>> We can't do VFIO full passthrough for migration anyway; some kind 
>>> of mdev is required, but that duplicates the current vp_vdpa driver.
>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>> achieved without mdev:
>> 1. Define a migration PCI Capability that indicates support for
>>     VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>     the migration interface in hardware instead of an mdev driver.
>
>
> So I think it still depends on the driver to implement migration 
> state, which is vendor specific.
>
> Note that it's just a uAPI definition, not something defined in the 
> PCI spec.
>
> Out of curiosity, the patch was merged without any real users in 
> Linux. This is very bad since we lose the chance to audit the whole 
> design.
>
>
>> 2. The VMM either uses the migration PCI Capability directly from
>>     userspace or core VFIO PCI code advertises 
>> VFIO_REGION_TYPE_MIGRATION
>>     to userspace so migration can proceed in the same way as with
>>     VFIO/mdev drivers.
>> 3. The PCI Capability is not passed through to the guest.
>
>
> This brings troubles in the nested environment.
>
> Thanks
>
>
>>
>> Changpeng Liu originally mentioned the idea of defining a migration PCI
>> Capability.
>>
>>>>    For example, some of the
>>>> virtio-net state is available to the VMM with vhost/vDPA because it
>>>> intercepts the virtio-net control virtqueue.
>>>>
>>>> Also, we need to decide to what degree the device state representation
>>>> is standardized in the VIRTIO specification.
>>>
>>> I think all the states must be defined in the spec otherwise the device
>>> can't claim it supports migration at virtio level.
>>>
>>>
>>>>    I think it makes sense to
>>>> standardize it if it's possible to convey all necessary state and 
>>>> device
>>>> implementors can easily implement this device state representation.
>>>
>>> I suspect it's highly device specific. E.g. can we standardize
>>> device (GPU) memory?
>> For devices that have little internal state it's possible to define a
>> standard device state representation.
>>
>> For other devices, like virtio-crypto, virtio-fs, etc it becomes
>> difficult because the device implementation contains state that will be
>> needed but is very specific to the implementation. These devices *are*
>> migratable but they don't have standard state. Even here there is a
>> spectrum:
>> - Host OS-specific state (e.g. Linux struct file_handles)
>> - Library-specific state (e.g. crypto library state)
>> - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
>>
>> This is why I think it's necessary to support both standard device state
>> representations and implementation-specific device state
>> representations.


Having two ways will bring extra complexity. That's why I suggest:

- to have a general facility for the virtqueue to be migrated
- to leave the device specific state device specific, so the device 
can choose whatever interface is convenient.
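To make the first point concrete, here is a toy model of saving and restoring split-virtqueue state across a migration, assuming the state boils down to the device's next available index and its used index (the field and function names below are invented for illustration, not taken from the spec or the patches):

```python
# Toy model of migrating split-virtqueue state. A sketch only: the real
# facility is transport specific; names here are illustrative.
from dataclasses import dataclass

@dataclass
class SplitVQState:
    last_avail_idx: int  # next avail ring entry the device will process
    used_idx: int        # device's used index (read-only for the driver)

class Device:
    def __init__(self):
        self.vq = SplitVQState(0, 0)

    def process(self, n):
        # Device consumes n buffers and marks all of them used.
        self.vq.last_avail_idx += n
        self.vq.used_idx += n

    def save_vq_state(self):
        # Only valid once the device has stopped.
        return SplitVQState(self.vq.last_avail_idx, self.vq.used_idx)

    def load_vq_state(self, s):
        # Only valid after FEATURES_OK but before DRIVER_OK.
        self.vq = SplitVQState(s.last_avail_idx, s.used_idx)

src = Device()
src.process(5)
state = src.save_vq_state()   # source side, after stopping

dst = Device()
dst.load_vq_state(state)      # destination side, before DRIVER_OK
assert dst.vq == SplitVQState(5, 5)
```

Device specific state (e.g. in-flight requests) would be saved and restored by a separate, device specific mechanism on top of this.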

Thanks


>>
>>>> If
>>>> not, then device implementation-specific device state would be needed.
>>>
>>> Yes.
>>>
>>>
>>>> I think that's a larger discussion that deserves its own email thread.
>>>
>>> I agree, but it doesn't prevent us from starting with a simple
>>> device for which the virtqueue state is sufficient (e.g. virtio-net).
>> Right, stateless devices don't need to save/load device state, at least
>> when virtqueues are passed through selectively instead of full
>> passthrough.
>>
>> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-16  1:48                                 ` Jason Wang
@ 2021-07-19 12:08                                   ` Stefan Hajnoczi
  2021-07-20  2:46                                     ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-19 12:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Max Gurtovoy, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 3323 bytes --]

On Fri, Jul 16, 2021 at 09:48:43AM +0800, Jason Wang wrote:
> 
> On 2021/7/15 5:26 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 15, 2021 at 09:38:55AM +0800, Jason Wang wrote:
> > > On 2021/7/15 12:22 AM, Max Gurtovoy wrote:
> > > > On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
> > > > > > It requires much more work than the simple virtqueue interface:
> > > > > > (the main issue is that the function is not self-contained in a
> > > > > > single function)
> > > > > > 
> > > > > > 1) how to interact with the existing device status state machine?
> > > > > > 2) how to make it work in a nested environment?
> > > > > > 3) how to migrate the PF?
> > > > > > 4) do we need to allow more control other than just stop/freeze
> > > > > > the device
> > > > > > in the admin virtqueue? If yes, how to handle the concurrent
> > > > > > access from PF
> > > > > > and VF?
> > > > > > 5) how it is expected to work with non-PCI virtio device?
> > > > > I guess your device splitting proposal addresses some of these things?
> > > > > 
> > > > > Max probably has the most to say about these points.
> > > > > 
> > > > > If you want more input I can try to answer too, but I personally am not
> > > > > developing devices that need this right now, so I might not be the best
> > > > > person to propose solutions.
> > > > I think we mentioned this in the past and agreed that the only common
> > > > entity between my solution for virtio VF migration and this proposal
> > > > is the new admin control queue.
> > > > 
> > > > I can prepare some draft for this.
> > > > 
> > > > In our solution the PF will manage the migration process for its VFs
> > > > using the PF admin queue. The PF is not migratable.
> > > 
> > > That limits the use cases.
> > > 
> > > 
> > > > I don't know who is using nested environments in production so I
> > > > don't know if it's worth talking about that.
> > > 
> > > There should be plenty of users for the nested case.
> > Yes, nested virtualization is becoming available in clouds, etc. I think
> > nested virtualization support should be part of the design.
> > 
> > > > But, if you would like to implement it for testing, no problem. The
> > > > VF at level n is probably seen as a PF at level n+1, so it can manage
> > > > the migration process for its nested VFs.
> > > 
> > > The PF dependency makes the design almost impossible to use in a nested
> > > environment.
> > I'm not sure I understood Max's example, but first I want to check I
> > understand yours:
> > 
> > A physical PF is passed through to an L1 guest. L2 guests are assigned
> > VFs created by the L1 guest from the PF.
> > 
> > Now we want to live migrate the L1 guest to another host. We need to
> > migrate the PF and its VFs are automatically included since there is no
> > migration from the L2 perspective?
> 
> 
> Yes, and I believe the more common case is:
> 
> The PF is for L0, and we want to migrate an L2 guest.
> 
> This can hardly work in the current design.
> 
> The reason is that the function is not self-contained in the VF.

Thanks for highlighting this case. It requires that the mechanism for
stopping and saving/loading state comes with the VF so the L1 guest can
perform live migration even though it does not have L0 PF access.
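A toy model of this constraint (the class and function names here are invented for illustration): if stop/save only exists on the L0 PF, an L1 guest that was assigned a VF but not the PF cannot migrate its L2 guests, whereas a self-contained per-VF interface works at any nesting level.

```python
# Sketch of why the migration interface must be self-contained in the VF.
class VF:
    def __init__(self, has_selfcontained_iface):
        self.has_selfcontained_iface = has_selfcontained_iface
        self.stopped = False

def l1_stop_for_migration(vf, pf_handle=None):
    """L1 tries to stop a VF for migrating an L2 guest.

    In the nested case L1 has no L0 PF access, so pf_handle is None.
    """
    if vf.has_selfcontained_iface:
        vf.stopped = True          # e.g. a STOP mechanism on the VF itself
        return True
    # Otherwise stopping requires the PF admin queue, which L1 cannot reach.
    return pf_handle is not None

# Self-contained VF interface: migration works without the PF.
assert l1_stop_for_migration(VF(True)) is True
# PF-only interface: migration fails in the nested case.
assert l1_stop_for_migration(VF(False)) is False
```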

Stefan



* Re: [virtio-dev] Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-16  1:44                               ` Jason Wang
@ 2021-07-19 12:18                                 ` Stefan Hajnoczi
  2021-07-20  2:50                                   ` Jason Wang
  2021-07-20 10:31                                 ` Cornelia Huck
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-19 12:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 1558 bytes --]

On Fri, Jul 16, 2021 at 09:44:26AM +0800, Jason Wang wrote:
> 
> On 2021/7/15 5:16 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > And as I've stated several times, virtqueue is the interface or transport
> > > > > which carries the commands for implementing specific semantics. It doesn't
> > > > > conflict with what is proposed in this patch.
> > > > The abstract operations for stopping the device and fetching virtqueue
> > > > state sound good to me, but I don't think a Device Status field STOP bit
> > > > should be added. An out-of-band stop operation would support devices
> > > > that take a long time to stop better.
> > > 
> > > So long-running requests are not something introduced by the STOP
> > > bit. The spec already uses that for reset.
> > Reset doesn't affect migration downtime. The register polling approach
> > is problematic during migration downtime because it's difficult to stop
> > multiple devices and do other downtime cleanup concurrently.
> 
> 
> This part I don't understand. We don't have a centralized control path
> that is used for each virtual function.
> 
> The VMM is free to stop multiple devices and poll all of their device statuses?

Yes, it's possible to do that but I think it's harder for VMMs to
implement and consumes CPU (which competes with software devices that
are trying to stop).
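A sketch of the concurrent variant being discussed (the bit value 32 follows the "STOP(32)" mentioned elsewhere in this thread; the class and timing details are invented for illustration): the VMM requests STOP on every device first, then busy-polls all of them, so downtime is roughly the maximum stop time rather than the sum, at the cost of the poll loop itself burning CPU.

```python
# Toy model of concurrently stopping several devices via a STOP status bit.
import time

STOP = 32  # proposed Device Status bit value, per "STOP(32)" in the thread

class Dev:
    """Simulated device that needs stop_delay seconds to quiesce."""
    def __init__(self, stop_delay):
        self.status = 0
        self._delay = stop_delay
        self._deadline = None

    def write_status(self, val):
        # The driver requests STOP; the device latches it only when ready.
        self.status = val & ~STOP
        self._deadline = time.monotonic() + self._delay

    def read_status(self):
        if self._deadline is not None and time.monotonic() >= self._deadline:
            self.status |= STOP
        return self.status

devs = [Dev(0.01), Dev(0.02), Dev(0.005)]

for d in devs:
    d.write_status(STOP)         # 1) request stop on all devices first

pending = set(devs)
while pending:                   # 2) busy-poll until every device reports STOP
    pending = {d for d in pending if not (d.read_status() & STOP)}

assert all(d.read_status() & STOP for d in devs)
```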

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-16  2:03                               ` Jason Wang
  2021-07-16  3:53                                 ` Jason Wang
@ 2021-07-19 12:43                                 ` Stefan Hajnoczi
  2021-07-20  3:02                                   ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-19 12:43 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 16782 bytes --]

On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
> 
> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > >         If I understand correctly, this is all
> > > > > > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > > > it intercepts the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > > > > > the guest's reads and writes, and presents it as coherent and transparent
> > > > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > > > > > vp_vdpa device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > > > what bit the HW has, but the VM can be migrated.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > It just requires a new status bit. Is there anything that makes you
> > > > > > > > > > > think it's hard to implement?
> > > > > > > > > > > 
> > > > > > > > > > > E.g. for a networking device, it should be sufficient to use this bit +
> > > > > > > > > > > the virtqueue state.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > It uses common technology like register shadowing without any further
> > > > > > > > > > > stuff.
> > > > > > > > > > > 
> > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > 
> > > > > > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > > > > > those state registers).
> > > > > > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > > > > > until the device has stopped.
> > > > > > > > > Probably not. Let me clarify several points:
> > > > > > > > > 
> > > > > > > > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > > > > > > > virtqueue could be used for carrying any basic device facility like status
> > > > > > > > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > > > > > > > for device slicing at virtio level.
> > > > > > > > > - Even if we had introduced the admin virtqueue, we would still need a
> > > > > > > > > per-function interface for this. This is a must for nested virtualization;
> > > > > > > > > we can't always expect things like the PF to be assigned to the L1 guest.
> > > > > > > > > - According to the proposal, there's no need for the device to complete all
> > > > > > > > > the consumed buffers, device can choose to expose those inflight descriptors
> > > > > > > > > in a device specific way and set the STOP bit. This means, if we have the
> > > > > > > > > device specific in-flight descriptor reporting facility, the device can
> > > > > > > > > almost set the STOP bit immediately.
> > > > > > > > > - If we don't go with the basic device facility but instead use the admin
> > > > > > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > > > > > device status state machine; it will be some kind of sub-state, which looks
> > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > > > > > time).
> > > > > > > > > Well, you need some kind of waiting for sure; the device/DMA needs some time
> > > > > > > > > to be stopped. The downtime is determined by a specific virtio
> > > > > > > > > implementation, which is hard to restrict at the spec level. We can
> > > > > > > > > clarify that the device must set the STOP bit within e.g. 100ms.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > >       It can be implemented concurrently (setting the STOP bit on all
> > > > > > > > > > devices and then looping until all their Device Status fields have the
> > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > I still don't get what kind of complexity you are worried about here.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > waiting...
> > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > 
> > > > > > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > > > > 
> > > > > > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > 
> > > > > > > > > Since it was required for at least one transport, we need to do something
> > > > > > > > > similar when introducing a basic facility.
> > > > > > > > Adding STOP as a Device Status bit is a small and clean VIRTIO
> > > > > > > > spec change. I like that.
> > > > > > > > 
> > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > unbounded. For example, software virtio-blk/scsi implementations
> > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > 
> > > > > > > > The natural interface for long-running operations is virtqueue requests.
> > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > instead of a Device Status bit.
> > > > > > > So I'm not against the admin virtqueue. As said before, admin virtqueue
> > > > > > > could be used for carrying the device status bit.
> > > > > > > 
> > > > > > > Send a command to set STOP status bit to admin virtqueue. Device will make
> > > > > > > the command buffer used after it has successfully stopped the device.
> > > > > > > 
> > > > > > > AFAIK, they are not mutually exclusive, since they are trying to solve
> > > > > > > different problems.
> > > > > > > 
> > > > > > > Device status - basic device facility
> > > > > > > 
> > > > > > > Admin virtqueue - transport/device specific way to implement (part of) the
> > > > > > > device facility
> > > > > > > 
> > > > > > > > Although you mentioned that the stopped state needs to be reflected in
> > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > migrated.
> > > > > > > The guest won't see the real device status bit. VMM will shadow the device
> > > > > > > status bit in this case.
> > > > > > > 
> > > > > > > E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost
> > > > > > > device, and the guest is unaware of the migration.
> > > > > > > 
> > > > > > > The STOP status bit is set by QEMU on the real virtio hardware, but the
> > > > > > > guest will only see DRIVER_OK without STOP.
> > > > > > > 
> > > > > > > It's not hard to implement nesting on top; see the discussion initiated
> > > > > > > by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live
> > > > > > > migration.
> > > > > > > 
> > > > > > > 
> > > > > > > >      In fact, the VMM would need to hide this bit and it's safer to
> > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > See above, VMM may choose to hide or expose the capability. It's useful for
> > > > > > > migrating a nested guest.
> > > > > > > 
> > > > > > > If we design an interface that can't be used in the nested environment, it's
> > > > > > > not an ideal interface.
> > > > > > > 
> > > > > > > 
> > > > > > > > In addition, stateful devices need to load/save non-trivial amounts of
> > > > > > > > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > > > > > > > good fit again.
> > > > > > > I don't get the point here. You still need to address similar issues
> > > > > > > for the admin virtqueue: the unbounded time to freeze the device, and the
> > > > > > > interaction with the virtio device status state machine.
> > > > > > Device state can be large so a register interface would be a
> > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > saving/loading device state.
> > > > > So this patch doesn't mandate a register interface, does it?
> > > > You're right, not this patch. I mentioned it because your other patch
> > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> > > > it as a register interface.
> > > > 
> > > > > And DMA
> > > > > doesn't mean a virtqueue, it could be a transport specific method.
> > > > Yes, although virtqueues are a pretty good interface that works across
> > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > 
> > > > > I think we need to start from defining the state of one specific device and
> > > > > see what is the best interface.
> > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > 
> > > > First we need agreement on whether "device state" encompasses the full
> > > > state of the device or just state that is unknown to the VMM.
> > > 
> > > I think we've discussed this in the past. It can't work since:
> > > 
> > > 1) The state and its format must be clearly defined in the spec
> > > 2) We need to maintain migration compatibility and debug-ability
> > Some devices need implementation-specific state. They should still be
> > able to live migrate even if it means cross-implementation migration and
> > debug-ability is not possible.
> 
> 
> I think we need to re-visit this conclusion. Migration compatibility is
> pretty important, especially considering the software stack has spent a huge
> amount of effort in maintaining it.
> 
> If virtio hardware were to break this, we would lose all the
> advantages of being a standard device.
> 
> If we can't do live migration among:
> 
> 1) different backends, e.g. migrating from virtio hardware to virtio software
> 2) different vendors
> 
> We would fail as a standard device and the customer would in fact be
> implicitly locked to the vendor.

My virtiofs device implementation is backed by an in-memory file system.
The device state includes the contents of each file.

Your virtiofs device implementation uses Linux file handles to keep
track of open files. The device state includes Linux file handles (but
not the contents of each file) so the destination host can access the
same files on shared storage.

Cornelia's virtiofs device implementation is backed by an object storage
HTTP API. The device state includes API object IDs.

The device state is implementation-dependent. There is no standard
representation and it's not possible to migrate between device
implementations. How are they supposed to migrate?

This is why I think it's necessary to allow implementation-specific
device state representations.
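One way to reconcile this with a standardized spec is a standard envelope around an opaque payload: common fields identify the device type and the implementation that produced the state, and only a matching implementation accepts the payload. The layout below is purely hypothetical, to illustrate the idea:

```python
# Sketch of a device-state blob: a small standard header wrapping an
# implementation-specific (opaque) payload. Layout is invented here.
import struct

MAGIC = 0x56535441                 # arbitrary magic for the sketch
HDR = struct.Struct("<IHHI")       # magic, device_type, version, impl_id

def save_state(device_type, version, impl_id, payload: bytes) -> bytes:
    """Wrap an opaque payload in the standard header."""
    return HDR.pack(MAGIC, device_type, version, impl_id) + payload

def load_state(blob: bytes, expected_impl_id: int) -> bytes:
    """Return the payload, refusing blobs from a different implementation."""
    magic, dev_type, version, impl_id = HDR.unpack_from(blob)
    if magic != MAGIC or impl_id != expected_impl_id:
        raise ValueError("incompatible device state")
    return blob[HDR.size:]

# A virtiofs-like device (virtio device type 26) saving opaque state.
blob = save_state(device_type=26, version=1, impl_id=1, payload=b"\x00\x01")
assert load_state(blob, expected_impl_id=1) == b"\x00\x01"
```

Cross-implementation migration is possible only when both sides speak the same `impl_id`; a fully standardized state would simply be one well-known payload format.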

> > 
> > > 3) Not a proper uAPI design
> > I never understood this argument. The Linux uAPI passes through lots of
> > opaque data from devices to userspace. Allowing an
> > implementation-specific device state representation is nothing new. VFIO
> > already does it.
> 
> 
> I think we've already had a lot of discussion about VFIO but without a
> conclusion. Maybe we need the verdict from Linus or Greg (not sure if it's
> too late). But that's not related to virtio and this thread.
> 
> What you propose here kind of conflicts with the efforts of virtio. I
> think we all agree that we should define the state in the spec. Assuming
> this is correct:
> 
> 1) why do we still offer opaque migration state to userspace?

See above. Stateful devices may require an implementation-defined device
state representation.

> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> migration bytes stream?

Opaque data like D-Bus VMState:
https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html

> > 
> > > 
> > > > That's
> > > > basically the difference between the vhost/vDPA's selective passthrough
> > > > approach and VFIO's full passthrough approach.
> > > 
> > > We can't do VFIO full passthrough for migration anyway; some kind of mdev is
> > > required, but that duplicates the current vp_vdpa driver.
> > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > achieved without mdev:
> > 1. Define a migration PCI Capability that indicates support for
> >     VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> >     the migration interface in hardware instead of an mdev driver.
> 
> 
> So I think it still depends on the driver to implement migration state, which
> is vendor specific.

The current VFIO migration interface depends on a device-specific
software mdev driver but here I'm showing that the physical device can
implement the migration interface so that no device-specific driver code
is needed.

> 
> Note that it's just a uAPI definition, not something defined in the PCI
> spec.

Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI into a
standard PCI Capability to eliminate the need for device-specific
drivers.

> 
> Out of curiosity, the patch was merged without any real users in Linux.
> This is very bad since we lose the chance to audit the whole design.

I agree. It would have helped to have a complete vision for how live
migration should work along with demos. I don't see any migration code
in samples/vfio-mdev/ :(.

> > 2. The VMM either uses the migration PCI Capability directly from
> >     userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
> >     to userspace so migration can proceed in the same way as with
> >     VFIO/mdev drivers.
> > 3. The PCI Capability is not passed through to the guest.
> 
> 
> This brings troubles in the nested environment.

It depends on the device splitting/management design. If L0 wishes to
let L1 manage the VFs then it would need to expose a management device.
Since the migration interface is generic (not device-specific) a generic
management device solves this for all devices.

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-16  3:53                                 ` Jason Wang
@ 2021-07-19 12:45                                   ` Stefan Hajnoczi
  2021-07-20  3:04                                     ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-19 12:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 20433 bytes --]

On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
> 
> On 2021/7/16 10:03 AM, Jason Wang wrote:
> > 
> > On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > >         If I understand correctly, this is all driven from the
> > > > > > > > > > > > > > > > driver inside the guest, so for this to work the guest must be
> > > > > > > > > > > > > > > > running and already have initialised the driver.
> > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as
> > > > > > > > > > > > > > long as it intercepts the relevant configuration space (PCI,
> > > > > > > > > > > > > > MMIO, etc) from guest's reads and writes, and presents it as
> > > > > > > > > > > > > > coherent and transparent for the guest. Some use cases I can
> > > > > > > > > > > > > > imagine with a physical device (or vp_vdpa device) with
> > > > > > > > > > > > > > VIRTIO_F_STOP:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest
> > > > > > > > > > > > > > cannot stop the device, so any write to this flag is an
> > > > > > > > > > > > > > error/undefined.
> > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop
> > > > > > > > > > > > > > the device.
> > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and
> > > > > > > > > > > > > > the guest does not write to STOP in any moment of the LM. It
> > > > > > > > > > > > > > resets the destination device with the state, and then
> > > > > > > > > > > > > > initializes the device.
> > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the
> > > > > > > > > > > > > > source VMM migrates the device status. The destination VMM
> > > > > > > > > > > > > > realizes the bit, so it sets the bit in the destination too
> > > > > > > > > > > > > > after device initialization.
> > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't
> > > > > > > > > > > > > > matter what bit the HW has, but the VM can be migrated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through
> > > > > > > > > > > > > though. It's also not easy for devices to implement.
> > > > > > > > > > > > It just requires a new status bit. Anything that makes you think
> > > > > > > > > > > > it's hard to implement?
> > > > > > > > > > > >
> > > > > > > > > > > > E.g for networking device, it should be sufficient to use this
> > > > > > > > > > > > bit + the virtqueue state.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > It uses common techniques like register shadowing, without any
> > > > > > > > > > > > further machinery.
> > > > > > > > > > > >
> > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > >
> > > > > > > > > > > > (I think we all know migration will be very hard if we simply
> > > > > > > > > > > > pass through those state registers).
> > > > > > > > > > > If an admin virtqueue is used instead of the STOP Device Status
> > > > > > > > > > > field bit then there's no need to re-read the Device Status
> > > > > > > > > > > field in a loop until the device has stopped.
> > > > > > > > > > Probably not. Let me clarify several points:
> > > > > > > > > > 
> > > > > > > > > > - This proposal has nothing to do with
> > > > > > > > > > admin virtqueue. Actually, admin
> > > > > > > > > > virtqueue could be used for carrying any
> > > > > > > > > > basic device facility like status
> > > > > > > > > > bit. E.g I'm going to post patches that
> > > > > > > > > > use admin virtqueue as a "transport"
> > > > > > > > > > for device slicing at virtio level.
> > > > > > > > > > - Even if we had introduced admin
> > > > > > > > > > virtqueue, we still need a per function
> > > > > > > > > > interface for this. This is a must for
> > > > > > > > > > nested virtualization, we can't
> > > > > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > > > > - According to the proposal, there's no
> > > > > > > > > > need for the device to complete all
> > > > > > > > > > the consumed buffers, device can choose
> > > > > > > > > > to expose those inflight descriptors
> > > > > > > > > > in a device specific way and set the
> > > > > > > > > > STOP bit. This means, if we have the
> > > > > > > > > > device specific in-flight descriptor
> > > > > > > > > > reporting facility, the device can
> > > > > > > > > > almost set the STOP bit immediately.
> > > > > > > > > > - If we don't go with the basic device
> > > > > > > > > > facility but use the admin
> > > > > > > > > > virtqueue specific method, we still need
> > > > > > > > > > to clarify how it works with the
> > > > > > > > > > device status state machine, it will be
> > > > > > > > > > some kind of sub-states which looks
> > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > When migrating a guest with many
> > > > > > > > > > > VIRTIO devices a busy waiting
> > > > > > > > > > > approach
> > > > > > > > > > > extends downtime if implemented
> > > > > > > > > > > sequentially (stopping one device at
> > > > > > > > > > > a
> > > > > > > > > > > time).
> > > > > > > > > > Well, you need some kind of waiting for sure; the
> > > > > > > > > > device/DMA needs some time to be stopped. The downtime is
> > > > > > > > > > determined by the specific virtio implementation, which is
> > > > > > > > > > hard to restrict at the spec level. We can clarify that the
> > > > > > > > > > device must set the STOP bit within e.g. 100ms.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > >       It can be implemented
> > > > > > > > > > > concurrently (setting the STOP bit
> > > > > > > > > > > on all
> > > > > > > > > > > devices and then looping until all
> > > > > > > > > > > their Device Status fields have the
> > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > I still don't get what kind of complexity you are worried about here.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > waiting...
> > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > 
> > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common
> > > > > > > > > > configuration structure layout
> > > > > > > > > > 
> > > > > > > > > > After writing 0 to device_status, the
> > > > > > > > > > driver MUST wait for a read of
> > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > 
> > > > > > > > > > Since it is already required for at least one transport, we
> > > > > > > > > > need to do something similar when introducing a basic facility.
> > > > > > > > > Adding the STOP bit as a Device Status bit is a small and
> > > > > > > > > clean VIRTIO spec change. I like that.
> > > > > > > > > 
> > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > unbounded. For example, software virtio-blk/scsi
> > > > > > > > > implementations cannot immediately cancel in-flight I/O
> > > > > > > > > requests on Linux hosts.
> > > > > > > > > 
> > > > > > > > > The natural interface for long-running
> > > > > > > > > operations is virtqueue requests.
> > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > instead of a Device Status bit.
> > > > > > > > So I'm not against the admin virtqueue. As said
> > > > > > > > before, admin virtqueue
> > > > > > > > could be used for carrying the device status bit.
> > > > > > > > 
> > > > > > > > Send a command to set STOP status bit to admin
> > > > > > > > virtqueue. Device will make
> > > > > > > > the command buffer used after it has
> > > > > > > > successfully stopped the device.
> > > > > > > > 
> > > > > > > > AFAIK, they are not mutually exclusive, since
> > > > > > > > they are trying to solve
> > > > > > > > different problems.
> > > > > > > > 
> > > > > > > > Device status - basic device facility
> > > > > > > > 
> > > > > > > > Admin virtqueue - transport/device specific way
> > > > > > > > to implement (part of) the
> > > > > > > > device facility
> > > > > > > > 
> > > > > > > > > Although you mentioned that the stopped
> > > > > > > > > state needs to be reflected in
> > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > migrated.
> > > > > > > > The guest won't see the real device status bit.
> > > > > > > > VMM will shadow the device
> > > > > > > > status bit in this case.
> > > > > > > > 
> > > > > > > > E.g with the current vhost-vDPA, vDPA behave
> > > > > > > > like a vhost device, guest is
> > > > > > > > unaware of the migration.
> > > > > > > > 
> > > > > > > > STOP status bit is set by Qemu to real virtio
> > > > > > > > hardware. But guest will only
> > > > > > > > see the DRIVER_OK without STOP.
> > > > > > > > 
> > > > > > > > It's not hard to implement the nested case on top; see the
> > > > > > > > discussion initiated by Eugenio about how to expose
> > > > > > > > VIRTIO_F_STOP to the guest for nested live migration.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > >      In fact, the VMM would need to hide
> > > > > > > > > this bit and it's safer to
> > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > See above, VMM may choose to hide or expose the
> > > > > > > > capability. It's useful for
> > > > > > > > migrating a nested guest.
> > > > > > > > 
> > > > > > > > If we design an interface that can't be used in a nested
> > > > > > > > environment, it's not an ideal interface.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > In addition, stateful devices need to
> > > > > > > > > load/save non-trivial amounts of
> > > > > > > > > data. They need DMA to do this efficiently,
> > > > > > > > > so an admin virtqueue is a
> > > > > > > > > good fit again.
> > > > > > > > I don't get the point here. You still need to address exactly
> > > > > > > > the same issues for the admin virtqueue: the unbounded time in
> > > > > > > > freezing the device, and the interaction with the virtio device
> > > > > > > > status state machine.
> > > > > > > Device state can be large so a register interface would be a
> > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > saving/loading device state.
> > > > > > So this patch doesn't mandate a register interface, does it?
> > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
> > > > > implements it as a register interface.
> > > > > 
> > > > > > And DMA
> > > > > > doesn't mean a virtqueue; it could be a transport specific method.
> > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > 
> > > > > > I think we need to start from defining the state of one
> > > > > > specific device and
> > > > > > see what is the best interface.
> > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > > 
> > > > > First we need agreement on whether "device state" encompasses the full
> > > > > state of the device or just state that is unknown to the VMM.
> > > > 
> > > > I think we've discussed this in the past. It can't work since:
> > > > 
> > > > 1) The state and its format must be clearly defined in the spec
> > > > 2) We need to maintain migration compatibility and debug-ability
> > > Some devices need implementation-specific state. They should still be
> > > able to live migrate even if it means cross-implementation migration and
> > > debug-ability is not possible.
> > 
> > 
> > I think we need to re-visit this conclusion. Migration compatibility is
> > pretty important, especially considering the software stack has spent a
> > huge amount of effort maintaining it.
> > 
> > If a virtio hardware implementation breaks this, it means we lose all the
> > advantages of being a standard device.
> > 
> > If we can't do live migration among:
> > 
> > 1) different backends, e.g. migrating from virtio hardware to virtio
> > software
> > 2) different vendors
> > 
> > We fail to qualify as a standard device and the customer is in fact
> > implicitly locked in by the vendor.
> > 
> > 
> > > 
> > > > 3) Not a proper uAPI design
> > > I never understood this argument. The Linux uAPI passes through lots of
> > > opaque data from devices to userspace. Allowing an
> > > implementation-specific device state representation is nothing new. VFIO
> > > already does it.
> > 
> > 
> > I think we've already had a lot of discussion for VFIO but without a
> > conclusion. Maybe we need a verdict from Linus or Greg (not sure if
> > it's too late). But that's not related to virtio and this thread.
> > 
> > What you propose here is in conflict with the efforts of virtio. I
> > think we all agree that we should define the state in the spec.
> > Assuming this is correct:
> > 
> > 1) why do we still offer opaque migration state to userspace?
> > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > migration bytes stream?
> > 
> > We should standardize everything that is visible by the driver to be a
> > standard device. That's the power of virtio.
> > 
> > 
> > > 
> > > > 
> > > > > That's
> > > > > basically the difference between the vhost/vDPA's selective
> > > > > passthrough
> > > > > approach and VFIO's full passthrough approach.
> > > > 
> > > > We can't do VFIO full passthrough for migration anyway; some kind of
> > > > mdev is required, but that duplicates the current vp_vdpa driver.
> > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > achieved without mdev:
> > > 1. Define a migration PCI Capability that indicates support for
> > >     VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > >     the migration interface in hardware instead of an mdev driver.
> > 
> > 
> > So I think it still depends on the driver to implement migration state,
> > which is vendor specific.
> > 
> > Note that it's just an uAPI definition not something defined in the PCI
> > spec.
> > 
> > Out of curiosity, the patch was merged without any real users in
> > Linux. This is very bad since we lose the chance to audit the whole
> > design.
> > 
> > 
> > > 2. The VMM either uses the migration PCI Capability directly from
> > >     userspace or core VFIO PCI code advertises
> > > VFIO_REGION_TYPE_MIGRATION
> > >     to userspace so migration can proceed in the same way as with
> > >     VFIO/mdev drivers.
> > > 3. The PCI Capability is not passed through to the guest.
> > 
> > 
> > This brings trouble in a nested environment.
> > 
> > Thanks
> > 
> > 
> > > 
> > > Changpeng Liu originally mentioned the idea of defining a migration PCI
> > > Capability.
> > > 
> > > > >    For example, some of the
> > > > > virtio-net state is available to the VMM with vhost/vDPA because it
> > > > > intercepts the virtio-net control virtqueue.
> > > > > 
> > > > > Also, we need to decide to what degree the device state representation
> > > > > is standardized in the VIRTIO specification.
> > > > 
> > > > I think all the states must be defined in the spec otherwise the device
> > > > can't claim it supports migration at virtio level.
> > > > 
> > > > 
> > > > >    I think it makes sense to
> > > > > standardize it if it's possible to convey all necessary
> > > > > state and device
> > > > > implementors can easily implement this device state representation.
> > > > 
> > > > I suspect it's highly device specific. E.g. can we standardize
> > > > device (GPU) memory?
> > > For devices that have little internal state it's possible to define a
> > > standard device state representation.
> > > 
> > > For other devices, like virtio-crypto, virtio-fs, etc it becomes
> > > difficult because the device implementation contains state that will be
> > > needed but is very specific to the implementation. These devices *are*
> > > migratable but they don't have standard state. Even here there is a
> > > spectrum:
> > > - Host OS-specific state (e.g. Linux struct file_handles)
> > > - Library-specific state (e.g. crypto library state)
> > > - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
> > > 
> > > This is why I think it's necessary to support both standard device state
> > > representations and implementation-specific device state
> > > representations.
> 
> 
> Having two ways will bring extra complexity. That's why I suggest:
> 
> - to have a general facility for the virtqueue state to be migrated
> - to leave the device specific state device specific, so the device can
> choose whatever way or interface is convenient.

I don't think we have a choice. For stateful devices it can be
impossible to define a standard device state representation.

Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-19 12:08                                   ` Stefan Hajnoczi
@ 2021-07-20  2:46                                     ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-20  2:46 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Max Gurtovoy, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 2021/7/19 8:08 PM, Stefan Hajnoczi wrote:
> On Fri, Jul 16, 2021 at 09:48:43AM +0800, Jason Wang wrote:
>> On 2021/7/15 5:26 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jul 15, 2021 at 09:38:55AM +0800, Jason Wang wrote:
>>>> On 2021/7/15 12:22 AM, Max Gurtovoy wrote:
>>>>> On 7/14/2021 6:07 PM, Stefan Hajnoczi wrote:
>>>>>>> It requires much more work than the simple virtqueue interface (the
>>>>>>> main issue is that the functionality is not self-contained in a
>>>>>>> single function):
>>>>>>>
>>>>>>> 1) how to interact with the existing device status state machine?
>>>>>>> 2) how to make it work in a nested environment?
>>>>>>> 3) how to migrate the PF?
>>>>>>> 4) do we need to allow more control other than just stop/freeze
>>>>>>> the device
>>>>>>> in the admin virtqueue? If yes, how to handle the concurrent
>>>>>>> access from PF
>>>>>>> and VF?
>>>>>>> 5) how it is expected to work with non-PCI virtio device?
>>>>>> I guess your device splitting proposal addresses some of these things?
>>>>>>
>>>>>> Max probably has the most to say about these points.
>>>>>>
>>>>>> If you want more input I can try to answer too, but I personally am not
>>>>>> developing devices that need this right now, so I might not be the best
>>>>>> person to propose solutions.
>>>>> I think we mentioned this in the past and agreed that the only common
>>>>> entity between my solution for virtio VF migration to this proposal is
>>>>> the new admin control queue.
>>>>>
>>>>> I can prepare some draft for this.
>>>>>
>>>>> In our solution the PF will manage the migration process for its VFs using
>>>>> the PF admin queue. PF is not migratable.
>>>> That limits the use cases.
>>>>
>>>>
>>>>> I don't know who is using nested environments in production so I don't
>>>>> know if it's worth talking about that.
>>>> There should be plenty users for the nested case.
>>> Yes, nested virtualization is becoming available in clouds, etc. I think
>>> nested virtualization support should be part of the design.
>>>
>>>>> But, if you would like to implement it for testing, no problem. The VF
>>>>> at level n is probably seen as a PF at level n+1, so it can manage the
>>>>> migration process for its nested VFs.
>>>> The PF dependency makes the design almost impossible to use in a nested
>>>> environment.
>>> I'm not sure I understood Max's example, but first I want to check I
>>> understand yours:
>>>
>>> A physical PF is passed through to an L1 guest. L2 guests are assigned
>>> VFs created by the L1 guest from the PF.
>>>
>>> Now we want to live migrate the L1 guest to another host. We need to
>>> migrate the PF and its VFs are automatically included since there is no
>>> migration from the L2 perspective?
>>
>> Yes, and I believe the more common case is:
>>
>> PF is for L0, and we want to migrate L2 guest.
>>
>> This can hardly work in the current design.
>>
>> The reason is that the functionality is not self-contained in the VF.
> Thanks for highlighting this case. It requires that the mechanism for
> stopping and saving/loading state comes with the VF so the L1 guest can
> perform live migration even though it does not have L0 PF access.
>
> Stefan


Yes.

Thanks



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-dev] Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-19 12:18                                 ` [virtio-dev] " Stefan Hajnoczi
@ 2021-07-20  2:50                                   ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-20  2:50 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/19 8:18 PM, Stefan Hajnoczi wrote:
> On Fri, Jul 16, 2021 at 09:44:26AM +0800, Jason Wang wrote:
>> On 2021/7/15 5:16 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>> And as I've stated several times, virtqueue is the interface or transport
>>>>>> which carries the commands for implementing specific semantics. It doesn't
>>>>>> conflict with what is proposed in this patch.
>>>>> The abstract operations for stopping the device and fetching virtqueue
>>>>> state sound good to me, but I don't think a Device Status field STOP bit
>>>>> should be added. An out-of-band stop operation would support devices
>>>>> that take a long time to stop better.
>>>> So long-running requests are not something introduced by the STOP
>>>> bit. The spec already uses that pattern for reset.
>>> Reset doesn't affect migration downtime. The register polling approach
>>> is problematic during migration downtime because it's difficult to stop
>>> multiple devices and do other downtime cleanup concurrently.
>>
>> This part I don't understand. We don't have a centralized control path
>> that is used for each virtual function.
>>
>> The VMM is free to stop multiple devices and poll all of their status fields?
> Yes, it's possible to do that but I think it's harder for VMMs to
> implement and consumes CPU (which competes with software devices that
> are trying to stop).


Possibly. Actually, there are two issues:

1) How to send the command and check that the command succeeded
2) Whether or not we need a notification for the completion of the command

What is being proposed covers 1), and 2) could be done in a transport
specific way (but that requires more thought).

Actually, it's the software that chooses the best way for itself; it can do
busy polling, a timer, etc.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-19 12:43                                 ` Stefan Hajnoczi
@ 2021-07-20  3:02                                   ` Jason Wang
  2021-07-20 10:19                                     ` Stefan Hajnoczi
  2021-07-20 12:27                                     ` Max Gurtovoy
  0 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-20  3:02 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>          If I understand correctly, this is all
>>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>>>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>>>>>>>>>> it intercepts the relevant configuration space (PCI, MMIO, etc) from
>>>>>>>>>>>>>> guest's reads and writes, and presents it as coherent and transparent
>>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>>>>>>>>>> vp_vdpa device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>>>>>>>>>>>> what bit the HW has, but the VM can be migrated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>>>>>>>>>> to implement?
>>>>>>>>>>>>
>>>>>>>>>>>> E.g for networking device, it should be sufficient to use this bit + the
>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>> It uses common techniques like register shadowing, without any further
>>>>>>>>>>>> machinery.
>>>>>>>>>>>>
>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>
>>>>>>>>>>>> (I think we all know migration will be very hard if we simply pass through
>>>>>>>>>>>> those state registers).
>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>>>>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>>>>>>>> until the device has stopped.
>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>
>>>>>>>>>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>>>>>>>>>> virtqueue could be used for carrying any basic device facility like status
>>>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>> - Even if we had introduced admin virtqueue, we still need a per function
>>>>>>>>>> interface for this. This is a must for nested virtualization, we can't
>>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>>> - According to the proposal, there's no need for the device to complete all
>>>>>>>>>> the consumed buffers, device can choose to expose those inflight descriptors
>>>>>>>>>> in a device specific way and set the STOP bit. This means, if we have the
>>>>>>>>>> device specific in-flight descriptor reporting facility, the device can
>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>> - If we don't go with the basic device facility but use the admin
>>>>>>>>>> virtqueue specific method, we still need to clarify how it works with the
>>>>>>>>>> device status state machine, it will be some kind of sub-states which looks
>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>>>>>> time).
>>>>>>>>>> Well, you need some kind of waiting for sure; the device/DMA needs some
>>>>>>>>>> time to be stopped. The downtime is determined by the specific virtio
>>>>>>>>>> implementation, which is hard to restrict at the spec level. We could
>>>>>>>>>> clarify that the device must set the STOP bit within e.g. 100ms.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>        It can be implemented concurrently (setting the STOP bit on all
>>>>>>>>>>> devices and then looping until all their Device Status fields have the
>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>
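The concurrent approach described above (set STOP on every device first, then poll the group) can be sketched as a toy simulation. `FakeDevice`, its `delay`, and the STOP bit value are all invented for illustration; real code would poll each device's Device Status field:

```python
# Toy simulation: stopping N devices concurrently costs roughly
# max(delay) instead of the sum(delay) a sequential stop would cost.
import time

STOP = 1 << 5  # placeholder bit position, not a value assigned by the spec

class FakeDevice:
    """Toy device that needs `delay` seconds before it reports STOP."""
    def __init__(self, delay):
        self.delay = delay
        self.deadline = None
        self.status = 0

    def request_stop(self):
        self.deadline = time.monotonic() + self.delay

    def read_status(self):
        if self.deadline is not None and time.monotonic() >= self.deadline:
            self.status |= STOP
        return self.status

def stop_concurrently(devices, poll_interval=0.005):
    for dev in devices:            # 1. kick off every stop first
        dev.request_stop()
    pending = set(devices)         # 2. then poll the whole group
    while pending:
        pending = {d for d in pending if not (d.read_status() & STOP)}
        if pending:
            time.sleep(poll_interval)

devices = [FakeDevice(0.1) for _ in range(4)]
start = time.monotonic()
stop_concurrently(devices)
elapsed = time.monotonic() - start
```

The extra complexity is essentially the bookkeeping of the `pending` set, which is the point under debate above.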
>>>>>>>>>>
>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>> waiting...
>>>>>>>>>> Busy waiting is not something introduced by this patch:
>>>>>>>>>>
>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>>>>>>
>>>>>>>>>> After writing 0 to device_status, the driver MUST wait for a read of
>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>
>>>>>>>>>> Since it was already required for at least one transport, we need to do
>>>>>>>>>> something similar when introducing this basic facility.
>>>>>>>>> Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
>>>>>>>>> spec change. I like that.
>>>>>>>>>
>>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>>> unbounded. For example, software virtio-blk/scsi implementations
>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>
>>>>>>>>> The natural interface for long-running operations is virtqueue requests.
>>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>>> instead of a Device Status bit.
>>>>>>>> So I'm not against the admin virtqueue. As said before, the admin
>>>>>>>> virtqueue could be used to carry the device status bit.
>>>>>>>>
>>>>>>>> Send a command to set the STOP status bit via the admin virtqueue. The
>>>>>>>> device will mark the command buffer used after it has successfully
>>>>>>>> stopped the device.
>>>>>>>>
>>>>>>>> AFAIK, they are not mutually exclusive, since they are trying to solve
>>>>>>>> different problems.
>>>>>>>>
>>>>>>>> Device status - basic device facility
>>>>>>>>
>>>>>>>> Admin virtqueue - transport/device specific way to implement (part of) the
>>>>>>>> device facility
>>>>>>>>
>>>>>>>>> Although you mentioned that the stopped state needs to be reflected in
>>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>>> migrated.
>>>>>>>> The guest won't see the real device status bit. VMM will shadow the device
>>>>>>>> status bit in this case.
>>>>>>>>
>>>>>>>> E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost
>>>>>>>> device; the guest is unaware of the migration.
>>>>>>>>
>>>>>>>> The STOP status bit is set by QEMU on the real virtio hardware, but the
>>>>>>>> guest will only see DRIVER_OK without STOP.
>>>>>>>>
>>>>>>>> It's not hard to implement nesting on top; see the discussion initiated
>>>>>>>> by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested
>>>>>>>> live migration.
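The shadowing described here — the VMM sets STOP on the real hardware while the guest keeps seeing DRIVER_OK — can be sketched in a few lines. The bit values below are placeholders, not the values the spec assigns:

```python
# Toy sketch of VMM status-register shadowing: the VMM writes STOP to
# the physical device, but masks it out of the value the guest reads.
ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK = 1, 2, 4, 8
STOP = 1 << 5  # placeholder bit; the real bit would be assigned by the spec

hw_status = DRIVER_OK  # what the physical device currently reports

def vmm_stop_device():
    """VMM-initiated stop for live migration (invisible to the guest)."""
    global hw_status
    hw_status |= STOP          # written to the real device by the VMM

def guest_read_status():
    """Guest reads go through the VMM, which hides the STOP bit."""
    return hw_status & ~STOP

vmm_stop_device()
```

A nested setup would instead expose the bit to the L1 VMM, which is the case Eugenio's discussion covers.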
>>>>>>>>
>>>>>>>>
>>>>>>>>>       In fact, the VMM would need to hide this bit and it's safer to
>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>> See above: the VMM may choose to hide or expose the capability. It's
>>>>>>>> useful for migrating a nested guest.
>>>>>>>>
>>>>>>>> If we design an interface that can't be used in the nested environment,
>>>>>>>> it's not an ideal interface.
>>>>>>>>
>>>>>>>>
>>>>>>>>> In addition, stateful devices need to load/save non-trivial amounts of
>>>>>>>>> data. They need DMA to do this efficiently, so an admin virtqueue is a
>>>>>>>>> good fit again.
>>>>>>>> I don't get the point here. You still need to address exactly the same
>>>>>>>> issues for the admin virtqueue: the unbounded time spent freezing the
>>>>>>>> device, and the interaction with the virtio device status state machine.
>>>>>>> Device state can be large, so a register interface would be a
>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>> saving/loading device state.
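To picture why a virtqueue suits large state, consider a DMA-style save command: the driver supplies a buffer descriptor and the device fills it. The command layout below is entirely hypothetical — nothing like it is proposed in this thread — and is only meant to show the framing:

```python
# Hypothetical admin-virtqueue command framing for saving device state.
# The driver posts this header plus a writable buffer; the device DMAs
# the state into the buffer and marks the command used. Layout invented.
import struct

CMD_SAVE_STATE = 0x01  # made-up opcode

def build_save_cmd(vdev_id, buf_len):
    # u16 opcode, u16 target device/function id, u32 buffer length
    return struct.pack("<HHI", CMD_SAVE_STATE, vdev_id, buf_len)

def parse_save_cmd(cmd):
    opcode, vdev_id, buf_len = struct.unpack("<HHI", cmd)
    return opcode, vdev_id, buf_len

cmd = build_save_cmd(vdev_id=3, buf_len=4096)
```

The same framing works for a register-free load path, which is the "DMA is needed" argument above.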
>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>> You're right, not this patch. I mentioned it because your other patch
>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
>>>>> it as a register interface.
>>>>>
>>>>>> And DMA doesn't mean a virtqueue; it could be a transport-specific
>>>>>> method.
>>>>> Yes, although virtqueues are a pretty good interface that works across
>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>>
>>>>>> I think we need to start by defining the state of one specific device
>>>>>> and see what the best interface is.
>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>
>>>>> First we need agreement on whether "device state" encompasses the full
>>>>> state of the device or just state that is unknown to the VMM.
>>>> I think we've discussed this in the past. It can't work since:
>>>>
>>>> 1) The state and its format must be clearly defined in the spec
>>>> 2) We need to maintain migration compatibility and debug-ability
>>> Some devices need implementation-specific state. They should still be
>>> able to live migrate even if it means cross-implementation migration and
>>> debug-ability is not possible.
>>
>> I think we need to revisit this conclusion. Migration compatibility is
>> pretty important, especially considering the software stack has spent a
>> huge amount of effort maintaining it.
>>
>> If a virtio hardware device broke this, it would mean losing all the
>> advantages of being a standard device.
>>
>> If we can't do live migration among:
>>
>> 1) different backends, e.g. migrating from virtio hardware to a software
>> implementation
>> 2) different vendors
>>
>> We fail to qualify as a standard device, and the customer is in fact
>> implicitly locked in by the vendor.
> My virtiofs device implementation is backed by an in-memory file system.
> The device state includes the contents of each file.
>
> Your virtiofs device implementation uses Linux file handles to keep
> track of open files. The device state includes Linux file handles (but
> not the contents of each file) so the destination host can access the
> same files on shared storage.
>
> Cornelia's virtiofs device implementation is backed by an object storage
> HTTP API. The device state includes API object IDs.
>
> The device state is implementation-dependent. There is no standard
> representation and it's not possible to migrate between device
> implementations. How are they supposed to migrate?


So if I understand correctly, virtio-fs is not designed to be migratable?

(Checking the current virtio-fs support in QEMU, it looks to me like it
has a migration blocker.)


>
> This is why I think it's necessary to allow implementation-specific
> device state representations.


Or perhaps you mean you don't support cross-backend migration. That
sounds like a drawback: it's then not really a standard device but a
vendor/implementation-specific one.

It would bring a lot of trouble, not only for the implementation but 
also for management. Maybe we can start by adding migration support for 
some specific backend and go from there.


>
>>>> 3) Not a proper uAPI design
>>> I never understood this argument. The Linux uAPI passes through lots of
>>> opaque data from devices to userspace. Allowing an
>>> implementation-specific device state representation is nothing new. VFIO
>>> already does it.
>>
>> I think we've already had a lot of discussion about VFIO, but without a
>> conclusion. Maybe we need a verdict from Linus or Greg (not sure if it's
>> too late). But that's not related to virtio or this thread.
>>
>> What you propose here kind of conflicts with the efforts of virtio. I
>> think we all agree that we should define the state in the spec. Assuming
>> this is correct:
>>
>> 1) why do we still offer opaque migration state to userspace?
> See above. Stateful devices may require an implementation-defined device
> state representation.


So my point still stands: it's not a standard device if we do this.


>
>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>> migration bytes stream?
> Opaque data like D-Bus VMState:
> https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
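Opaque payloads like D-Bus VMState stay compatible by being wrapped in a self-describing envelope inside the migration stream. A minimal sketch of that idea follows; the framing is invented for illustration and is not QEMU's actual vmstate format:

```python
# Hypothetical self-describing envelope for an implementation-specific
# device-state blob: producer id + version + length-prefixed payload.
import struct

def wrap_opaque(impl_id: bytes, version: int, blob: bytes) -> bytes:
    """u8 id length, id bytes, u32 version, u32 blob length, blob."""
    return (struct.pack("<B", len(impl_id)) + impl_id +
            struct.pack("<II", version, len(blob)) + blob)

def unwrap_opaque(data: bytes):
    n = data[0]
    impl_id = data[1:1 + n]
    version, blob_len = struct.unpack_from("<II", data, 1 + n)
    blob = data[1 + n + 8:1 + n + 8 + blob_len]
    return impl_id, version, blob

packed = wrap_opaque(b"vendorX-fs", 1, b"\xde\xad\xbe\xef")
```

The destination can refuse a blob whose producer id or version it doesn't recognize — which is exactly the cross-backend limitation being debated here.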


Actually, I meant how to keep the opaque state compatible with all the 
existing devices that can do migration.

E.g. we want to live migrate virtio-blk between any backends (from a 
hardware device to a software backend).


>
>>>>> That's
>>>>> basically the difference between the vhost/vDPA's selective passthrough
>>>>> approach and VFIO's full passthrough approach.
>>>> We can't do VFIO full passthrough for migration anyway; some kind of mdev
>>>> is required, but that duplicates the current vp_vdpa driver.
>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>> achieved without mdev:
>>> 1. Define a migration PCI Capability that indicates support for
>>>      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>      the migration interface in hardware instead of an mdev driver.
>>
>> So I think it still depends on the driver to implement the migration
>> state, which is vendor specific.
> The current VFIO migration interface depends on a device-specific
> software mdev driver but here I'm showing that the physical device can
> implement the migration interface so that no device-specific driver code
> is needed.


This is not what I read from the patch:

  * device_state: (read/write)
  *      - The user application writes to this field to inform the vendor
  *        driver about the device state to be transitioned to.
  *      - The vendor driver should take the necessary actions to change the
  *        device state. After successful transition to a given state, the
  *        vendor driver should return success on write(device_state, state)
  *        system call. If the device state transition fails, the vendor
  *        driver should return an appropriate -errno for the fault condition.
The vendor driver needs to mediate between the uAPI and the actual device.


>
>> Note that it's just an uAPI definition not something defined in the PCI
>> spec.
> Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI into a
> standard PCI Capability to eliminate the need for device-specific
> drivers.


Ok.


>
>> Out of curiosity: the patch was merged without any real users in Linux.
>> This is very bad since we lose the chance to audit the whole design.
> I agree. It would have helped to have a complete vision for how live
> migration should work along with demos. I don't see any migration code
> in samples/vfio-mdev/ :(.


Right.


>>> 2. The VMM either uses the migration PCI Capability directly from
>>>      userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
>>>      to userspace so migration can proceed in the same way as with
>>>      VFIO/mdev drivers.
>>> 3. The PCI Capability is not passed through to the guest.
>>
>> This brings trouble in the nested environment.
> It depends on the device splitting/management design. If L0 wishes to
> let L1 manage the VFs then it would need to expose a management device.
> Since the migration interface is generic (not device-specific) a generic
> management device solves this for all devices.


Right, but it's a burden to expose the management device, or it may 
simply not work.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-19 12:45                                   ` Stefan Hajnoczi
@ 2021-07-20  3:04                                     ` Jason Wang
  2021-07-20  8:50                                       ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-20  3:04 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/19 8:45 PM, Stefan Hajnoczi wrote:
> On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
>> On 2021/7/16 10:03 AM, Jason Wang wrote:
>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>> If I understand correctly, this is all driven from the driver inside
>>>>>>>>>>>>>>>>> the guest, so for this to work the guest must be running and already
>>>>>>>>>>>>>>>>> have initialised the driver.
>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as it
>>>>>>>>>>>>>>> intercepts the relevant configuration space (PCI, MMIO, etc) from the
>>>>>>>>>>>>>>> guest's reads and writes, and presents it as coherent and transparent to
>>>>>>>>>>>>>>> the guest. Some use cases I can imagine with a physical device (or
>>>>>>>>>>>>>>> vp_vdpa device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the guest
>>>>>>>>>>>>>>> does not write to STOP at any moment of the LM. It resets the
>>>>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest, so it doesn't matter
>>>>>>>>>>>>>>> which bit the HW has, but the VM can be migrated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> It's doable like this. It's
>>>>>>>>>>>>>> all a lot of hoops to jump
>>>>>>>>>>>>>> through though.
>>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>>> It just requires a new status
>>>>>>>>>>>>> bit. Anything that makes you
>>>>>>>>>>>>> think it's hard
>>>>>>>>>>>>> to implement?
>>>>>>>>>>>>>
>>>>>>>>>>>>> E.g for networking device, it
>>>>>>>>>>>>> should be sufficient to use this
>>>>>>>>>>>>> bit + the
>>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why don't we design the
>>>>>>>>>>>>>> feature in a way that is
>>>>>>>>>>>>>> useable by VMMs
>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>> It use the common technology
>>>>>>>>>>>>> like register shadowing without
>>>>>>>>>>>>> any further
>>>>>>>>>>>>> stuffs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I think we all know migration
>>>>>>>>>>>>> will be very hard if we simply
>>>>>>>>>>>>> pass through
>>>>>>>>>>>>> those state registers).
>>>>>>>>>>>> If an admin virtqueue is used
>>>>>>>>>>>> instead of the STOP Device Status
>>>>>>>>>>>> field
>>>>>>>>>>>> bit then there's no need to re-read
>>>>>>>>>>>> the Device Status field in a loop
>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>> Probably not. Let me to clarify several points:
>>>>>>>>>>>
>>>>>>>>>>> - This proposal has nothing to do with
>>>>>>>>>>> admin virtqueue. Actually, admin
>>>>>>>>>>> virtqueue could be used for carrying any
>>>>>>>>>>> basic device facility like status
>>>>>>>>>>> bit. E.g I'm going to post patches that
>>>>>>>>>>> use admin virtqueue as a "transport"
>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>> - Even if we had introduced admin
>>>>>>>>>>> virtqueue, we still need a per function
>>>>>>>>>>> interface for this. This is a must for
>>>>>>>>>>> nested virtualization, we can't
>>>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>>>> - According to the proposal, there's no
>>>>>>>>>>> need for the device to complete all
>>>>>>>>>>> the consumed buffers, device can choose
>>>>>>>>>>> to expose those inflight descriptors
>>>>>>>>>>> in a device specific way and set the
>>>>>>>>>>> STOP bit. This means, if we have the
>>>>>>>>>>> device specific in-flight descriptor
>>>>>>>>>>> reporting facility, the device can
>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>> - If we don't go with the basic device
>>>>>>>>>>> facility but using the admin
>>>>>>>>>>> virtqueue specific method, we still need
>>>>>>>>>>> to clarify how it works with the
>>>>>>>>>>> device status state machine, it will be
>>>>>>>>>>> some kind of sub-states which looks
>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> When migrating a guest with many
>>>>>>>>>>>> VIRTIO devices a busy waiting
>>>>>>>>>>>> approach
>>>>>>>>>>>> extends downtime if implemented
>>>>>>>>>>>> sequentially (stopping one device at
>>>>>>>>>>>> a
>>>>>>>>>>>> time).
>>>>>>>>>>> Well. You need some kinds of waiting for
>>>>>>>>>>> sure, the device/DMA needs sometime
>>>>>>>>>>> to be stopped. The downtime is determined by a specific virtio
>>>>>>>>>>> implementation which is hard to be
>>>>>>>>>>> restricted at the spec level. We can
>>>>>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>        It can be implemented
>>>>>>>>>>>> concurrently (setting the STOP bit
>>>>>>>>>>>> on all
>>>>>>>>>>>> devices and then looping until all
>>>>>>>>>>>> their Device Status fields have the
>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>> I still don't get what kind of complexity did you worry here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>> waiting...
>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>
>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common
>>>>>>>>>>> configuration structure layout
>>>>>>>>>>>
>>>>>>>>>>> After writing 0 to device_status, the
>>>>>>>>>>> driver MUST wait for a read of
>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>
>>>>>>>>>>> Since it was required for at least one
>>>>>>>>>>> transport. We need do something
>>>>>>>>>>> similar to when introducing basic facility.
>>>>>>>>>> Adding the STOP but as a Device Status bit
>>>>>>>>>> is a small and clean VIRTIO
>>>>>>>>>> spec change. I like that.
>>>>>>>>>>
>>>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>>>> unbounded. For example, software
>>>>>>>>>> virtio-blk/scsi implementations since
>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>>
>>>>>>>>>> The natural interface for long-running
>>>>>>>>>> operations is virtqueue requests.
>>>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>> So I'm not against the admin virtqueue. As said
>>>>>>>>> before, admin virtqueue
>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>
>>>>>>>>> Send a command to set STOP status bit to admin
>>>>>>>>> virtqueue. Device will make
>>>>>>>>> the command buffer used after it has
>>>>>>>>> successfully stopped the device.
>>>>>>>>>
>>>>>>>>> AFAIK, they are not mutually exclusive, since
>>>>>>>>> they are trying to solve
>>>>>>>>> different problems.
>>>>>>>>>
>>>>>>>>> Device status - basic device facility
>>>>>>>>>
>>>>>>>>> Admin virtqueue - transport/device specific way
>>>>>>>>> to implement (part of) the
>>>>>>>>> device facility
>>>>>>>>>
>>>>>>>>>> Although you mentioned that the stopped
>>>>>>>>>> state needs to be reflected in
>>>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>>>> migrated.
>>>>>>>>> The guest won't see the real device status bit.
>>>>>>>>> VMM will shadow the device
>>>>>>>>> status bit in this case.
>>>>>>>>>
>>>>>>>>> E.g with the current vhost-vDPA, vDPA behave
>>>>>>>>> like a vhost device, guest is
>>>>>>>>> unaware of the migration.
>>>>>>>>>
>>>>>>>>> STOP status bit is set by Qemu to real virtio
>>>>>>>>> hardware. But guest will only
>>>>>>>>> see the DRIVER_OK without STOP.
>>>>>>>>>
>>>>>>>>> It's not hard to implement the nested on top,
>>>>>>>>> see the discussion initiated
>>>>>>>>> by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
>>>>>>>>> migration.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>       In fact, the VMM would need to hide
>>>>>>>>>> this bit and it's safer to
>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>> See above, VMM may choose to hide or expose the
>>>>>>>>> capability. It's useful for
>>>>>>>>> migrating a nested guest.
>>>>>>>>>
>>>>>>>>> If we design an interface that can be used in
>>>>>>>>> the nested environment, it's
>>>>>>>>> not an ideal interface.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> In addition, stateful devices need to
>>>>>>>>>> load/save non-trivial amounts of
>>>>>>>>>> data. They need DMA to do this efficiently,
>>>>>>>>>> so an admin virtqueue is a
>>>>>>>>>> good fit again.
>>>>>>>>> I don't get the point here. You still need to
>>>>>>>>> address the exact the similar
>>>>>>>>> issues for admin virtqueue: the unbound time in
>>>>>>>>> freezing the device, the
>>>>>>>>> interaction with the virtio device status state machine.
>>>>>>>> Device state state can be large so a register interface would be a
>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>> saving/loading device state.
>>>>>>> So this patch doesn't mandate a register interface, isn't it?
>>>>>> You're right, not this patch. I mentioned it because your other patch
>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
>>>>>> implements
>>>>>> it a register interface.
>>>>>>
>>>>>>> And DMA
>>>>>>> doesn't means a virtqueue, it could be a transport specific method.
>>>>>> Yes, although virtqueues are a pretty good interface that works across
>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>>>
>>>>>>> I think we need to start from defining the state of one
>>>>>>> specific device and
>>>>>>> see what is the best interface.
>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>> state and virtio-scsi is definitely more complext than virtio-blk.
>>>>>>
>>>>>> First we need agreement on whether "device state" encompasses the full
>>>>>> state of the device or just state that is unknown to the VMM.
>>>>> I think we've discussed this in the past. It can't work since:
>>>>>
>>>>> 1) The state and its format must be clearly defined in the spec
>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>> Some devices need implementation-specific state. They should still be
>>>> able to live migrate even if it means cross-implementation migration and
>>>> debug-ability is not possible.
>>>
>>> I think we need to re-visit this conclusion. Migration compatibility is
>>> pretty important, especially consider the software stack has spent a
>>> huge mount of effort in maintaining them.
>>>
>>> Say a virtio hardware would break this, this mean we will lose all the
>>> advantages of being a standard device.
>>>
>>> If we can't do live migration among:
>>>
>>> 1) different backends, e.g migrate from virtio hardware to migrate
>>> software
>>> 2) different vendors
>>>
>>> We failed to say as a standard device and the customer is in fact locked
>>> by the vendor implicitly.
>>>
>>>
>>>>> 3) Not a proper uAPI desgin
>>>> I never understood this argument. The Linux uAPI passes through lots of
>>>> opaque data from devices to userspace. Allowing an
>>>> implementation-specific device state representation is nothing new. VFIO
>>>> already does it.
>>>
>>> I think we've already had a lots of discussion for VFIO but without a
>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure if
>>> it's too late). But that's not related to virito and this thread.
>>>
>>> What you propose here is kind of conflict with the efforts of virtio. I
>>> think we all aggree that we should define the state in the spec.
>>> Assuming this is correct:
>>>
>>> 1) why do we still offer opaque migration state to userspace?
>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>> migration bytes stream?
>>>
>>> We should standardize everything that is visible by the driver to be a
>>> standard device. That's the power of virtio.
>>>
>>>
>>>>>> That's
>>>>>> basically the difference between the vhost/vDPA's selective
>>>>>> passthrough
>>>>>> approach and VFIO's full passthrough approach.
>>>>> We can't do VFIO full pasthrough for migration anyway, some kind
>>>>> of mdev is
>>>>> required but it's duplicated with the current vp_vdpa driver.
>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>> achieved without mdev:
>>>> 1. Define a migration PCI Capability that indicates support for
>>>>      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>>      the migration interface in hardware instead of an mdev driver.
>>>
>>> So I think it still depend on the driver to implement migrate state
>>> which is vendor specific.
>>>
>>> Note that it's just an uAPI definition not something defined in the PCI
>>> spec.
>>>
>>> Out of curiosity, the patch is merged without any real users in the
>>> Linux. This is very bad since we lose the change to audit the whole
>>> design.
>>>
>>>
>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>      userspace or core VFIO PCI code advertises
>>>> VFIO_REGION_TYPE_MIGRATION
>>>>      to userspace so migration can proceed in the same way as with
>>>>      VFIO/mdev drivers.
>>>> 3. The PCI Capability is not passed through to the guest.
>>>
>>> This brings troubles in the nested environment.
>>>
>>> Thanks
>>>
>>>
>>>> Changpeng Liu originally mentioned the idea of defining a migration PCI
>>>> Capability.
>>>>
>>>>>>     For example, some of the
>>>>>> virtio-net state is available to the VMM with vhost/vDPA because it
>>>>>> intercepts the virtio-net control virtqueue.
>>>>>>
>>>>>> Also, we need to decide to what degree the device state representation
>>>>>> is standardized in the VIRTIO specification.
>>>>> I think all the states must be defined in the spec otherwise the device
>>>>> can't claim it supports migration at virtio level.
>>>>>
>>>>>
>>>>>>     I think it makes sense to
>>>>>> standardize it if it's possible to convey all necessary
>>>>>> state and device
>>>>>> implementors can easily implement this device state representation.
>>>>> I doubt it's high device specific. E.g can we standardize device(GPU)
>>>>> memory?
>>>> For devices that have little internal state it's possible to define a
>>>> standard device state representation.
>>>>
>>>> For other devices, like virtio-crypto, virtio-fs, etc it becomes
>>>> difficult because the device implementation contains state that will be
>>>> needed but is very specific to the implementation. These devices *are*
>>>> migratable but they don't have standard state. Even here there is a
>>>> spectrum:
>>>> - Host OS-specific state (e.g. Linux struct file_handles)
>>>> - Library-specific state (e.g. crypto library state)
>>>> - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
>>>>
>>>> This is why I think it's necessary to support both standard device state
>>>> representations and implementation-specific device state
>>>> representations.
>>
>> Having two ways will bring extra complexity. That's why I suggest:
>>
>> - having a general facility for the virtqueue state to be migrated
>> - leaving the device-specific state device specific, so the device can
>> choose whatever way or interface is convenient.
> I don't think we have a choice. For stateful devices it can be
> impossible to define a standard device state representation.


Let me clarify: I agree we can't have a standard device state for all 
kinds of devices.

That's why I tend to leave them device specific (but not 
implementation specific).

But we can generalize the virtqueue state for sure.
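The division argued for here — a generalized virtqueue state plus a per-device-type (but not per-implementation) state — can be sketched as a structure. This is purely illustrative; the fields shown for virtio-net are hypothetical examples of what a spec-defined device state might carry:

```python
# Illustrative split of migration state: a generic per-virtqueue part
# shared by all device types, plus a device-type-specific part that
# would itself be defined in the spec (not left to implementations).
from dataclasses import dataclass, field

@dataclass
class VirtqueueState:
    avail_idx: int
    used_idx: int

@dataclass
class DeviceMigrationState:
    # Generic part: one entry per virtqueue, identical for every device.
    vq_states: list
    # Device-specific part: defined per device type in the spec, e.g.
    # virtio-net might carry its MAC and filter tables here.
    device_state: dict = field(default_factory=dict)

state = DeviceMigrationState(
    vq_states=[VirtqueueState(10, 8), VirtqueueState(3, 3)],
    device_state={"mac": "52:54:00:12:34:56"},
)
```

Any backend that implements the device type can interpret both parts, which is what keeps cross-vendor migration possible.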

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20  3:04                                     ` Jason Wang
@ 2021-07-20  8:50                                       ` Stefan Hajnoczi
  2021-07-20 10:48                                         ` Cornelia Huck
  2021-07-21  2:29                                         ` Jason Wang
  0 siblings, 2 replies; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-20  8:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
> 
> On 2021/7/19 8:45 PM, Stefan Hajnoczi wrote:
> > On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
> > > On 2021/7/16 10:03 AM, Jason Wang wrote:
> > > > On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > > > >          If I understand correctly, this is all
> > > > > > > > > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > > > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > > > > It just requires a new status bit. Anything that makes you think it's hard
> > > > > > > > > > > > > > to implement?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > > > > > > > > > > virtqueue state.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > > It use the common technology like register shadowing without any further
> > > > > > > > > > > > > > stuffs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > > > > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > > Probably not. Let me to clarify several points:
> > > > > > > > > > > >
> > > > > > > > > > > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > > > > > > > > > > virtqueue could be used for carrying any basic device facility like status
> > > > > > > > > > > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > > > > > > > > > > for device slicing at virtio level.
> > > > > > > > > > > > - Even if we had introduced admin virtqueue, we still need a per function
> > > > > > > > > > > > interface for this. This is a must for nested virtualization, we can't
> > > > > > > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > > > > > > - According to the proposal, there's no need for the device to complete all
> > > > > > > > > > > > the consumed buffers, device can choose to expose those inflight descriptors
> > > > > > > > > > > > in a device specific way and set the STOP bit. This means, if we have the
> > > > > > > > > > > > device specific in-flight descriptor reporting facility, the device can
> > > > > > > > > > > > almost set the STOP bit immediately.
> > > > > > > > > > > > - If we don't go with the basic device facility but using the admin
> > > > > > > > > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > > > > > > > > device status state machine, it will be some kind of sub-states which looks
> > > > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > > > > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > > > > > > > > time).
> > > > > > > > > > > > Well. You need some kinds of waiting for sure, the device/DMA needs sometime
> > > > > > > > > > > > to be stopped. The downtime is determined by a specific virtio
> > > > > > > > > > > > implementation which is hard to be restricted at the spec level. We can
> > > > > > > > > > > > clarify that the device must set the STOP bit in e.g 100ms.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >        It can be implemented concurrently (setting the STOP bit on all
> > > > > > > > > > > > > devices and then looping until all their Device Status fields have the
> > > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > > I still don't get what kind of complexity did you worry here.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > > waiting...
> > > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > >
> > > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > > > > > > >
> > > > > > > > > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > >
> > > > > > > > > > > > Since it was required for at least one transport. We need do something
> > > > > > > > > > > > similar to when introducing basic facility.
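To make the parallel concrete: the existing reset rule and the proposed STOP bit are both status polls with a bounded wait. A rough sketch of the driver-side loop being discussed; the helper name, bit value (the thread refers to "STOP(32)"), and timeout are illustrative assumptions, not spec text:

```python
import time

VIRTIO_STOP = 0x20  # hypothetical bit value, i.e. decimal 32


def wait_for_status_bit(read_status, bit, timeout_s=0.1):
    """Poll the device status until `bit` is set, giving up after a
    bounded wait. The same shape serves the existing reset rule
    (write 0, then poll until reads return 0)."""
    deadline = time.monotonic() + timeout_s
    while not (read_status() & bit):
        if time.monotonic() > deadline:
            return False  # device failed to stop in time
    return True


# Simulated device whose status already has STOP set:
status = 0x04 | VIRTIO_STOP
assert wait_for_status_bit(lambda: status, VIRTIO_STOP)
```

Stopping many devices concurrently would set the bit on each device first and then run this wait once per device, bounding total downtime by the slowest device instead of the sum.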
> > > > > > > > > > > Adding the STOP but as a Device Status bit
> > > > > > > > > > > is a small and clean VIRTIO
> > > > > > > > > > > spec change. I like that.
> > > > > > > > > > > 
> > > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > > unbounded. For example, software
> > > > > > > > > > > virtio-blk/scsi implementations since
> > > > > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > > 
> > > > > > > > > > > The natural interface for long-running
> > > > > > > > > > > operations is virtqueue requests.
> > > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > > instead of a Device Status bit.
> > > > > > > > > > So I'm not against the admin virtqueue. As said
> > > > > > > > > > before, admin virtqueue
> > > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > > 
> > > > > > > > > > Send a command to set STOP status bit to admin
> > > > > > > > > > virtqueue. Device will make
> > > > > > > > > > the command buffer used after it has
> > > > > > > > > > successfully stopped the device.
> > > > > > > > > > 
> > > > > > > > > > AFAIK, they are not mutually exclusive, since
> > > > > > > > > > they are trying to solve
> > > > > > > > > > different problems.
> > > > > > > > > > 
> > > > > > > > > > Device status - basic device facility
> > > > > > > > > > 
> > > > > > > > > > Admin virtqueue - transport/device specific way
> > > > > > > > > > to implement (part of) the
> > > > > > > > > > device facility
> > > > > > > > > > 
> > > > > > > > > > > Although you mentioned that the stopped
> > > > > > > > > > > state needs to be reflected in
> > > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > > migrated.
> > > > > > > > > > The guest won't see the real device status bit.
> > > > > > > > > > VMM will shadow the device
> > > > > > > > > > status bit in this case.
> > > > > > > > > > 
> > > > > > > > > > E.g with the current vhost-vDPA, vDPA behave
> > > > > > > > > > like a vhost device, guest is
> > > > > > > > > > unaware of the migration.
> > > > > > > > > > 
> > > > > > > > > > STOP status bit is set by Qemu to real virtio
> > > > > > > > > > hardware. But guest will only
> > > > > > > > > > see the DRIVER_OK without STOP.
> > > > > > > > > > 
> > > > > > > > > > It's not hard to implement the nested on top,
> > > > > > > > > > see the discussion initiated
> > > > > > > > > > by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
> > > > > > > > > > migration.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > >       In fact, the VMM would need to hide
> > > > > > > > > > > this bit and it's safer to
> > > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > > See above, VMM may choose to hide or expose the
> > > > > > > > > > capability. It's useful for
> > > > > > > > > > migrating a nested guest.
> > > > > > > > > > 
> > > > > > > > > > If we design an interface that can be used in
> > > > > > > > > > the nested environment, it's
> > > > > > > > > > not an ideal interface.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > In addition, stateful devices need to
> > > > > > > > > > > load/save non-trivial amounts of
> > > > > > > > > > > data. They need DMA to do this efficiently,
> > > > > > > > > > > so an admin virtqueue is a
> > > > > > > > > > > good fit again.
> > > > > > > > > > I don't get the point here. You still need to
> > > > > > > > > > address the exact the similar
> > > > > > > > > > issues for admin virtqueue: the unbound time in
> > > > > > > > > > freezing the device, the
> > > > > > > > > > interaction with the virtio device status state machine.
> > > > > > > > > Device state state can be large so a register interface would be a
> > > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > > saving/loading device state.
> > > > > > > > So this patch doesn't mandate a register interface, isn't it?
> > > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
> > > > > > > implements
> > > > > > > it a register interface.
> > > > > > > 
> > > > > > > > And DMA
> > > > > > > > doesn't means a virtqueue, it could be a transport specific method.
> > > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > > > 
> > > > > > > > I think we need to start from defining the state of one
> > > > > > > > specific device and
> > > > > > > > see what is the best interface.
> > > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > > state and virtio-scsi is definitely more complext than virtio-blk.
> > > > > > > 
> > > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > > state of the device or just state that is unknown to the VMM.
> > > > > > I think we've discussed this in the past. It can't work since:
> > > > > > 
> > > > > > 1) The state and its format must be clearly defined in the spec
> > > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > > Some devices need implementation-specific state. They should still be
> > > > > able to live migrate even if it means cross-implementation migration and
> > > > > debug-ability is not possible.
> > > > 
> > > > I think we need to re-visit this conclusion. Migration compatibility is
> > > > pretty important, especially consider the software stack has spent a
> > > > huge mount of effort in maintaining them.
> > > > 
> > > > Say a virtio hardware would break this, this mean we will lose all the
> > > > advantages of being a standard device.
> > > > 
> > > > If we can't do live migration among:
> > > > 
> > > > 1) different backends, e.g migrate from virtio hardware to migrate
> > > > software
> > > > 2) different vendors
> > > > 
> > > > We failed to say as a standard device and the customer is in fact locked
> > > > by the vendor implicitly.
> > > > 
> > > > 
> > > > > > 3) Not a proper uAPI desgin
> > > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > > opaque data from devices to userspace. Allowing an
> > > > > implementation-specific device state representation is nothing new. VFIO
> > > > > already does it.
> > > > 
> > > > I think we've already had a lots of discussion for VFIO but without a
> > > > conclusion. Maybe we need the verdict from Linus or Greg (not sure if
> > > > it's too late). But that's not related to virito and this thread.
> > > > 
> > > > What you propose here is kind of conflict with the efforts of virtio. I
> > > > think we all aggree that we should define the state in the spec.
> > > > Assuming this is correct:
> > > > 
> > > > 1) why do we still offer opaque migration state to userspace?
> > > > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > > > migration bytes stream?
> > > > 
> > > > We should standardize everything that is visible by the driver to be a
> > > > standard device. That's the power of virtio.
> > > > 
> > > > 
> > > > > > > That's
> > > > > > > basically the difference between the vhost/vDPA's selective
> > > > > > > passthrough
> > > > > > > approach and VFIO's full passthrough approach.
> > > > > > We can't do VFIO full pasthrough for migration anyway, some kind
> > > > > > of mdev is
> > > > > > required but it's duplicated with the current vp_vdpa driver.
> > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > achieved without mdev:
> > > > > 1. Define a migration PCI Capability that indicates support for
> > > > >      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > > > >      the migration interface in hardware instead of an mdev driver.
> > > > 
> > > > So I think it still depend on the driver to implement migrate state
> > > > which is vendor specific.
> > > > 
> > > > Note that it's just an uAPI definition not something defined in the PCI
> > > > spec.
> > > > 
> > > > Out of curiosity, the patch is merged without any real users in the
> > > > Linux. This is very bad since we lose the change to audit the whole
> > > > design.
> > > > 
> > > > 
> > > > > 2. The VMM either uses the migration PCI Capability directly from
> > > > >      userspace or core VFIO PCI code advertises
> > > > > VFIO_REGION_TYPE_MIGRATION
> > > > >      to userspace so migration can proceed in the same way as with
> > > > >      VFIO/mdev drivers.
> > > > > 3. The PCI Capability is not passed through to the guest.
> > > > 
> > > > This brings troubles in the nested environment.
> > > > 
> > > > Thanks
> > > > 
> > > > 
> > > > > Changpeng Liu originally mentioned the idea of defining a migration PCI
> > > > > Capability.
> > > > > 
> > > > > > >     For example, some of the
> > > > > > > virtio-net state is available to the VMM with vhost/vDPA because it
> > > > > > > intercepts the virtio-net control virtqueue.
> > > > > > > 
> > > > > > > Also, we need to decide to what degree the device state representation
> > > > > > > is standardized in the VIRTIO specification.
> > > > > > I think all the states must be defined in the spec otherwise the device
> > > > > > can't claim it supports migration at virtio level.
> > > > > > 
> > > > > > 
> > > > > > >     I think it makes sense to
> > > > > > > standardize it if it's possible to convey all necessary
> > > > > > > state and device
> > > > > > > implementors can easily implement this device state representation.
> > > > > > I doubt it's high device specific. E.g can we standardize device(GPU)
> > > > > > memory?
> > > > > For devices that have little internal state it's possible to define a
> > > > > standard device state representation.
> > > > > 
> > > > > For other devices, like virtio-crypto, virtio-fs, etc it becomes
> > > > > difficult because the device implementation contains state that will be
> > > > > needed but is very specific to the implementation. These devices *are*
> > > > > migratable but they don't have standard state. Even here there is a
> > > > > spectrum:
> > > > > - Host OS-specific state (e.g. Linux struct file_handles)
> > > > > - Library-specific state (e.g. crypto library state)
> > > > > - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
> > > > > 
> > > > > This is why I think it's necessary to support both standard device state
> > > > > representations and implementation-specific device state
> > > > > representations.
> > > 
> > > Having two ways will bring extra complexity. That why I suggest:
> > > 
> > > - to have general facility for the virtuqueue to be migrated
> > > - leave the device specific state to be device specific. so device can
> > > choose what is convenient way or interface.
> > I don't think we have a choice. For stateful devices it can be
> > impossible to define a standard device state representation.
> 
> 
> Let me clarify, I agree we can't have a standard device state for all kinds
> of device.
> 
> That's why I tend to leave them device specific (but not
> implementation specific).

Unfortunately device state is sometimes implementation-specific. Not
because the device is proprietary, but because the actual state is
meaningless to other implementations.

I mentioned virtiofs as an example where file system backends can be
implemented in completely different ways so the device state cannot be
migrated between implementations.

> But we can generalize the virtqueue state for sure.

I agree, and some device types can indeed standardize their device
state representations. But I think it's a technical requirement to
support implementation-specific state for device types where
cross-implementation migration is not possible.

I'm not saying the implementation-specific state representation has to
be a binary blob. There could be an identifier registry to ensure live
migration compatibility checks can be performed. There could also be a
standard binary encoding for migration data. But the contents will be
implementation-specific for some devices.
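For instance, an envelope along these lines would let a destination perform the compatibility check before loading opaque state; the magic value, header layout, and implementation IDs below are invented for illustration, not a proposed format:

```python
import struct

# Header: magic, implementation id (from a hypothetical registry),
# version, payload length -- all little-endian.
HEADER = struct.Struct("<4sIII")


def save_state(impl_id, version, payload):
    """Wrap opaque, implementation-specific state in a standard header."""
    return HEADER.pack(b"VSTA", impl_id, version, len(payload)) + payload


def load_state(blob, expected_impl_id):
    """Check migration compatibility before handing the payload over."""
    magic, impl_id, version, length = HEADER.unpack_from(blob)
    if magic != b"VSTA" or impl_id != expected_impl_id:
        raise ValueError("incompatible device state")
    return version, blob[HEADER.size:HEADER.size + length]


blob = save_state(impl_id=7, version=1, payload=b"opaque virtiofs state")
assert load_state(blob, 7) == (1, b"opaque virtiofs state")
```

The payload stays implementation-specific; only the envelope is standard, which is all the live-migration compatibility check needs.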

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20  3:02                                   ` Jason Wang
@ 2021-07-20 10:19                                     ` Stefan Hajnoczi
  2021-07-21  2:52                                       ` Jason Wang
  2021-07-20 12:27                                     ` Max Gurtovoy
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-20 10:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Tue, Jul 20, 2021 at 11:02:42AM +0800, Jason Wang wrote:
> 
> 在 2021/7/19 下午8:43, Stefan Hajnoczi 写道:
> > On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
> > > 在 2021/7/15 下午6:01, Stefan Hajnoczi 写道:
> > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/14 下午11:07, Stefan Hajnoczi 写道:
> > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/14 下午5:53, Stefan Hajnoczi 写道:
> > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > 在 2021/7/13 下午6:00, Stefan Hajnoczi 写道:
> > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
> > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > 在 2021/7/11 上午4:36, Michael S. Tsirkin 写道:
> > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > > >          If I understand correctly, this is all
> > > > > > > > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > > > It just requires a new status bit. Anything that makes you think it's hard
> > > > > > > > > > > > > to implement?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > > > > > > > > > virtqueue state.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > It use the common technology like register shadowing without any further
> > > > > > > > > > > > > stuffs.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > > > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > Probably not. Let me to clarify several points:
> > > > > > > > > > > 
> > > > > > > > > > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > > > > > > > > > virtqueue could be used for carrying any basic device facility like status
> > > > > > > > > > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > > > > > > > > > for device slicing at virtio level.
> > > > > > > > > > > - Even if we had introduced admin virtqueue, we still need a per function
> > > > > > > > > > > interface for this. This is a must for nested virtualization, we can't
> > > > > > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > > > > > - According to the proposal, there's no need for the device to complete all
> > > > > > > > > > > the consumed buffers, device can choose to expose those inflight descriptors
> > > > > > > > > > > in a device specific way and set the STOP bit. This means, if we have the
> > > > > > > > > > > device specific in-flight descriptor reporting facility, the device can
> > > > > > > > > > > almost set the STOP bit immediately.
> > > > > > > > > > > - If we don't go with the basic device facility but using the admin
> > > > > > > > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > > > > > > > device status state machine, it will be some kind of sub-states which looks
> > > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > > > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > > > > > > > time).
> > > > > > > > > > > Well. You need some kinds of waiting for sure, the device/DMA needs sometime
> > > > > > > > > > > to be stopped. The downtime is determined by a specific virtio
> > > > > > > > > > > implementation which is hard to be restricted at the spec level. We can
> > > > > > > > > > > clarify that the device must set the STOP bit in e.g 100ms.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > >        It can be implemented concurrently (setting the STOP bit on all
> > > > > > > > > > > > devices and then looping until all their Device Status fields have the
> > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > I still don't get what kind of complexity did you worry here.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > waiting...
> > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > 
> > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > > > > > > 
> > > > > > > > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > 
> > > > > > > > > > > Since it was required for at least one transport. We need do something
> > > > > > > > > > > similar to when introducing basic facility.
> > > > > > > > > > Adding the STOP but as a Device Status bit is a small and clean VIRTIO
> > > > > > > > > > spec change. I like that.
> > > > > > > > > > 
> > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > unbounded. For example, software virtio-blk/scsi implementations since
> > > > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > 
> > > > > > > > > > The natural interface for long-running operations is virtqueue requests.
> > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > instead of a Device Status bit.
> > > > > > > > > So I'm not against the admin virtqueue. As said before, admin virtqueue
> > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > 
> > > > > > > > > Send a command to set STOP status bit to admin virtqueue. Device will make
> > > > > > > > > the command buffer used after it has successfully stopped the device.
> > > > > > > > > 
> > > > > > > > > AFAIK, they are not mutually exclusive, since they are trying to solve
> > > > > > > > > different problems.
> > > > > > > > > 
> > > > > > > > > Device status - basic device facility
> > > > > > > > > 
> > > > > > > > > Admin virtqueue - transport/device specific way to implement (part of) the
> > > > > > > > > device facility
> > > > > > > > > 
> > > > > > > > > > Although you mentioned that the stopped state needs to be reflected in
> > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > migrated.
> > > > > > > > > The guest won't see the real device status bit. VMM will shadow the device
> > > > > > > > > status bit in this case.
> > > > > > > > > 
> > > > > > > > > E.g with the current vhost-vDPA, vDPA behaves like a vhost device, and the
> > > > > > > > > guest is unaware of the migration.
> > > > > > > > > 
> > > > > > > > > The STOP status bit is set by QEMU on the real virtio hardware, but the guest
> > > > > > > > > will only see DRIVER_OK without STOP.
> > > > > > > > > 
> > > > > > > > > It's not hard to implement nesting on top; see the discussion initiated
> > > > > > > > > by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live
> > > > > > > > > migration.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > >       In fact, the VMM would need to hide this bit and it's safer to
> > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > See above, VMM may choose to hide or expose the capability. It's useful for
> > > > > > > > > migrating a nested guest.
> > > > > > > > > 
> > > > > > > > > If we design an interface that can't be used in a nested environment, it's
> > > > > > > > > not an ideal interface.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > In addition, stateful devices need to load/save non-trivial amounts of
> > > > > > > > > > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > > > > > > > > > good fit again.
> > > > > > > > > I don't get the point here. You still need to address very similar
> > > > > > > > > issues for the admin virtqueue: the unbounded time in freezing the device and
> > > > > > > > > the interaction with the virtio device status state machine.
> > > > > > > > Device state can be large so a register interface would be a
> > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > saving/loading device state.
> > > > > > > So this patch doesn't mandate a register interface, does it?
> > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> > > > > > it as a register interface.
> > > > > > 
> > > > > > > And DMA
> > > > > > > doesn't mean a virtqueue; it could be a transport specific method.
> > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > > 
> > > > > > > I think we need to start from defining the state of one specific device and
> > > > > > > see what is the best interface.
> > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > > > 
> > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > state of the device or just state that is unknown to the VMM.
> > > > > I think we've discussed this in the past. It can't work since:
> > > > > 
> > > > > 1) The state and its format must be clearly defined in the spec
> > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > Some devices need implementation-specific state. They should still be
> > > > able to live migrate even if it means cross-implementation migration and
> > > > debug-ability is not possible.
> > > 
> > > I think we need to re-visit this conclusion. Migration compatibility is
> > > pretty important, especially considering the software stack has spent a huge
> > > amount of effort in maintaining it.
> > > 
> > > If virtio hardware were to break this, we would lose all the
> > > advantages of being a standard device.
> > > 
> > > If we can't do live migration among:
> > > 
> > > 1) different backends, e.g migrating from virtio hardware to software
> > > 2) different vendors
> > > 
> > > We fail to qualify as a standard device and the customer is in fact locked
> > > to the vendor implicitly.
> > My virtiofs device implementation is backed by an in-memory file system.
> > The device state includes the contents of each file.
> > 
> > Your virtiofs device implementation uses Linux file handles to keep
> > track of open files. The device state includes Linux file handles (but
> > not the contents of each file) so the destination host can access the
> > same files on shared storage.
> > 
> > Cornelia's virtiofs device implementation is backed by an object storage
> > HTTP API. The device state includes API object IDs.
> > 
> > The device state is implementation-dependent. There is no standard
> > representation and it's not possible to migrate between device
> > implementations. How are they supposed to migrate?
> 
> 
> So if I understand correctly, virtio-fs is not designed to be migratable?
> 
> (Checking the current virtio-fs support in qemu, it looks to me like it
> has a migration blocker).

The code does not support live migration but it's on the roadmap. Max
Reitz added Linux file handle support to virtiofsd. That was the first
step towards being able to migrate the device's state.

> > This is why I think it's necessary to allow implementation-specific
> > device state representations.
> 
> 
> Or you probably mean you don't support cross-backend migration. This sounds
> like a drawback, and it's actually not a standard device but a
> vendor/implementation specific device.
> 
> It would bring a lot of trouble, not only for the implementation but for
> the management. Maybe we can start by adding migration support for
> some specific backend and go from there.

Yes, it's complicated. Some implementations could be compatible, but
others can never be compatible because they have completely different
state.

The virtiofsd implementation is the main one for virtiofs and the device
state representation can be published, even standardized. Others can
implement it to achieve migration compatibility.

But it must be possible for implementations that have completely
different state to migrate too. virtiofsd isn't special.

> > > > > 3) Not a proper uAPI design
> > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > opaque data from devices to userspace. Allowing an
> > > > implementation-specific device state representation is nothing new. VFIO
> > > > already does it.
> > > 
> > > I think we've already had a lot of discussion about VFIO but without a
> > > conclusion. Maybe we need a verdict from Linus or Greg (not sure if it's
> > > too late). But that's not related to virtio and this thread.
> > > 
> > > What you propose here is kind of in conflict with the efforts of virtio. I
> > > think we all agree that we should define the state in the spec. Assuming
> > > this is correct:
> > > 
> > > 1) why do we still offer opaque migration state to userspace?
> > See above. Stateful devices may require an implementation-defined device
> > state representation.
> 
> 
> So my point still stands: it's not a standard device if we do this.

These "non-standard devices" still need to be able to migrate. How
should we do that?

> > > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > > migration bytes stream?
> > Opaque data like D-Bus VMState:
> > https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
> 
> 
> Actually, I meant how to keep the opaque state compatible with all
> the existing devices that can do migration.
> 
> E.g we want to live migrate virtio-blk among any backends (from a hardware
> device to a software backend).

There was a series of email threads last year where migration
compatibility was discussed:

https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg02620.html

I proposed an algorithm for checking migration compatibility between
devices. The source and destination device can have different
implementations (e.g. hardware, software, etc).

It involves picking an identifier like virtio-spec.org/pci/virtio-net
for the device state representation and device parameters for aspects of
the device that vary between instances (e.g. tso=on|off).

It's more complex than today's live migration approach in libvirt and
QEMU. Today libvirt configures the source and destination in a
compatible manner (thanks to knowledge of the device implementation) and
then QEMU transfers the device state.

Part of the point of defining a migration compatibility algorithm is
that it's possible to lift the assumptions out of libvirt so that
arbitrary device implementations can be supported (hardware, software,
etc) without putting knowledge about every device/VMM implementation
into libvirt.

(The other advantage is that this allows orchestration software to
determine migration compatibility before starting a migration.)

> > > > > > That's
> > > > > > basically the difference between the vhost/vDPA's selective passthrough
> > > > > > approach and VFIO's full passthrough approach.
> > > > > We can't do VFIO full passthrough for migration anyway; some kind of mdev is
> > > > > required, but it's duplicated with the current vp_vdpa driver.
> > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > achieved without mdev:
> > > > 1. Define a migration PCI Capability that indicates support for
> > > >      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > > >      the migration interface in hardware instead of an mdev driver.
> > > 
> > > So I think it still depends on the driver to implement migration state, which
> > > is vendor specific.
> > The current VFIO migration interface depends on a device-specific
> > software mdev driver but here I'm showing that the physical device can
> > implement the migration interface so that no device-specific driver code
> > is needed.
> 
> 
> This is not what I read from the patch:
> 
>  * device_state: (read/write)
>  *      - The user application writes to this field to inform the vendor
>  *        driver about the device state to be transitioned to.
>  *      - The vendor driver should take the necessary actions to change the
>  *        device state. After successful transition to a given state, the
>  *        vendor driver should return success on write(device_state, state)
>  *        system call. If the device state transition fails, the vendor
>  *        driver should return an appropriate -errno for the fault condition.
> 
> Vendor driver need to mediate between the uAPI and the actual device.

Yes, that's the current state of VFIO migration. If a hardware interface
(e.g. PCI Capability) is defined that maps to this API then no
device-specific drivers would be necessary because core VFIO PCI code
can implement the uAPI by talking to the hardware.
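A rough sketch of what such a capability layout could look like is below. Every field name, offset, and value is invented purely for illustration; nothing here comes from the VFIO uAPI, the PCI spec, or the virtio spec.

```c
#include <stdint.h>

/* Purely illustrative layout of a hypothetical migration PCI Capability
 * that core VFIO PCI code could map onto the migration region states.
 * These names and flag values are assumptions, not defined anywhere. */
#define MIG_STATE_RUNNING  0x1
#define MIG_STATE_SAVING   0x2
#define MIG_STATE_RESUMING 0x4

struct mig_pci_cap {
    uint8_t  cap_vndr;      /* generic PCI capability header */
    uint8_t  cap_next;
    uint8_t  cap_len;
    uint8_t  padding;
    uint32_t device_state;  /* written by software, polled for completion */
    uint64_t data_offset;   /* where state data is exposed for DMA */
    uint64_t data_size;     /* amount of state data currently available */
};
```

With a fixed layout like this, core VFIO PCI code could implement the existing migration uAPI by reading and writing these fields directly, with no vendor-specific driver in between.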

> > > > 2. The VMM either uses the migration PCI Capability directly from
> > > >      userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
> > > >      to userspace so migration can proceed in the same way as with
> > > >      VFIO/mdev drivers.
> > > > 3. The PCI Capability is not passed through to the guest.
> > > 
> > > This brings trouble in nested environments.
> > It depends on the device splitting/management design. If L0 wishes to
> > let L1 manage the VFs then it would need to expose a management device.
> > Since the migration interface is generic (not device-specific) a generic
> > management device solves this for all devices.
> 
> 
> Right, but it's a burden to expose the management device, or it may just
> not work.

A single generic management device is not a huge burden and it may turn
out that keeping the management device out-of-band is actually a
desirable feature if the device owner does not wish to expose the
stop/save/load functionality for some reason.

Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-16  1:44                               ` Jason Wang
  2021-07-19 12:18                                 ` [virtio-dev] " Stefan Hajnoczi
@ 2021-07-20 10:31                                 ` Cornelia Huck
  2021-07-21  2:59                                   ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-20 10:31 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Fri, Jul 16 2021, Jason Wang <jasowang@redhat.com> wrote:

> On 2021/7/15 5:16 PM, Stefan Hajnoczi wrote:
>> Stopping
>> devices sequentially increases migration downtime, so I think the
>> interface should encourage concurrently stopping multiple devices.
>>
>> I think you and Cornelia discussed that an interrupt could be added to
>> solve this problem. That would address my concerns about the STOP bit.
>
>
> The problems are:
>
> 1) if we generate an interrupt after STOP, it breaks the STOP semantics,
> where the device should not generate any interrupt
> 2) if we generate an interrupt before STOP, we may end up with race 
> conditions

I think not all interrupts are created equal here.

For virtqueue notification interrupts, I agree. If the device is being
stopped, no notification interrupts will be generated.

For device interrupts in the general sense, banning these would make it
impossible to implement STOP for CCW, as any channel program (be it
RESET, READ_STATUS, or any new one) is required to generate a status
pending/interrupt when it is finished. I also don't see how that would
create a race condition for CCW.

Why can't we simply have an interrupt indicating completion of the STOP
request, and no further interrupts after that?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20  8:50                                       ` Stefan Hajnoczi
@ 2021-07-20 10:48                                         ` Cornelia Huck
  2021-07-20 12:47                                           ` Stefan Hajnoczi
  2021-07-21  2:29                                         ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Cornelia Huck @ 2021-07-20 10:48 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic

On Tue, Jul 20 2021, Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
>> Let me clarify, I agree we can't have a standard device state for all kinds
>> of device.
>> 
>> That's why I tend to leave them to be device specific (but not
>> implementation specific).
>
> Unfortunately device state is sometimes implementation-specific. Not
> because the device is proprietary, but because the actual state is
> meaningless to other implementations.
>
> I mentioned virtiofs as an example where file system backends can be
> implemented in completely different ways so the device state cannot be
> migrated between implementations.
>
>> But we can generalize the virtqueue state for sure.
>
> I agree and also that some device types can standardize their device
> state representations. But I think it's a technical requirement to
> support implementation-specific state for device types where
> cross-implementation migration is not possible.
>
> I'm not saying the implementation-specific state representation has to
> be a binary blob. There could be an identifier registry to ensure live
> migration compatibility checks can be performed. There could also be a
> standard binary encoding for migration data. But the contents will be
> implementation-specific for some devices.

Can we at least put those implementation-specific states into some kind
of structured, standardized form? E.g. something like

<type category: file system backend data>
<type identifier: file system foo>
<length>
<data>

so that we can at least do compat checks for "I know how to handle foo"?
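For illustration, such a section header could be encoded roughly like this. The field names, sizes, and the check below are invented for this sketch; this is not a proposed wire format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding of the structured, implementation-specific state
 * section sketched above: a category, an implementation identifier, a
 * length, and opaque data. A consumer that does not recognize the
 * identifier can skip the section or refuse the migration. */
struct state_section {
    uint16_t category;   /* e.g. a "file system backend data" code */
    char     ident[32];  /* e.g. "file system foo", zero-padded */
    uint32_t len;        /* length of data[] in bytes */
    uint8_t  data[];     /* implementation-specific payload */
};

/* Compat check: "do I know how to handle foo?" */
int section_supported(const struct state_section *s,
                      const char *const *known, int n_known)
{
    for (int i = 0; i < n_known; i++)
        if (strncmp(s->ident, known[i], sizeof(s->ident)) == 0)
            return 1;
    return 0;
}
```

A destination VMM would walk the sections and refuse the migration at the first identifier it does not recognize, which gives the compat check without standardizing the payload itself.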




^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20  3:02                                   ` Jason Wang
  2021-07-20 10:19                                     ` Stefan Hajnoczi
@ 2021-07-20 12:27                                     ` Max Gurtovoy
  2021-07-20 12:57                                       ` Stefan Hajnoczi
  2021-07-21  3:09                                       ` Jason Wang
  1 sibling, 2 replies; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-20 12:27 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 7/20/2021 6:02 AM, Jason Wang wrote:
>
> On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez 
>>>>>>>>>>>>>> Martin wrote:
>>>>>>>>>>>>>>>>>          If I understand correctly, this is all
>>>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this 
>>>>>>>>>>>>>>>>> to work
>>>>>>>>>>>>>>>>> the guest must be running and already have initialised 
>>>>>>>>>>>>>>>>> the driver.
>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the 
>>>>>>>>>>>>>>> VMM as long as
>>>>>>>>>>>>>>> it intercept the relevant configuration space (PCI, 
>>>>>>>>>>>>>>> MMIO, etc) from
>>>>>>>>>>>>>>> guest's reads and writes, and present it as coherent and 
>>>>>>>>>>>>>>> transparent
>>>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a 
>>>>>>>>>>>>>>> physical device (or
>>>>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The 
>>>>>>>>>>>>>>> guest cannot stop
>>>>>>>>>>>>>>> the device, so any write to this flag is an 
>>>>>>>>>>>>>>> error/undefined.
>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can 
>>>>>>>>>>>>>>> stop the device.
>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live 
>>>>>>>>>>>>>>> migration, and the
>>>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. It 
>>>>>>>>>>>>>>> resets the
>>>>>>>>>>>>>>> destination device with the state, and then initializes 
>>>>>>>>>>>>>>> the device.
>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is 
>>>>>>>>>>>>>>> set, the source
>>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM 
>>>>>>>>>>>>>>> realizes the bit,
>>>>>>>>>>>>>>> so it sets the bit in the destination too after device 
>>>>>>>>>>>>>>> initialization.
>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it 
>>>>>>>>>>>>>>> doesn't matter
>>>>>>>>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump 
>>>>>>>>>>>>>> through though.
>>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>>> It just requires a new status bit. Anything that makes you 
>>>>>>>>>>>>> think it's hard
>>>>>>>>>>>>> to implement?
>>>>>>>>>>>>>
>>>>>>>>>>>>> E.g for networking device, it should be sufficient to use 
>>>>>>>>>>>>> this bit + the
>>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why don't we design the feature in a way that is useable 
>>>>>>>>>>>>>> by VMMs
>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>> It uses common technology like register shadowing
>>>>>>>>>>>>> without any further stuff.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>
>>>>>>>>>>>>> (I think we all know migration will be very hard if we 
>>>>>>>>>>>>> simply pass through
>>>>>>>>>>>>> those state registers).
>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device 
>>>>>>>>>>>> Status field
>>>>>>>>>>>> bit then there's no need to re-read the Device Status field 
>>>>>>>>>>>> in a loop
>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>
>>>>>>>>>>> - This proposal has nothing to do with admin virtqueue. 
>>>>>>>>>>> Actually, admin
>>>>>>>>>>> virtqueue could be used for carrying any basic device 
>>>>>>>>>>> facility like status
>>>>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue 
>>>>>>>>>>> as a "transport"
>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>> - Even if we had introduced admin virtqueue, we still need a 
>>>>>>>>>>> per function
>>>>>>>>>>> interface for this. This is a must for nested
>>>>>>>>>>> virtualization; we can't
>>>>>>>>>>> always expect things like the PF to be assigned to an L1 guest.
>>>>>>>>>>> - According to the proposal, there's no need for the device 
>>>>>>>>>>> to complete all
>>>>>>>>>>> the consumed buffers, device can choose to expose those 
>>>>>>>>>>> inflight descriptors
>>>>>>>>>>> in a device specific way and set the STOP bit. This means, 
>>>>>>>>>>> if we have the
>>>>>>>>>>> device specific in-flight descriptor reporting facility, the 
>>>>>>>>>>> device can
>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>> - If we don't go with the basic device facility but use 
>>>>>>>>>>> the admin
>>>>>>>>>>> virtqueue specific method, we still need to clarify how it 
>>>>>>>>>>> works with the
>>>>>>>>>>> device status state machine, it will be some kind of 
>>>>>>>>>>> sub-states which looks
>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy 
>>>>>>>>>>>> waiting approach
>>>>>>>>>>>> extends downtime if implemented sequentially (stopping one 
>>>>>>>>>>>> device at a
>>>>>>>>>>>> time).
>>>>>>>>>>> Well. You need some kind of waiting for sure; the
>>>>>>>>>>> device/DMA needs some time
>>>>>>>>>>> to be stopped. The downtime is determined by a specific virtio
>>>>>>>>>>> implementation which is hard to be restricted at the spec 
>>>>>>>>>>> level. We can
>>>>>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>        It can be implemented concurrently (setting the STOP 
>>>>>>>>>>>> bit on all
>>>>>>>>>>>> devices and then looping until all their Device Status 
>>>>>>>>>>>> fields have the
>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>> waiting...
>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>
>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration 
>>>>>>>>>>> structure layout
>>>>>>>>>>>
>>>>>>>>>>> After writing 0 to device_status, the driver MUST wait for a 
>>>>>>>>>>> read of
>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>
>>>>>>>>>>> Since this was already required for at least one transport, we
>>>>>>>>>>> need to do something
>>>>>>>>>>> similar when introducing a new basic facility.
>>>>>>>>>> Adding the STOP bit as a Device Status bit is a small and 
>>>>>>>>>> clean VIRTIO
>>>>>>>>>> spec change. I like that.
>>>>>>>>>>
>>>>>>>>>> On the other hand, devices need time to stop and that time 
>>>>>>>>>> can be
>>>>>>>>>> unbounded. For example, software virtio-blk/scsi
>>>>>>>>>> implementations
>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>>
>>>>>>>>>> The natural interface for long-running operations is 
>>>>>>>>>> virtqueue requests.
>>>>>>>>>> That's why I mentioned the alternative of using an admin 
>>>>>>>>>> virtqueue
>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>> So I'm not against the admin virtqueue. As said before, admin 
>>>>>>>>> virtqueue
>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>
>>>>>>>>> Send a command to set the STOP status bit via the admin virtqueue.
>>>>>>>>> The device will mark
>>>>>>>>> the command buffer as used after it has successfully stopped the
>>>>>>>>> device.
>>>>>>>>>
>>>>>>>>> AFAIK, they are not mutually exclusive, since they are trying 
>>>>>>>>> to solve
>>>>>>>>> different problems.
>>>>>>>>>
>>>>>>>>> Device status - basic device facility
>>>>>>>>>
>>>>>>>>> Admin virtqueue - transport/device specific way to implement 
>>>>>>>>> (part of) the
>>>>>>>>> device facility
>>>>>>>>>
>>>>>>>>>> Although you mentioned that the stopped state needs to be 
>>>>>>>>>> reflected in
>>>>>>>>>> the Device Status field somehow, I'm not sure about that 
>>>>>>>>>> since the
>>>>>>>>>> driver typically doesn't need to know whether the device is 
>>>>>>>>>> being
>>>>>>>>>> migrated.
>>>>>>>>> The guest won't see the real device status bit. VMM will 
>>>>>>>>> shadow the device
>>>>>>>>> status bit in this case.
>>>>>>>>>
>>>>>>>>> E.g with the current vhost-vDPA, vDPA behaves like a vhost
>>>>>>>>> device, and the guest is
>>>>>>>>> unaware of the migration.
>>>>>>>>>
>>>>>>>>> The STOP status bit is set by QEMU on the real virtio hardware,
>>>>>>>>> but the guest will only
>>>>>>>>> see DRIVER_OK without STOP.
>>>>>>>>>
>>>>>>>>> It's not hard to implement nesting on top; see the
>>>>>>>>> discussion initiated
>>>>>>>>> by Eugenio about how to expose VIRTIO_F_STOP to the guest for
>>>>>>>>> nested live
>>>>>>>>> migration.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>       In fact, the VMM would need to hide this bit and it's 
>>>>>>>>>> safer to
>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>> See above, VMM may choose to hide or expose the capability. 
>>>>>>>>> It's useful for
>>>>>>>>> migrating a nested guest.
>>>>>>>>>
>>>>>>>>> If we design an interface that can't be used in a nested
>>>>>>>>> environment, it's
>>>>>>>>> not an ideal interface.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> In addition, stateful devices need to load/save non-trivial 
>>>>>>>>>> amounts of
>>>>>>>>>> data. They need DMA to do this efficiently, so an admin 
>>>>>>>>>> virtqueue is a
>>>>>>>>>> good fit again.
>>>>>>>>> I don't get the point here. You still need to address
>>>>>>>>> very similar
>>>>>>>>> issues for the admin virtqueue: the unbounded time in freezing the
>>>>>>>>> device and the
>>>>>>>>> interaction with the virtio device status state machine.
>>>>>>>> Device state can be large so a register interface would be a
>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>> saving/loading device state.
>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>> You're right, not this patch. I mentioned it because your other 
>>>>>> patch
>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") 
>>>>>> implements
>>>>>> it as a register interface.
>>>>>>
>>>>>>> And DMA
>>>>>>> doesn't mean a virtqueue; it could be a transport specific method.
>>>>>> Yes, although virtqueues are a pretty good interface that works 
>>>>>> across
>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory 
>>>>>> layout.
>>>>>>
>>>>>>> I think we need to start from defining the state of one specific 
>>>>>>> device and
>>>>>>> see what is the best interface.
>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>
>>>>>> First we need agreement on whether "device state" encompasses the 
>>>>>> full
>>>>>> state of the device or just state that is unknown to the VMM.
>>>>> I think we've discussed this in the past. It can't work since:
>>>>>
>>>>> 1) The state and its format must be clearly defined in the spec
>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>> Some devices need implementation-specific state. They should still be
>>>> able to live migrate even if it means cross-implementation 
>>>> migration and
>>>> debug-ability is not possible.
>>>
>>> I think we need to re-visit this conclusion. Migration compatibility is
>>> pretty important, especially considering the software stack has spent a
>>> huge
>>> amount of effort in maintaining it.
>>>
>>> If virtio hardware were to break this, we would lose all the
>>> advantages of being a standard device.
>>>
>>> If we can't do live migration among:
>>>
>>> 1) different backends, e.g migrating from virtio hardware to
>>> software
>>> 2) different vendors
>>>
>>> We fail to qualify as a standard device and the customer is in fact
>>> locked to
>>> the vendor implicitly.
>> My virtiofs device implementation is backed by an in-memory file system.
>> The device state includes the contents of each file.
>>
>> Your virtiofs device implementation uses Linux file handles to keep
>> track of open files. The device state includes Linux file handles (but
>> not the contents of each file) so the destination host can access the
>> same files on shared storage.
>>
>> Cornelia's virtiofs device implementation is backed by an object storage
>> HTTP API. The device state includes API object IDs.
>>
>> The device state is implementation-dependent. There is no standard
>> representation and it's not possible to migrate between device
>> implementations. How are they supposed to migrate?
>
>
> So if I understand correctly, virtio-fs is not designed to be
> migratable?
>
> (Checking the current virtio-fs support in qemu, it looks to
> me like it has a migration blocker).
>
>
>>
>> This is why I think it's necessary to allow implementation-specific
>> device state representations.
>
>
> Or you probably mean you don't support cross-backend migration. This
> sounds like a drawback; it's then actually not a standard device but a
> vendor/implementation-specific device.
>
> It would bring a lot of trouble, not only for the implementation but
> for the management. Maybe we can begin by adding migration support for
> some specific backend and go from there.
>
>
>>
>>>>> 3) Not a proper uAPI design
>>>> I never understood this argument. The Linux uAPI passes through 
>>>> lots of
>>>> opaque data from devices to userspace. Allowing an
>>>> implementation-specific device state representation is nothing new. 
>>>> VFIO
>>>> already does it.
>>>
>>> I think we've already had a lot of discussion for VFIO but without a
>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure if
>>> it's too late). But that's not related to virtio and this thread.
>>>
>>> What you propose here conflicts with the efforts of virtio. I think
>>> we all agree that we should define the state in the spec. Assuming
>>> this is correct:
>>>
>>> 1) why do we still offer opaque migration state to userspace?
>> See above. Stateful devices may require an implementation-defined device
>> state representation.
>
>
> So my point still stands: it's not a standard device if we do this.
>
>
>>
>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>> migration bytes stream?
>> Opaque data like D-Bus VMState:
>> https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
>
>
> Actually, I meant how to keep the opaque state compatible with all the
> existing devices that can do migration.
>
> E.g. we want to live-migrate virtio-blk among arbitrary backends (from
> a hardware device to a software backend).

I'd prefer to handle HW-to-SW migration in the future.

We're still debating other basic stuff.

>
>
>>
>>>>>> That's basically the difference between vhost/vDPA's selective
>>>>>> passthrough approach and VFIO's full passthrough approach.
>>>>> We can't do VFIO full passthrough for migration anyway; some kind
>>>>> of mdev is required, but it would duplicate the current vp_vdpa
>>>>> driver.
>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>> achieved without mdev:
>>>> 1. Define a migration PCI Capability that indicates support for
>>>>      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to 
>>>> implement
>>>>      the migration interface in hardware instead of an mdev driver.
>>>
>>> So I think it still depends on the driver to implement migration
>>> state, which is vendor specific.
>> The current VFIO migration interface depends on a device-specific
>> software mdev driver but here I'm showing that the physical device can
>> implement the migration interface so that no device-specific driver code
>> is needed.
>
>
> This is not what I read from the patch:
>
>  * device_state: (read/write)
>  *      - The user application writes to this field to inform the vendor
>  *        driver about the device state to be transitioned to.
>  *      - The vendor driver should take the necessary actions to change
>  *        the device state. After successful transition to a given state,
>  *        the vendor driver should return success on the
>  *        write(device_state, state) system call. If the device state
>  *        transition fails, the vendor driver should return an
>  *        appropriate -errno for the fault condition.
>
> The vendor driver needs to mediate between the uAPI and the actual device.

We've been building an infrastructure for VFIO PCI devices over the last
few months.

It should hopefully be merged into kernel 5.15.
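For reference, the v1 VFIO migration uAPI quoted earlier in this thread is
driven by reads/writes at fixed offsets inside the migration region. A
minimal sketch in Python of how userspace would flip device_state (the
plain temp file here only stands in for a real VFIO region, and the state
bits/offsets follow the v1 struct vfio_device_migration_info layout, where
device_state is the first __u32):

```python
import os
import struct
import tempfile

# v1 VFIO migration uAPI state bits (include/uapi/linux/vfio.h at the
# time of this thread; reproduced here for illustration)
VFIO_DEVICE_STATE_STOP     = 0
VFIO_DEVICE_STATE_RUNNING  = 1 << 0
VFIO_DEVICE_STATE_SAVING   = 1 << 1
VFIO_DEVICE_STATE_RESUMING = 1 << 2

# device_state is the first __u32 of struct vfio_device_migration_info
DEVICE_STATE_OFFSET = 0

def set_device_state(fd: int, region_offset: int, state: int) -> None:
    """Request a state transition; the vendor driver (or the device
    itself, in the hardware-implemented variant discussed here) acts on
    the write."""
    os.pwrite(fd, struct.pack("<I", state),
              region_offset + DEVICE_STATE_OFFSET)

def get_device_state(fd: int, region_offset: int) -> int:
    data = os.pread(fd, 4, region_offset + DEVICE_STATE_OFFSET)
    return struct.unpack("<I", data)[0]

# Demo against a plain temp file standing in for the migration region:
with tempfile.TemporaryFile() as region:
    region.truncate(32)  # room for the whole v1 info struct
    set_device_state(region.fileno(), 0,
                     VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING)
    print(get_device_state(region.fileno(), 0))  # prints 3 (pre-copy)
```

Against a real device the fd would come from the VFIO device and the
region offset from VFIO_DEVICE_GET_REGION_INFO; the sketch only shows the
read/write protocol itself.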

>
>
>>
>>> Note that it's just an uAPI definition not something defined in the PCI
>>> spec.
>> Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI into a
>> standard PCI Capability to eliminate the need for device-specific
>> drivers.
>
>
> Ok.
>
>
>>
>>> Out of curiosity, the patch was merged without any real users in
>>> Linux. This is very bad since we lose the chance to audit the whole
>>> design.
>> I agree. It would have helped to have a complete vision for how live
>> migration should work along with demos. I don't see any migration code
>> in samples/vfio-mdev/ :(.
>
>
> Right.

Creating a standard is not related to either Linux or VFIO.

With the proposal that I've sent, we can develop a migration driver and 
virtio device that will support it (NVIDIA virtio-blk SNAP device).

And you can build live migration support into the virtio_vdpa driver (if
a vDPA migration protocol is implemented).

>
>
>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>      userspace or core VFIO PCI code advertises 
>>>> VFIO_REGION_TYPE_MIGRATION
>>>>      to userspace so migration can proceed in the same way as with
>>>>      VFIO/mdev drivers.
>>>> 3. The PCI Capability is not passed through to the guest.
>>>
>>> This brings trouble in nested environments.
>> It depends on the device splitting/management design. If L0 wishes to
>> let L1 manage the VFs then it would need to expose a management device.
>> Since the migration interface is generic (not device-specific) a generic
>> management device solves this for all devices.
>
>
> Right, but it's a burden to expose the management device, or it may
> just not work.
>
> Thanks
>
>
>>
>> Stefan
>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 10:48                                         ` Cornelia Huck
@ 2021-07-20 12:47                                           ` Stefan Hajnoczi
  0 siblings, 0 replies; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-20 12:47 UTC (permalink / raw)
  To: Cornelia Huck
  Cc: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev, Max Gurtovoy,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 1866 bytes --]

On Tue, Jul 20, 2021 at 12:48:43PM +0200, Cornelia Huck wrote:
> On Tue, Jul 20 2021, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> > On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
> >> Let me clarify, I agree we can't have a standard device state for all kinds
> >> of device.
> >> 
> >> That's why I tend to leave them to be device specific (but not
> >> implementation specific).
> >
> > Unfortunately device state is sometimes implementation-specific. Not
> > because the device is proprietary, but because the actual state is
> > meaningless to other implementations.
> >
> > I mentioned virtiofs as an example where file system backends can be
> > implemented in completely different ways so the device state cannot be
> > migrated between implementations.
> >
> >> But we can generalize the virtqueue state for sure.
> >
> > I agree and also that some device types can standardize their device
> > state representations. But I think it's a technical requirement to
> > support implementation-specific state for device types where
> > cross-implementation migration is not possible.
> >
> > I'm not saying the implementation-specific state representation has to
> > be a binary blob. There could be an identifier registry to ensure live
> > migration compatibility checks can be performed. There could also be a
> > standard binary encoding for migration data. But the contents will be
> > implementation-specific for some devices.
> 
> Can we at least put those implementation-specific states into some kind
> of structured, standardized form? E.g. something like
> 
> <type category: file system backend data>
> <type identifier: file system foo>
> <length>
> <data>
> 
> so that we can at least do compat checks for "I know how to handle foo"?

Yes, that's what I was trying to describe.
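As an illustration, the structured form Cornelia sketches above could be
encoded roughly like this (a minimal Python sketch; the field widths and
the category value are made up for illustration, not taken from any spec):

```python
import struct

def encode_state_blob(category: int, identifier: bytes, data: bytes) -> bytes:
    """Wrap implementation-specific state as
    <category> <identifier-length> <identifier> <data-length> <data>."""
    return (struct.pack("<IH", category, len(identifier)) + identifier
            + struct.pack("<I", len(data)) + data)

def decode_state_blob(blob: bytes):
    """Parse the header so a destination can do the compat check without
    understanding the opaque payload."""
    category, ident_len = struct.unpack_from("<IH", blob, 0)
    identifier = blob[6:6 + ident_len]
    (data_len,) = struct.unpack_from("<I", blob, 6 + ident_len)
    data = blob[10 + ident_len:10 + ident_len + data_len]
    return category, identifier, data

CATEGORY_FS_BACKEND = 1  # made-up registry value for the sketch

blob = encode_state_blob(CATEGORY_FS_BACKEND, b"file-system-foo", b"\x00\x01")
category, identifier, data = decode_state_blob(blob)
# destination checks "do I know how to handle foo?" before loading data
assert identifier == b"file-system-foo"
```

The point is only the framing: the identifier is standard and checkable,
while the payload stays opaque to implementations that don't recognize it.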

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 12:27                                     ` Max Gurtovoy
@ 2021-07-20 12:57                                       ` Stefan Hajnoczi
  2021-07-20 13:09                                         ` Max Gurtovoy
  2021-07-21  3:09                                       ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-20 12:57 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 4631 bytes --]

On Tue, Jul 20, 2021 at 03:27:00PM +0300, Max Gurtovoy wrote:
> On 7/20/2021 6:02 AM, Jason Wang wrote:
> > On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
> > > On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
> > > > On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > That's basically the difference between vhost/vDPA's selective
> > > > > > > passthrough approach and VFIO's full passthrough approach.
> > > > > > We can't do VFIO full passthrough for migration anyway; some kind
> > > > > > of mdev is required, but it would duplicate the current vp_vdpa
> > > > > > driver.
> > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > achieved without mdev:
> > > > > 1. Define a migration PCI Capability that indicates support for
> > > > >      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device
> > > > > to implement
> > > > >      the migration interface in hardware instead of an mdev driver.
> > > > 
> > > > So I think it still depends on the driver to implement migration
> > > > state, which is vendor specific.
> > > The current VFIO migration interface depends on a device-specific
> > > software mdev driver but here I'm showing that the physical device can
> > > implement the migration interface so that no device-specific driver code
> > > is needed.
> > 
> > 
> > This is not what I read from the patch:
> > 
> >  * device_state: (read/write)
> >  *      - The user application writes to this field to inform the vendor
> >  *        driver about the device state to be transitioned to.
> >  *      - The vendor driver should take the necessary actions to change
> >  *        the device state. After successful transition to a given state,
> >  *        the vendor driver should return success on the
> >  *        write(device_state, state) system call. If the device state
> >  *        transition fails, the vendor driver should return an
> >  *        appropriate -errno for the fault condition.
> >
> > The vendor driver needs to mediate between the uAPI and the actual device.
> 
> We've been building an infrastructure for VFIO PCI devices over the last
> few months.
>
> It should hopefully be merged into kernel 5.15.

Do you have links to patch series or a brief description of the VFIO API
features that are on the roadmap?

> 
> > 
> > 
> > > 
> > > > Note that it's just an uAPI definition not something defined in the PCI
> > > > spec.
> > > Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI into a
> > > standard PCI Capability to eliminate the need for device-specific
> > > drivers.
> > 
> > 
> > Ok.
> > 
> > 
> > > 
> > > Out of curiosity, the patch was merged without any real users in
> > > Linux. This is very bad since we lose the chance to audit the whole
> > > design.
> > > I agree. It would have helped to have a complete vision for how live
> > > migration should work along with demos. I don't see any migration code
> > > in samples/vfio-mdev/ :(.
> > 
> > 
> > Right.
> 
> Creating a standard is not related to either Linux or VFIO.
> 
> With the proposal that I've sent, we can develop a migration driver and
> virtio device that will support it (NVIDIA virtio-blk SNAP device).
> 
> And you can build live migration support into the virtio_vdpa driver (if
> a vDPA migration protocol is implemented).

I guess "VDPA migration protocol" is not referring to Jason's proposal
but instead to a new vDPA interface that adds the migration interface
from your proposal to vDPA?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 12:57                                       ` Stefan Hajnoczi
@ 2021-07-20 13:09                                         ` Max Gurtovoy
  2021-07-21  3:06                                           ` Jason Wang
  2021-07-21 10:48                                           ` Stefan Hajnoczi
  0 siblings, 2 replies; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-20 13:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 7/20/2021 3:57 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 20, 2021 at 03:27:00PM +0300, Max Gurtovoy wrote:
>> On 7/20/2021 6:02 AM, Jason Wang wrote:
>>> On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
>>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>> That's basically the difference between vhost/vDPA's selective
>>>>>>>> passthrough approach and VFIO's full passthrough approach.
>>>>>>> We can't do VFIO full passthrough for migration anyway; some kind
>>>>>>> of mdev is required, but it would duplicate the current vp_vdpa
>>>>>>> driver.
>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>> achieved without mdev:
>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device
>>>>>> to implement
>>>>>>       the migration interface in hardware instead of an mdev driver.
>>>>> So I think it still depends on the driver to implement migration
>>>>> state, which is vendor specific.
>>>> The current VFIO migration interface depends on a device-specific
>>>> software mdev driver but here I'm showing that the physical device can
>>>> implement the migration interface so that no device-specific driver code
>>>> is needed.
>>>
>>> This is not what I read from the patch:
>>>
>>>   * device_state: (read/write)
>>>   *      - The user application writes to this field to inform the vendor
>>>   *        driver about the device state to be transitioned to.
>>>   *      - The vendor driver should take the necessary actions to change
>>>   *        the device state. After successful transition to a given state,
>>>   *        the vendor driver should return success on the
>>>   *        write(device_state, state) system call. If the device state
>>>   *        transition fails, the vendor driver should return an
>>>   *        appropriate -errno for the fault condition.
>>>
>>> The vendor driver needs to mediate between the uAPI and the actual device.
>> We've been building an infrastructure for VFIO PCI devices over the last
>> few months.
>>
>> It should hopefully be merged into kernel 5.15.
> Do you have links to patch series or a brief description of the VFIO API
> features that are on the roadmap?

We divided it into a few patchsets.

The entire series can be found at:

https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci

We'll first add suspend/resume support for mlx5 devices (ConnectX-6 and
above).

The driver is ready in the series above.

The next step is to add live migration for virtio. Hopefully we will be
able to agree on the virtio PCI interface in the near future.

>
>>>
>>>>> Note that it's just an uAPI definition not something defined in the PCI
>>>>> spec.
>>>> Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI into a
>>>> standard PCI Capability to eliminate the need for device-specific
>>>> drivers.
>>>
>>> Ok.
>>>
>>>
>>>>> Out of curiosity, the patch was merged without any real users in
>>>>> Linux. This is very bad since we lose the chance to audit the whole
>>>>> design.
>>>> I agree. It would have helped to have a complete vision for how live
>>>> migration should work along with demos. I don't see any migration code
>>>> in samples/vfio-mdev/ :(.
>>>
>>> Right.
>> Creating a standard is not related to either Linux or VFIO.
>>
>> With the proposal that I've sent, we can develop a migration driver and
>> virtio device that will support it (NVIDIA virtio-blk SNAP device).
>>
>> And you can build live migration support into the virtio_vdpa driver (if
>> a vDPA migration protocol is implemented).
> I guess "VDPA migration protocol" is not referring to Jason's proposal
> but instead to a new vDPA interface that adds the migration interface
> from your proposal to vDPA?

I mean a parallel protocol to VFIO.


> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20  8:50                                       ` Stefan Hajnoczi
  2021-07-20 10:48                                         ` Cornelia Huck
@ 2021-07-21  2:29                                         ` Jason Wang
  2021-07-21 10:20                                           ` Stefan Hajnoczi
  1 sibling, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-21  2:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/20 4:50 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
>> On 2021/7/19 8:45 PM, Stefan Hajnoczi wrote:
>>> On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
>>>> On 2021/7/16 10:03 AM, Jason Wang wrote:
>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>>>> If I understand correctly, this is all driven from the
>>>>>>>>>>>>>>>>>>> driver inside the guest, so for this to work the guest
>>>>>>>>>>>>>>>>>>> must be running and already have initialised the driver.
>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the
>>>>>>>>>>>>>>>>> VMM as long as it intercepts the relevant configuration
>>>>>>>>>>>>>>>>> space (PCI, MMIO, etc.) from the guest's reads and writes,
>>>>>>>>>>>>>>>>> and presents it as coherent and transparent for the guest.
>>>>>>>>>>>>>>>>> Some use cases I can imagine with a physical device (or
>>>>>>>>>>>>>>>>> vp_vdpa device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest
>>>>>>>>>>>>>>>>> cannot stop the device, so any write to this flag is an
>>>>>>>>>>>>>>>>> error/undefined.
>>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can
>>>>>>>>>>>>>>>>> stop the device.
>>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration,
>>>>>>>>>>>>>>>>> and the guest does not write to STOP at any moment of the
>>>>>>>>>>>>>>>>> LM. It resets the destination device with the state, and
>>>>>>>>>>>>>>>>> then initializes the device.
>>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set,
>>>>>>>>>>>>>>>>> the source VMM migrates the device status. The destination
>>>>>>>>>>>>>>>>> VMM realizes the bit, so it sets the bit in the
>>>>>>>>>>>>>>>>> destination too after device initialization.
>>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest, so it
>>>>>>>>>>>>>>>>> doesn't matter what bit the HW has, but the VM can be
>>>>>>>>>>>>>>>>> migrated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump
>>>>>>>>>>>>>>>> through though. It's also not easy for devices to implement.
>>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes you
>>>>>>>>>>>>>>> think it's hard to implement?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E.g. for a networking device, it should be sufficient to
>>>>>>>>>>>>>>> use this bit plus the virtqueue state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Why don't we design the feature in a way that is usable by
>>>>>>>>>>>>>>>> VMMs and implementable by devices in a simple way?
>>>>>>>>>>>>>>> It uses common techniques like register shadowing without
>>>>>>>>>>>>>>> any further machinery.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (I think we all know migration will be very hard if we
>>>>>>>>>>>>>>> simply pass through those state registers.)
>>>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device
>>>>>>>>>>>>>> Status field bit, then there's no need to re-read the Device
>>>>>>>>>>>>>> Status field in a loop until the device has stopped.
>>>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - This proposal has nothing to do with the admin virtqueue.
>>>>>>>>>>>>> Actually, the admin virtqueue could be used for carrying any
>>>>>>>>>>>>> basic device facility like the status bit. E.g. I'm going to
>>>>>>>>>>>>> post patches that use the admin virtqueue as a "transport"
>>>>>>>>>>>>> for device slicing at the virtio level.
>>>>>>>>>>>>> - Even if we had introduced an admin virtqueue, we would
>>>>>>>>>>>>> still need a per-function interface for this. This is a must
>>>>>>>>>>>>> for nested virtualization; we can't always expect things like
>>>>>>>>>>>>> the PF to be assignable to the L1 guest.
>>>>>>>>>>>>> - According to the proposal, there's no need for the device
>>>>>>>>>>>>> to complete all the consumed buffers; the device can choose
>>>>>>>>>>>>> to expose those in-flight descriptors in a device-specific
>>>>>>>>>>>>> way and set the STOP bit. This means, if we have a
>>>>>>>>>>>>> device-specific in-flight descriptor reporting facility, the
>>>>>>>>>>>>> device can set the STOP bit almost immediately.
>>>>>>>>>>>>> - If we don't go with the basic device facility but use the
>>>>>>>>>>>>> admin-virtqueue-specific method, we still need to clarify how
>>>>>>>>>>>>> it works with the device status state machine; it would be
>>>>>>>>>>>>> some kind of sub-state, which looks much more complicated
>>>>>>>>>>>>> than the current proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> When migrating a guest with many VIRTIO devices, a busy
>>>>>>>>>>>>>> waiting approach extends downtime if implemented
>>>>>>>>>>>>>> sequentially (stopping one device at a time).
>>>>>>>>>>>>> Well, you need some kind of waiting for sure; the device/DMA
>>>>>>>>>>>>> needs some time to be stopped. The downtime is determined by
>>>>>>>>>>>>> a specific virtio implementation, which is hard to restrict
>>>>>>>>>>>>> at the spec level. We could clarify that the device must set
>>>>>>>>>>>>> the STOP bit within e.g. 100ms.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> It can be implemented concurrently (setting the STOP bit on
>>>>>>>>>>>>>> all devices and then looping until all their Device Status
>>>>>>>>>>>>>> fields have the bit set), but this becomes more complex to
>>>>>>>>>>>>>> implement.
>>>>>>>>>>>>> I still don't get what kind of complexity you're worried about here.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>>>> waiting...
>>>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common
>>>>>>>>>>>>> configuration structure layout
>>>>>>>>>>>>>
>>>>>>>>>>>>> After writing 0 to device_status, the
>>>>>>>>>>>>> driver MUST wait for a read of
>>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It was already required for at least one transport; we need
>>>>>>>>>>>>> to do something similar when introducing a basic facility.
>>>>>>>>>>>> Adding the STOP bit as a Device Status bit is a small and
>>>>>>>>>>>> clean VIRTIO spec change. I like that.
>>>>>>>>>>>>
>>>>>>>>>>>> On the other hand, devices need time to stop and that time
>>>>>>>>>>>> can be unbounded. For example, software virtio-blk/scsi
>>>>>>>>>>>> implementations cannot immediately cancel in-flight I/O
>>>>>>>>>>>> requests on Linux hosts.
>>>>>>>>>>>>
>>>>>>>>>>>> The natural interface for long-running operations is
>>>>>>>>>>>> virtqueue requests. That's why I mentioned the alternative of
>>>>>>>>>>>> using an admin virtqueue instead of a Device Status bit.
>>>>>>>>>>> So I'm not against the admin virtqueue. As said before, the
>>>>>>>>>>> admin virtqueue could be used for carrying the device status
>>>>>>>>>>> bit.
>>>>>>>>>>>
>>>>>>>>>>> Send a command to set the STOP status bit to the admin
>>>>>>>>>>> virtqueue. The device will make the command buffer used after
>>>>>>>>>>> it has successfully stopped the device.
>>>>>>>>>>>
>>>>>>>>>>> AFAIK, they are not mutually exclusive, since they are trying
>>>>>>>>>>> to solve different problems.
>>>>>>>>>>>
>>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>>
>>>>>>>>>>> Admin virtqueue - transport/device specific way to implement
>>>>>>>>>>> (part of) the device facility
>>>>>>>>>>>
>>>>>>>>>>>> Although you mentioned that the stopped state needs to be
>>>>>>>>>>>> reflected in the Device Status field somehow, I'm not sure
>>>>>>>>>>>> about that since the driver typically doesn't need to know
>>>>>>>>>>>> whether the device is being migrated.
>>>>>>>>>>> The guest won't see the real device status bit. The VMM will
>>>>>>>>>>> shadow the device status bit in this case.
>>>>>>>>>>>
>>>>>>>>>>> E.g. with the current vhost-vDPA, vDPA behaves like a vhost
>>>>>>>>>>> device; the guest is unaware of the migration.
>>>>>>>>>>>
>>>>>>>>>>> The STOP status bit is set by QEMU on the real virtio
>>>>>>>>>>> hardware, but the guest will only see DRIVER_OK without STOP.
>>>>>>>>>>>
>>>>>>>>>>> It's not hard to implement nesting on top; see the discussion
>>>>>>>>>>> initiated by Eugenio about how to expose VIRTIO_F_STOP to the
>>>>>>>>>>> guest for nested live migration.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> In fact, the VMM would need to hide this bit, and it's safer
>>>>>>>>>>>> to keep it out-of-band instead of risking exposing it by
>>>>>>>>>>>> accident.
>>>>>>>>>>> See above; the VMM may choose to hide or expose the
>>>>>>>>>>> capability. It's useful for migrating a nested guest.
>>>>>>>>>>>
>>>>>>>>>>> If we design an interface that can't be used in the nested
>>>>>>>>>>> environment, it's not an ideal interface.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> In addition, stateful devices need to load/save non-trivial
>>>>>>>>>>>> amounts of data. They need DMA to do this efficiently, so an
>>>>>>>>>>>> admin virtqueue is a good fit again.
>>>>>>>>>>> I don't get the point here. You still need to address similar
>>>>>>>>>>> issues for the admin virtqueue: the unbounded time in freezing
>>>>>>>>>>> the device, and the interaction with the virtio device status
>>>>>>>>>>> state machine.
>>>>>>>>>> Device state can be large, so a register interface would be a
>>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>>> saving/loading device state.
>>>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>>>> You're right, not this patch. I mentioned it because your other
>>>>>>>> patch series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
>>>>>>>> implements it as a register interface.
>>>>>>>>
>>>>>>>>> And DMA doesn't mean a virtqueue; it could be a
>>>>>>>>> transport-specific method.
>>>>>>>> Yes, although virtqueues are a pretty good interface that works
>>>>>>>> across transports (PCI/MMIO/etc.) thanks to the standard vring
>>>>>>>> memory layout.
>>>>>>>>
>>>>>>>>> I think we need to start from defining the state of one
>>>>>>>>> specific device and
>>>>>>>>> see what is the best interface.
>>>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>>
>>>>>>>> First we need agreement on whether "device state" encompasses the full
>>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>>
>>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>>> Some devices need implementation-specific state. They should still be
>>>>>> able to live migrate even if it means cross-implementation migration and
>>>>>> debug-ability is not possible.
>>>>> I think we need to revisit this conclusion. Migration compatibility is
>>>>> pretty important, especially considering the software stack has spent a
>>>>> huge amount of effort maintaining it.
>>>>>
>>>>> If a virtio hardware device broke this, we would lose all the
>>>>> advantages of being a standard device.
>>>>>
>>>>> If we can't do live migration among:
>>>>>
>>>>> 1) different backends, e.g. migrating from virtio hardware to virtio
>>>>> software
>>>>> 2) different vendors
>>>>>
>>>>> We fail as a standard device and the customer is in fact implicitly
>>>>> locked in by the vendor.
>>>>>
>>>>>
>>>>>>> 3) Not a proper uAPI design
>>>>>> I never understood this argument. The Linux uAPI passes through lots of
>>>>>> opaque data from devices to userspace. Allowing an
>>>>>> implementation-specific device state representation is nothing new. VFIO
>>>>>> already does it.
>>>>> I think we've already had a lot of discussion about VFIO but without a
>>>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure if
>>>>> it's too late). But that's not related to virtio or this thread.
>>>>>
>>>>> What you propose here kind of conflicts with the efforts of virtio. I
>>>>> think we all agree that we should define the state in the spec.
>>>>> Assuming this is correct:
>>>>>
>>>>> 1) why do we still offer opaque migration state to userspace?
>>>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>>>> migration byte stream?
>>>>>
>>>>> We should standardize everything that is visible to the driver to be a
>>>>> standard device. That's the power of virtio.
>>>>>
>>>>>
>>>>>>>> That's
>>>>>>>> basically the difference between the vhost/vDPA's selective
>>>>>>>> passthrough
>>>>>>>> approach and VFIO's full passthrough approach.
>>>>>>> We can't do VFIO full passthrough for migration anyway; some kind
>>>>>>> of mdev is
>>>>>>> required, but that duplicates the current vp_vdpa driver.
>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>> achieved without mdev:
>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>>>>       the migration interface in hardware instead of an mdev driver.
>>>>> So I think it still depends on the driver to implement migration
>>>>> state, which is vendor specific.
>>>>>
>>>>> Note that it's just a uAPI definition, not something defined in the PCI
>>>>> spec.
>>>>>
>>>>> Out of curiosity, the patch was merged without any real users in
>>>>> Linux. This is very bad since we lose the chance to audit the whole
>>>>> design.
>>>>>
>>>>>
>>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>>       userspace or core VFIO PCI code advertises
>>>>>> VFIO_REGION_TYPE_MIGRATION
>>>>>>       to userspace so migration can proceed in the same way as with
>>>>>>       VFIO/mdev drivers.
>>>>>> 3. The PCI Capability is not passed through to the guest.
>>>>> This brings trouble in nested environments.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> Changpeng Liu originally mentioned the idea of defining a migration PCI
>>>>>> Capability.
>>>>>>
>>>>>>>>      For example, some of the
>>>>>>>> virtio-net state is available to the VMM with vhost/vDPA because it
>>>>>>>> intercepts the virtio-net control virtqueue.
>>>>>>>>
>>>>>>>> Also, we need to decide to what degree the device state representation
>>>>>>>> is standardized in the VIRTIO specification.
>>>>>>> I think all the states must be defined in the spec otherwise the device
>>>>>>> can't claim it supports migration at virtio level.
>>>>>>>
>>>>>>>
>>>>>>>>      I think it makes sense to
>>>>>>>> standardize it if it's possible to convey all necessary
>>>>>>>> state and device
>>>>>>>> implementors can easily implement this device state representation.
>>>>>>> I suspect it's highly device specific. E.g. can we standardize device (GPU)
>>>>>>> memory?
>>>>>> For devices that have little internal state it's possible to define a
>>>>>> standard device state representation.
>>>>>>
>>>>>> For other devices, like virtio-crypto, virtio-fs, etc it becomes
>>>>>> difficult because the device implementation contains state that will be
>>>>>> needed but is very specific to the implementation. These devices *are*
>>>>>> migratable but they don't have standard state. Even here there is a
>>>>>> spectrum:
>>>>>> - Host OS-specific state (e.g. Linux struct file_handles)
>>>>>> - Library-specific state (e.g. crypto library state)
>>>>>> - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
>>>>>>
>>>>>> This is why I think it's necessary to support both standard device state
>>>>>> representations and implementation-specific device state
>>>>>> representations.
>>>> Having two ways will bring extra complexity. That's why I suggest:
>>>>
>>>> - have a general facility for the virtqueue to be migrated
>>>> - leave the device-specific state to be device specific, so the device
>>>> can choose a convenient method or interface.
>>> I don't think we have a choice. For stateful devices it can be
>>> impossible to define a standard device state representation.
>>
>> Let me clarify, I agree we can't have a standard device state for all kinds
>> of device.
>>
>> That's why I tend to leave them device specific (but not
>> implementation specific).
> Unfortunately device state is sometimes implementation-specific. Not
> because the device is proprietary, but because the actual state is
> meaningless to other implementations.
>
> I mentioned virtiofs as an example where file system backends can be
> implemented in completely different ways so the device state cannot be
> migrated between implementations.


So let me clarify my understanding: we have two kinds of state:

1) implementation-specific state that is not visible to the driver
2) device-specific state that is visible to the driver

We have no interest in 1).

For 2), that is what needs to be defined in the spec. If we fail to
generalize the device-specific state, it can't be used by a standard
virtio driver. Or maybe you can give a concrete example of how virtio-fs
fails in doing this?


>
>> But we can generalize the virtqueue state for sure.
> I agree and also that some device types can standardize their device
> state representations. But I think it's a technical requirement to
> support implementation-specific state for device types where
> cross-implementation migration is not possible.


A question here: if the driver depends on implementation-specific
state, how can we make sure that driver works with other
implementations? And if we're sure that a single driver can work with
all kinds of implementations, that means we have device-specific state,
not implementation-specific state.
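By contrast, the virtqueue state that the quoted text above says can be generalized really is small: for a split virtqueue it is essentially two 16-bit indices, split into available and used state as in the first patch of this series. A toy model in Python to illustrate the save/restore flow (the class and field names are mine, not from the spec or the patches):

```python
# Illustrative model of per-virtqueue state for a split virtqueue.
# The available state is the index of the next descriptor the device
# will read; the used state is the device's internal used index, which
# must be exposed separately because the used ring is read-only for
# the driver.

class SplitVirtqueueState:
    def __init__(self, last_avail_idx=0, used_idx=0):
        self.last_avail_idx = last_avail_idx & 0xFFFF  # 16-bit wrapping index
        self.used_idx = used_idx & 0xFFFF

    def save(self):
        """Source side: read out the state once STOP is observed."""
        return {"avail": self.last_avail_idx, "used": self.used_idx}

    @classmethod
    def restore(cls, saved):
        """Destination side: write the state after FEATURES_OK, before DRIVER_OK."""
        return cls(saved["avail"], saved["used"])

src = SplitVirtqueueState(last_avail_idx=42, used_idx=40)  # 2 buffers in flight
dst = SplitVirtqueueState.restore(src.save())
print(dst.last_avail_idx, dst.used_idx)
```

Anything beyond these indices (e.g. in-flight request contents) is exactly the device-specific part being debated here.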


>
> I'm not saying the implementation-specific state representation has to
> be a binary blob. There could be an identifier registry to ensure live
> migration compatibility checks can be performed. There could also be a
> standard binary encoding for migration data.


Yes, such requirements have been well studied in the past. There are
plenty of protocols that do this.
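As a minimal sketch of what such a protocol looks like: a state blob tagged with an identifier and a version, so the destination can refuse incompatible state before loading it. Everything here (the magic tag, the JSON payload, the field names) is invented for illustration; nothing comes from the virtio spec:

```python
import json
import struct

MAGIC = b"VSTA"  # arbitrary tag invented for this sketch

def encode_state(device_id, version, state):
    """Wrap a device-state dict with an identifier and version header."""
    payload = json.dumps({"id": device_id, "version": version,
                          "state": state}).encode()
    return MAGIC + struct.pack("<I", len(payload)) + payload

def decode_state(blob, expected_id, max_version):
    """Refuse blobs from a different implementation or a newer version."""
    if blob[:4] != MAGIC:
        raise ValueError("not a state blob")
    (length,) = struct.unpack("<I", blob[4:8])
    doc = json.loads(blob[8:8 + length])
    if doc["id"] != expected_id:
        raise ValueError("incompatible implementation: %s" % doc["id"])
    if doc["version"] > max_version:
        raise ValueError("state version too new")
    return doc["state"]

blob = encode_state("example.org/virtio-net", 1,
                    {"mac": "52:54:00:12:34:56"})
print(decode_state(blob, "example.org/virtio-net", max_version=2))
```

The identifier check is what turns an opaque blob into something a management stack can reason about for compatibility.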


>   But the contents will be
> implementation-specific for some devices.


If we allow this, it breaks the spec's effort to define standard
devices. And it will block real customers.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 10:19                                     ` Stefan Hajnoczi
@ 2021-07-21  2:52                                       ` Jason Wang
  2021-07-21 10:42                                         ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-21  2:52 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/20 6:19 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 20, 2021 at 11:02:42AM +0800, Jason Wang wrote:
>> On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>>>           If I understand correctly, this is all
>>>>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>>>>>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>>>>>>>>>>>> it intercept the relevant configuration space (PCI, MMIO, etc) from
>>>>>>>>>>>>>>>> guest's reads and writes, and present it as coherent and transparent
>>>>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>>>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest, so it doesn't matter
>>>>>>>>>>>>>>>> what bit the HW has, but the VM can be migrated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>>>>>>>>>>>> to implement?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> E.g for networking device, it should be sufficient to use this bit + the
>>>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>>> It uses common techniques like register shadowing without anything
>>>>>>>>>>>>>> further.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I think we all know migration will be very hard if we simply pass through
>>>>>>>>>>>>>> those state registers).
>>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>>>>>>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>>
>>>>>>>>>>>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>>>>>>>>>>>> virtqueue could be used for carrying any basic device facility like status
>>>>>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>>> - Even if we had introduced admin virtqueue, we still need a per function
>>>>>>>>>>>> interface for this. This is a must for nested virtualization, we can't
>>>>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>>>>> - According to the proposal, there's no need for the device to complete all
>>>>>>>>>>>> the consumed buffers, device can choose to expose those inflight descriptors
>>>>>>>>>>>> in a device specific way and set the STOP bit. This means, if we have the
>>>>>>>>>>>> device specific in-flight descriptor reporting facility, the device can
>>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>>> - If we don't go with the basic device facility but using the admin
>>>>>>>>>>>> virtqueue specific method, we still need to clarify how it works with the
>>>>>>>>>>>> device status state machine, it will be some kind of sub-states which looks
>>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>>>>>>>> time).
>>>>>>>>>>>> Well, you need some kind of waiting for sure; the device/DMA needs some time
>>>>>>>>>>>> to be stopped. The downtime is determined by the specific virtio
>>>>>>>>>>>> implementation, which is hard to restrict at the spec level. We could
>>>>>>>>>>>> clarify that the device must set the STOP bit within e.g. 100ms.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>         It can be implemented concurrently (setting the STOP bit on all
>>>>>>>>>>>>> devices and then looping until all their Device Status fields have the
>>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>>> I still don't get what kind of complexity you're worried about here.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>>> waiting...
>>>>>>>>>>>> Busy waiting is not something introduced by this patch:
>>>>>>>>>>>>
>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>>>>>>>>
>>>>>>>>>>>> After writing 0 to device_status, the driver MUST wait for a read of
>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>
>>>>>>>>>>>> Since that was required for at least one transport, we need to do something
>>>>>>>>>>>> similar when introducing this basic facility.
>>>>>>>>>>> Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
>>>>>>>>>>> spec change. I like that.
>>>>>>>>>>>
>>>>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>>>>> unbounded. For example, software virtio-blk/scsi implementations
>>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>>>
>>>>>>>>>>> The natural interface for long-running operations is virtqueue requests.
>>>>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>>> So I'm not against the admin virtqueue. As said before, admin virtqueue
>>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>>
>>>>>>>>>> Send a command to set the STOP status bit via the admin virtqueue. The device
>>>>>>>>>> will mark the command buffer used after it has successfully stopped the device.
>>>>>>>>>>
>>>>>>>>>> AFAIK, they are not mutually exclusive, since they are trying to solve
>>>>>>>>>> different problems.
>>>>>>>>>>
>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>
>>>>>>>>>> Admin virtqueue - transport/device specific way to implement (part of) the
>>>>>>>>>> device facility
>>>>>>>>>>
>>>>>>>>>>> Although you mentioned that the stopped state needs to be reflected in
>>>>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>>>>> migrated.
>>>>>>>>>> The guest won't see the real device status bit. VMM will shadow the device
>>>>>>>>>> status bit in this case.
>>>>>>>>>>
>>>>>>>>>> E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost device;
>>>>>>>>>> the guest is unaware of the migration.
>>>>>>>>>>
>>>>>>>>>> The STOP status bit is set by QEMU on the real virtio hardware, but the
>>>>>>>>>> guest will only see DRIVER_OK without STOP.
>>>>>>>>>>
>>>>>>>>>> It's not hard to implement the nested case on top; see the discussion
>>>>>>>>>> initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for
>>>>>>>>>> nested live migration.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>        In fact, the VMM would need to hide this bit and it's safer to
>>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>>> See above, VMM may choose to hide or expose the capability. It's useful for
>>>>>>>>>> migrating a nested guest.
>>>>>>>>>>
>>>>>>>>>> If we design an interface that can't be used in the nested environment,
>>>>>>>>>> it's not an ideal interface.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> In addition, stateful devices need to load/save non-trivial amounts of
>>>>>>>>>>> data. They need DMA to do this efficiently, so an admin virtqueue is a
>>>>>>>>>>> good fit again.
>>>>>>>>>> I don't get the point here. You still need to address exactly the same
>>>>>>>>>> issues for the admin virtqueue: the unbounded time in freezing the device,
>>>>>>>>>> and the interaction with the virtio device status state machine.
>>>>>>>>> Device state can be large, so a register interface would be a
>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>> saving/loading device state.
>>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>>> You're right, not this patch. I mentioned it because your other patch
>>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
>>>>>>> it as a register interface.
>>>>>>>
>>>>>>>> And DMA
>>>>>>>> doesn't mean a virtqueue; it could be a transport-specific method.
>>>>>>> Yes, although virtqueues are a pretty good interface that works across
>>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>>>>
>>>>>>>> I think we need to start from defining the state of one specific device and
>>>>>>>> see what is the best interface.
>>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>
>>>>>>> First we need agreement on whether "device state" encompasses the full
>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>
>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>> Some devices need implementation-specific state. They should still be
>>>>> able to live migrate even if it means cross-implementation migration and
>>>>> debug-ability is not possible.
>>>> I think we need to revisit this conclusion. Migration compatibility is
>>>> pretty important, especially considering the software stack has spent a
>>>> huge amount of effort maintaining it.
>>>>
>>>> If a virtio hardware device broke this, we would lose all the
>>>> advantages of being a standard device.
>>>>
>>>> If we can't do live migration among:
>>>>
>>>> 1) different backends, e.g. migrating from virtio hardware to virtio software
>>>> 2) different vendors
>>>>
>>>> We fail as a standard device and the customer is in fact implicitly locked
>>>> in by the vendor.
>>> My virtiofs device implementation is backed by an in-memory file system.
>>> The device state includes the contents of each file.
>>>
>>> Your virtiofs device implementation uses Linux file handles to keep
>>> track of open files. The device state includes Linux file handles (but
>>> not the contents of each file) so the destination host can access the
>>> same files on shared storage.
>>>
>>> Cornelia's virtiofs device implementation is backed by an object storage
>>> HTTP API. The device state includes API object IDs.
>>>
>>> The device state is implementation-dependent. There is no standard
>>> representation and it's not possible to migrate between device
>>> implementations. How are they supposed to migrate?
>>
>> So if I understand correctly, virtio-fs is not designed to be migratable?
>>
>> (Checking the current virtio-fs support in QEMU, it looks to me like it
>> has a migration blocker.)
> The code does not support live migration but it's on the roadmap. Max
> Reitz added Linux file handle support to virtiofsd. That was the first
> step towards being able to migrate the device's state.


A dumb question: how does QEMU know it is connected to virtiofsd?


>
>>> This is why I think it's necessary to allow implementation-specific
>>> device state representations.
>>
>> Or you probably mean you don't support cross-backend migration. This sounds
>> like a drawback, and it's then actually not a standard device but a
>> vendor/implementation-specific device.
>>
>> It would bring a lot of trouble, not only for the implementation but also
>> for management. Maybe we can start by adding migration support for some
>> specific backend and go from there.
> Yes, it's complicated. Some implementations could be compatible, but
> others can never be compatible because they have completely different
> state.
>
> The virtiofsd implementation is the main one for virtiofs and the device
> state representation can be published, even standardized. Others can
> implement it to achieve migration compatibility.
>
> But it must be possible for implementations that have completely
> different state to migrate too. virtiofsd isn't special.
>
>>>>>> 3) Not a proper uAPI design
>>>>> I never understood this argument. The Linux uAPI passes through lots of
>>>>> opaque data from devices to userspace. Allowing an
>>>>> implementation-specific device state representation is nothing new. VFIO
>>>>> already does it.
>>>> I think we've already had a lot of discussion about VFIO but without a
>>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure if it's
>>>> too late). But that's not related to virtio or this thread.
>>>>
>>>> What you propose here kind of conflicts with the efforts of virtio. I
>>>> think we all agree that we should define the state in the spec. Assuming
>>>> this is correct:
>>>>
>>>> 1) why do we still offer opaque migration state to userspace?
>>> See above. Stateful devices may require an implementation-defined device
>>> state representation.
>>
>> So my point still stands: it's not a standard device if we do this.
> These "non-standard devices" still need to be able to migrate.


See the other thread; it breaks the effort of having a spec.


>   How
> should we do that?


I think the main issue is that, to me, it's not a virtio device but a
device that uses virtqueues with implementation-specific state. So it
can't be migrated by the virtio subsystem, only through a
vendor/implementation-specific migration driver.


>
>>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>>> migration byte stream?
>>> Opaque data like D-Bus VMState:
>>> https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
>>
>> Actually, I meant how to keep the opaque state compatible with all the
>> existing devices that can do migration.
>>
>> E.g. we want to live migrate virtio-blk among any backends (from a hardware
>> device to a software backend).
> There was a series of email threads last year where migration
> compatibility was discussed:
>
> https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg02620.html
>
> I proposed an algorithm for checking migration compatibility between
> devices. The source and destination device can have different
> implementations (e.g. hardware, software, etc).
>
> It involves picking an identifier like virtio-spec.org/pci/virtio-net
> for the device state representation and device parameters for aspects of
> the device that vary between instances (e.g. tso=on|off).
>
> It's more complex than today's live migration approach in libvirt and
> QEMU. Today libvirt configures the source and destination in a
> compatible manner (thanks to knowledge of the device implementation) and
> then QEMU transfers the device state.
>
> Part of the point of defining a migration compatibility algorithm is
> that it's possible to lift the assumptions out of libvirt so that
> arbitrary device implementations can be supported (hardware, software,
> etc) without putting knowledge about every device/VMM implementation
> into libvirt.
>
> (The other advantage is that this allows orchestration software to
> determine migration compatibility before starting a migration.)


This looks like another independent issue, and I fully agree that we
should have a better migration protocol. But using that means we break
migration compatibility with the existing devices, which have been in
use for more than 10 years. We still need to make migration from/to the
existing virtio devices work.
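The algorithm Stefan refers to above boils down to comparing an identifier for the device-state representation plus the per-instance device parameters. A toy version of such a check, with names and matching rules that are my assumptions rather than the actual proposal:

```python
def migration_compatible(src, dst):
    """src/dst describe a device as a state-representation 'id' plus
    instance 'params' (e.g. {"tso": "on"}). They are compatible when the
    identifiers match and every parameter the source relies on is
    configured identically on the destination."""
    if src["id"] != dst["id"]:
        return False
    return all(dst["params"].get(k) == v for k, v in src["params"].items())

hw = {"id": "virtio-spec.org/pci/virtio-net", "params": {"tso": "on"}}
sw = {"id": "virtio-spec.org/pci/virtio-net",
      "params": {"tso": "on", "mrg_rxbuf": "on"}}
fs = {"id": "example.org/virtiofsd", "params": {}}

print(migration_compatible(hw, sw))  # same representation, params match
print(migration_compatible(hw, fs))  # different state representations
```

An orchestrator could run this check before starting a migration, which is the pre-flight property mentioned in the quoted thread.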


>
>>>>>>> That's
>>>>>>> basically the difference between the vhost/vDPA's selective passthrough
>>>>>>> approach and VFIO's full passthrough approach.
>>>>>> We can't do VFIO full passthrough for migration anyway; some kind of mdev is
>>>>>> required, but that duplicates the current vp_vdpa driver.
>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>> achieved without mdev:
>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>>>       the migration interface in hardware instead of an mdev driver.
>>>> So I think it still depends on the driver to implement migration state,
>>>> which is vendor specific.
>>> The current VFIO migration interface depends on a device-specific
>>> software mdev driver but here I'm showing that the physical device can
>>> implement the migration interface so that no device-specific driver code
>>> is needed.
>>
>> This is not what I read from the patch:
>>
>>   * device_state: (read/write)
>>   *      - The user application writes to this field to inform the vendor
>> driver
>>   *        about the device state to be transitioned to.
>>   *      - The vendor driver should take the necessary actions to change the
>>   *        device state. After successful transition to a given state, the
>>   *        vendor driver should return success on write(device_state, state)
>>   *        system call. If the device state transition fails, the vendor
>> driver
>>   *        should return an appropriate -errno for the fault condition.
>>
>> Vendor driver need to mediate between the uAPI and the actual device.
> Yes, that's the current state of VFIO migration. If a hardware interface
> (e.g. PCI Capability) is defined that maps to this API then no
> device-specific drivers would be necessary because core VFIO PCI code
> can implement the uAPI by talking to the hardware.


As we discussed, it would be very hard. The device state is
implementation specific, which may not fit in a Capability. (PCIe
already has VF migration state in the SR-IOV extended capability.)


>
>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>       userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
>>>>>       to userspace so migration can proceed in the same way as with
>>>>>       VFIO/mdev drivers.
>>>>> 3. The PCI Capability is not passed through to the guest.
>>>> This brings trouble in nested environments.
>>> It depends on the device splitting/management design. If L0 wishes to
>>> let L1 manage the VFs then it would need to expose a management device.
>>> Since the migration interface is generic (not device-specific) a generic
>>> management device solves this for all devices.
>>
>> Right, but it's a burden to expose the management device, or it may just
>> not work.
> A single generic management device is not a huge burden and it may turn
> out that keeping the management device out-of-band is actually a
> desirable feature if the device owner does not wish to expose the
> stop/save/load functionality for some reason.


The VMM is free to hide those features from the guest. Management can
just do -device virtio-pci,state=false.

Having a management device works for L0 but is not suitable for
L(x>0). A per-function device interface is a must for nested
virtualization to work in a simple and easy way.
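Hiding STOP is cheap for a VMM because it already intercepts the device status register: it only has to mask the bit on guest reads and refuse guest writes to it, as described earlier in the thread. A toy model (DRIVER_OK's value is from the spec; 32 for STOP is the value used in this thread's discussion, not a ratified assignment):

```python
DRIVER_OK = 0x04   # defined in the virtio spec
STOP = 32          # value used in this thread's discussion

class ShadowedStatus:
    """VMM-side shadow of device_status: the real device may have STOP
    set (e.g. during live migration) while the guest keeps seeing plain
    DRIVER_OK."""
    def __init__(self):
        self.hw_status = 0

    def guest_read(self):
        return self.hw_status & ~STOP        # guest never observes STOP

    def guest_write(self, val):
        # Guest writes cannot set or clear STOP when the VMM hides it.
        self.hw_status = (self.hw_status & STOP) | (val & ~STOP)

    def vmm_stop(self):
        self.hw_status |= STOP               # VMM stops the real device

s = ShadowedStatus()
s.guest_write(DRIVER_OK)
s.vmm_stop()
print(s.guest_read() == DRIVER_OK)           # migration invisible to the guest
```

For a nested setup, the L1 VMM can instead pass the bit through, which is the VIRTIO_F_STOP exposure discussed elsewhere in the thread.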

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 10:31                                 ` Cornelia Huck
@ 2021-07-21  2:59                                   ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-21  2:59 UTC (permalink / raw)
  To: Cornelia Huck, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/20 6:31 PM, Cornelia Huck wrote:
> On Fri, Jul 16 2021, Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2021/7/15 5:16 PM, Stefan Hajnoczi wrote:
>>> Stopping
>>> devices sequentially increases migration downtime, so I think the
>>> interface should encourage concurrently stopping multiple devices.
>>>
>>> I think you and Cornelia discussed that an interrupt could be added to
>>> solve this problem. That would address my concerns about the STOP bit.
>>
>> The problems are:
>>
>> 1) if we generate an interrupt after STOP, it breaks the STOP semantics,
>> where the device should not generate any interrupts
>> 2) if we generate an interrupt before STOP, we may end up with race
>> conditions
> I think not all interrupts are created equal here.
>
> For virtqueue notification interrupts, I agree. If the device is being
> stopped, no notification interrupts will be generated.
>
> For device interrupts in the general sense, banning these would make it
> impossible to implement STOP for CCW, as any channel program (be it
> RESET, READ_STATUS, or any new one) is required to generate a status
> pending/interrupt when it is finished.


So they work at different levels. STOP is at the virtio level, not the 
transport level.

STOP means the device stops working at the virtio level; it doesn't mean 
the device stops working at the transport level. CCW can still raise its 
interrupt as long as it is not generated by a virtqueue or a config change.

The same holds for RESET: I believe for both CCW and PCI, a virtio RESET 
doesn't mean a transport-level reset.
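To make the semantics concrete, here is a toy model (Python, purely illustrative) of the proposed status machine. The ACKNOWLEDGE/DRIVER/DRIVER_OK/FEATURES_OK values match the spec; STOP = 32 follows Eugenio's "STOP(32)" example elsewhere in this thread, and the uni-directional rule (STOP can only be left via reset) is from the V2 changelog:

```python
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
STOP = 32  # proposed bit; value illustrative, per Eugenio's example


class ToyDevice:
    """Toy model of the proposed device status state machine."""

    def __init__(self):
        self.status = 0
        self.vq_state = {"avail_idx": 0, "used_idx": 0}

    def write_status(self, val):
        if val == 0:
            # Reset: the only way out of STOP; virtqueue state is lost.
            self.status = 0
            self.vq_state = {"avail_idx": 0, "used_idx": 0}
            return
        if (self.status & STOP) and not (val & STOP):
            # STOP is uni-directional: the driver is forbidden to clear it.
            raise ValueError("STOP cannot be cleared; reset to resume")
        self.status = val


dev = ToyDevice()
dev.write_status(ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
dev.vq_state["avail_idx"] = 5          # some I/O happened
dev.write_status(dev.status | STOP)    # quiesce the device
assert dev.vq_state["avail_idx"] == 5  # state preserved under STOP
dev.write_status(0)                    # reset is the only way to resume
assert dev.status == 0 and dev.vq_state["avail_idx"] == 0
```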


> I also don't see how that would
> create a race condition for CCW.
>
> Why can't we simply have an interrupt indicating completion of the STOP
> request, and no further interrupts after that?


So I think we had some discussion on this. I think STOP should work like 
RESET.

And if we want an indication for STOP, why wouldn't we need one for RESET 
as well?

Thanks


>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 13:09                                         ` Max Gurtovoy
@ 2021-07-21  3:06                                           ` Jason Wang
  2021-07-21 10:48                                           ` Stefan Hajnoczi
  1 sibling, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-21  3:06 UTC (permalink / raw)
  To: Max Gurtovoy, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/20 9:09 PM, Max Gurtovoy wrote:
>
> On 7/20/2021 3:57 PM, Stefan Hajnoczi wrote:
>> On Tue, Jul 20, 2021 at 03:27:00PM +0300, Max Gurtovoy wrote:
>>> On 7/20/2021 6:02 AM, Jason Wang wrote:
>>>> 在 2021/7/19 下午8:43, Stefan Hajnoczi 写道:
>>>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>>>> 在 2021/7/15 下午6:01, Stefan Hajnoczi 写道:
>>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>>> 在 2021/7/14 下午11:07, Stefan Hajnoczi 写道:
>>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>>> 在 2021/7/14 下午5:53, Stefan Hajnoczi 写道:
>>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>>> 在 2021/7/13 下午6:00, Stefan Hajnoczi 写道:
>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>>> 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
>>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>>> 在 2021/7/11 上午4:36, Michael S. Tsirkin 写道:
>>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>> That's basically the difference between the vhost/vDPA's selective
>>>>>>>>> passthrough approach and VFIO's full passthrough approach.
>>>>>>>> We can't do VFIO full passthrough for migration anyway, some kind
>>>>>>>> of mdev is required but it's duplicated with the current vp_vdpa driver.
>>>>>>> I'm not sure that's true. Generic VFIO PCI migration can 
>>>>>>> probably be
>>>>>>> achieved without mdev:
>>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>>       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device
>>>>>>> to implement
>>>>>>>       the migration interface in hardware instead of an mdev 
>>>>>>> driver.
>>>>>> So I think it still depend on the driver to implement migrate
>>>>>> state which is
>>>>>> vendor specific.
>>>>> The current VFIO migration interface depends on a device-specific
>>>>> software mdev driver but here I'm showing that the physical device 
>>>>> can
>>>>> implement the migration interface so that no device-specific 
>>>>> driver code
>>>>> is needed.
>>>>
>>>> This is not what I read from the patch:
>>>>
>>>>   * device_state: (read/write)
>>>>   *      - The user application writes to this field to inform the 
>>>> vendor
>>>> driver
>>>>   *        about the device state to be transitioned to.
>>>>   *      - The vendor driver should take the necessary actions to 
>>>> change
>>>> the
>>>>   *        device state. After successful transition to a given 
>>>> state, the
>>>>   *        vendor driver should return success on write(device_state,
>>>> state)
>>>>   *        system call. If the device state transition fails, the 
>>>> vendor
>>>> driver
>>>>   *        should return an appropriate -errno for the fault 
>>>> condition.
>>>>
>>>> Vendor driver need to mediate between the uAPI and the actual device.
>>> We're building an infrastructure for VFIO PCI devices in the last few
>>> months.
>>>
>>> It should be merged hopefully to kernel 5.15.
>> Do you have links to patch series or a brief description of the VFIO API
>> features that are on the roadmap?
>
> We divided it into a few patchsets.
>
> The entire series can be found at:
>
> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>
> We'll first add support for mlx5 devices suspend/resume (ConnectX-6 
> and above).
>
> The driver is ready in the series above.
>
> The next step is to add Live migration for virtio. Hopefully we will 
> be able to agree on the Virtio PCI in the near future.


So I still think that for virtio-pci the migration should be done via the 
vp_vdpa driver instead of using vfio/mdev.

Using vDPA has many advantages (we had a lot of discussion in the past) 
since it presents a virtio device instead of a transport-specific one:

1) software/management stack is ready
2) transport independent, micro VM ready
3) migration compatibility could be maintained
4) management API ready
5) fail-over or multi-path in the future

etc.


>
>>
>>>>
>>>>>> Note that it's just an uAPI definition not something defined in 
>>>>>> the PCI
>>>>>> spec.
>>>>> Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI 
>>>>> into a
>>>>> standard PCI Capability to eliminate the need for device-specific
>>>>> drivers.
>>>>
>>>> Ok.
>>>>
>>>>
>>>>>> Out of curiosity, the patch is merged without any real users in
>>>>>> the Linux.
>>>>>> This is very bad since we lose the chance to audit the whole design.
>>>>> I agree. It would have helped to have a complete vision for how live
>>>>> migration should work along with demos. I don't see any migration 
>>>>> code
>>>>> in samples/vfio-mdev/ :(.
>>>>
>>>> Right.
>>> Creating a standard is not related to Linux nor VFIO.
>>>
>>> With the proposal that I've sent, we can develop a migration driver and
>>> virtio device that will support it (NVIDIA virtio-blk SNAP device).
>>>
>>> And you can build live migration support in virtio_vdpa driver (if VDPA
>>> migration protocol will be implemented).
>> I guess "VDPA migration protocol" is not referring to Jason's proposal
>> but instead to a new vDPA interface that adds the migration interface
>> from you proposal to vDPA?
>
> I mean a parallel protocol to VFIO.


Actually, the first step is to define the state and its API in the spec, 
as both of us are trying to achieve, and make it independent of any kind 
of implementation (VFIO or vDPA).
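For reference, patch 1 of this series splits the per-virtqueue state into available state and used state (used_idx was added in V2 because the used ring is read-only for the driver). A rough sketch of save/restore of that state, with illustrative field names rather than the spec's wording:

```python
from dataclasses import dataclass


@dataclass
class SplitVqState:
    # Available state: next avail ring index the device will read.
    last_avail_idx: int
    # Used state: next used ring index the device will write
    # (driver can't infer this, hence the explicit used_idx in V2).
    used_idx: int


def save_vq(regs):
    # Read back the indices once the device has set the STOP status bit.
    return SplitVqState(last_avail_idx=regs["avail"], used_idx=regs["used"])


def restore_vq(state):
    # Written on the destination after FEATURES_OK but before DRIVER_OK,
    # per the device-initialization rule added in V2.
    return {"avail": state.last_avail_idx, "used": state.used_idx}


saved = save_vq({"avail": 7, "used": 5})
assert restore_vq(saved) == {"avail": 7, "used": 5}
```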

Thanks


>
>
>> Stefan
>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 12:27                                     ` Max Gurtovoy
  2021-07-20 12:57                                       ` Stefan Hajnoczi
@ 2021-07-21  3:09                                       ` Jason Wang
  2021-07-21 11:43                                         ` Max Gurtovoy
  1 sibling, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-21  3:09 UTC (permalink / raw)
  To: Max Gurtovoy, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/20 8:27 PM, Jason Wang wrote:
>
> On 7/20/2021 6:02 AM, Jason Wang wrote:
>>
>> 在 2021/7/19 下午8:43, Stefan Hajnoczi 写道:
>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>> 在 2021/7/15 下午6:01, Stefan Hajnoczi 写道:
>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>> 在 2021/7/14 下午11:07, Stefan Hajnoczi 写道:
>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>> 在 2021/7/14 下午5:53, Stefan Hajnoczi 写道:
>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>> 在 2021/7/13 下午6:00, Stefan Hajnoczi 写道:
>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>> 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>> 在 2021/7/11 上午4:36, Michael S. Tsirkin 写道:
>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez 
>>>>>>>>>>>>>>> Martin wrote:
>>>>>>>>>>>>>>>>>> If I understand correctly, this is all
>>>>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this 
>>>>>>>>>>>>>>>>>> to work
>>>>>>>>>>>>>>>>>> the guest must be running and already have 
>>>>>>>>>>>>>>>>>> initialised the driver.
>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the 
>>>>>>>>>>>>>>>> VMM as long as
>>>>>>>>>>>>>>>> it intercept the relevant configuration space (PCI, 
>>>>>>>>>>>>>>>> MMIO, etc) from
>>>>>>>>>>>>>>>> guest's reads and writes, and present it as coherent 
>>>>>>>>>>>>>>>> and transparent
>>>>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a 
>>>>>>>>>>>>>>>> physical device (or
>>>>>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The 
>>>>>>>>>>>>>>>> guest cannot stop
>>>>>>>>>>>>>>>> the device, so any write to this flag is an 
>>>>>>>>>>>>>>>> error/undefined.
>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can 
>>>>>>>>>>>>>>>> stop the device.
>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live 
>>>>>>>>>>>>>>>> migration, and the
>>>>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. 
>>>>>>>>>>>>>>>> It resets the
>>>>>>>>>>>>>>>> destination device with the state, and then initializes 
>>>>>>>>>>>>>>>> the device.
>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is 
>>>>>>>>>>>>>>>> set, the source
>>>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM 
>>>>>>>>>>>>>>>> realizes the bit,
>>>>>>>>>>>>>>>> so it sets the bit in the destination too after device 
>>>>>>>>>>>>>>>> initialization.
>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it 
>>>>>>>>>>>>>>>> doesn't matter
>>>>>>>>>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump 
>>>>>>>>>>>>>>> through though.
>>>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes 
>>>>>>>>>>>>>> you think it's hard
>>>>>>>>>>>>>> to implement?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> E.g for networking device, it should be sufficient to use 
>>>>>>>>>>>>>> this bit + the
>>>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why don't we design the feature in a way that is useable 
>>>>>>>>>>>>>>> by VMMs
>>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>>> It use the common technology like register shadowing 
>>>>>>>>>>>>>> without any further
>>>>>>>>>>>>>> stuffs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (I think we all know migration will be very hard if we 
>>>>>>>>>>>>>> simply pass through
>>>>>>>>>>>>>> those state registers).
>>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device 
>>>>>>>>>>>>> Status field
>>>>>>>>>>>>> bit then there's no need to re-read the Device Status 
>>>>>>>>>>>>> field in a loop
>>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>>
>>>>>>>>>>>> - This proposal has nothing to do with admin virtqueue. 
>>>>>>>>>>>> Actually, admin
>>>>>>>>>>>> virtqueue could be used for carrying any basic device 
>>>>>>>>>>>> facility like status
>>>>>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue 
>>>>>>>>>>>> as a "transport"
>>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>>> - Even if we had introduced admin virtqueue, we still need 
>>>>>>>>>>>> a per function
>>>>>>>>>>>> interface for this. This is a must for nested 
>>>>>>>>>>>> virtualization, we can't
>>>>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>>>>> - According to the proposal, there's no need for the device 
>>>>>>>>>>>> to complete all
>>>>>>>>>>>> the consumed buffers, device can choose to expose those 
>>>>>>>>>>>> inflight descriptors
>>>>>>>>>>>> in a device specific way and set the STOP bit. This means, 
>>>>>>>>>>>> if we have the
>>>>>>>>>>>> device specific in-flight descriptor reporting facility, 
>>>>>>>>>>>> the device can
>>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>>> - If we don't go with the basic device facility but using 
>>>>>>>>>>>> the admin
>>>>>>>>>>>> virtqueue specific method, we still need to clarify how it 
>>>>>>>>>>>> works with the
>>>>>>>>>>>> device status state machine, it will be some kind of 
>>>>>>>>>>>> sub-states which looks
>>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy 
>>>>>>>>>>>>> waiting approach
>>>>>>>>>>>>> extends downtime if implemented sequentially (stopping one 
>>>>>>>>>>>>> device at a
>>>>>>>>>>>>> time).
>>>>>>>>>>>> Well. You need some kinds of waiting for sure, the 
>>>>>>>>>>>> device/DMA needs sometime
>>>>>>>>>>>> to be stopped. The downtime is determined by a specific virtio
>>>>>>>>>>>> implementation which is hard to be restricted at the spec 
>>>>>>>>>>>> level. We can
>>>>>>>>>>>> clarify that the device must set the STOP bit in e.g 100ms.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>        It can be implemented concurrently (setting the 
>>>>>>>>>>>>> STOP bit on all
>>>>>>>>>>>>> devices and then looping until all their Device Status 
>>>>>>>>>>>>> fields have the
>>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires 
>>>>>>>>>>>>> busy
>>>>>>>>>>>>> waiting...
>>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>>
>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration 
>>>>>>>>>>>> structure layout
>>>>>>>>>>>>
>>>>>>>>>>>> After writing 0 to device_status, the driver MUST wait for 
>>>>>>>>>>>> a read of
>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>
>>>>>>>>>>>> Since it was required for at least one transport. We need 
>>>>>>>>>>>> do something
>>>>>>>>>>>> similar to when introducing basic facility.
>>>>>>>>>>> Adding the STOP but as a Device Status bit is a small and 
>>>>>>>>>>> clean VIRTIO
>>>>>>>>>>> spec change. I like that.
>>>>>>>>>>>
>>>>>>>>>>> On the other hand, devices need time to stop and that time 
>>>>>>>>>>> can be
>>>>>>>>>>> unbounded. For example, software virtio-blk/scsi 
>>>>>>>>>>> implementations since
>>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux 
>>>>>>>>>>> hosts.
>>>>>>>>>>>
>>>>>>>>>>> The natural interface for long-running operations is 
>>>>>>>>>>> virtqueue requests.
>>>>>>>>>>> That's why I mentioned the alternative of using an admin 
>>>>>>>>>>> virtqueue
>>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>>> So I'm not against the admin virtqueue. As said before, admin 
>>>>>>>>>> virtqueue
>>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>>
>>>>>>>>>> Send a command to set STOP status bit to admin virtqueue. 
>>>>>>>>>> Device will make
>>>>>>>>>> the command buffer used after it has successfully stopped the 
>>>>>>>>>> device.
>>>>>>>>>>
>>>>>>>>>> AFAIK, they are not mutually exclusive, since they are trying 
>>>>>>>>>> to solve
>>>>>>>>>> different problems.
>>>>>>>>>>
>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>
>>>>>>>>>> Admin virtqueue - transport/device specific way to implement 
>>>>>>>>>> (part of) the
>>>>>>>>>> device facility
>>>>>>>>>>
>>>>>>>>>>> Although you mentioned that the stopped state needs to be 
>>>>>>>>>>> reflected in
>>>>>>>>>>> the Device Status field somehow, I'm not sure about that 
>>>>>>>>>>> since the
>>>>>>>>>>> driver typically doesn't need to know whether the device is 
>>>>>>>>>>> being
>>>>>>>>>>> migrated.
>>>>>>>>>> The guest won't see the real device status bit. VMM will 
>>>>>>>>>> shadow the device
>>>>>>>>>> status bit in this case.
>>>>>>>>>>
>>>>>>>>>> E.g with the current vhost-vDPA, vDPA behave like a vhost 
>>>>>>>>>> device, guest is
>>>>>>>>>> unaware of the migration.
>>>>>>>>>>
>>>>>>>>>> STOP status bit is set by Qemu to real virtio hardware. But 
>>>>>>>>>> guest will only
>>>>>>>>>> see the DRIVER_OK without STOP.
>>>>>>>>>>
>>>>>>>>>> It's not hard to implement the nested on top, see the 
>>>>>>>>>> discussion initiated
>>>>>>>>>> by Eugenio about how expose VIRTIO_F_STOP to guest for nested 
>>>>>>>>>> live
>>>>>>>>>> migration.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>       In fact, the VMM would need to hide this bit and it's 
>>>>>>>>>>> safer to
>>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>>> See above, VMM may choose to hide or expose the capability. 
>>>>>>>>>> It's useful for
>>>>>>>>>> migrating a nested guest.
>>>>>>>>>>
>>>>>>>>>> If we design an interface that can't be used in the nested 
>>>>>>>>>> environment, it's not an ideal interface.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> In addition, stateful devices need to load/save non-trivial 
>>>>>>>>>>> amounts of
>>>>>>>>>>> data. They need DMA to do this efficiently, so an admin 
>>>>>>>>>>> virtqueue is a
>>>>>>>>>>> good fit again.
>>>>>>>>>> I don't get the point here. You still need to address the 
>>>>>>>>>> exact the similar
>>>>>>>>>> issues for admin virtqueue: the unbound time in freezing the 
>>>>>>>>>> device, the
>>>>>>>>>> interaction with the virtio device status state machine.
>>>>>>>>> Device state state can be large so a register interface would 
>>>>>>>>> be a
>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>> saving/loading device state.
>>>>>>>> So this patch doesn't mandate a register interface, isn't it?
>>>>>>> You're right, not this patch. I mentioned it because your other 
>>>>>>> patch
>>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") 
>>>>>>> implements
>>>>>>> it a register interface.
>>>>>>>
>>>>>>>> And DMA
>>>>>>>> doesn't means a virtqueue, it could be a transport specific 
>>>>>>>> method.
>>>>>>> Yes, although virtqueues are a pretty good interface that works 
>>>>>>> across
>>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory 
>>>>>>> layout.
>>>>>>>
>>>>>>>> I think we need to start from defining the state of one 
>>>>>>>> specific device and
>>>>>>>> see what is the best interface.
>>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>
>>>>>>> First we need agreement on whether "device state" encompasses 
>>>>>>> the full
>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>
>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>> Some devices need implementation-specific state. They should still be
>>>>> able to live migrate even if it means cross-implementation 
>>>>> migration and
>>>>> debug-ability is not possible.
>>>>
>>>> I think we need to re-visit this conclusion. Migration compatibility is
>>>> pretty important, especially considering the software stack has spent a
>>>> huge amount of effort in maintaining it.
>>>>
>>>> If a virtio hardware device breaks this, it means we lose all the
>>>> advantages of being a standard device.
>>>>
>>>> If we can't do live migration among:
>>>>
>>>> 1) different backends, e.g. migrating from virtio hardware to software
>>>> virtio
>>>> 2) different vendors
>>>>
>>>> we fail as a standard device and the customer is in fact locked to the
>>>> vendor implicitly.
>>> My virtiofs device implementation is backed by an in-memory file 
>>> system.
>>> The device state includes the contents of each file.
>>>
>>> Your virtiofs device implementation uses Linux file handles to keep
>>> track of open files. The device state includes Linux file handles (but
>>> not the contents of each file) so the destination host can access the
>>> same files on shared storage.
>>>
>>> Cornelia's virtiofs device implementation is backed by an object 
>>> storage
>>> HTTP API. The device state includes API object IDs.
>>>
>>> The device state is implementation-dependent. There is no standard
>>> representation and it's not possible to migrate between device
>>> implementations. How are they supposed to migrate?
>>
>>
>> So if I understand correctly, virtio-fs is not designed to be 
>> migratable?
>>
>> (Having a check on the current virtio-fs support in qemu, it looks to 
>> me it has a migration blocker).
>>
>>
>>>
>>> This is why I think it's necessarily to allow implementation-specific
>>> device state representations.
>>
>>
>> Or you probably mean you don't support cross-backend migration. This 
>> sounds like a drawback; it's then actually not a standard device but a 
>> vendor/implementation-specific device.
>>
>> It would bring a lot of trouble, not only for the implementation but 
>> for the management. Maybe we can start by adding migration support for 
>> some specific backend and go from there.
>>
>>
>>>
>>>>>> 3) Not a proper uAPI desgin
>>>>> I never understood this argument. The Linux uAPI passes through 
>>>>> lots of
>>>>> opaque data from devices to userspace. Allowing an
>>>>> implementation-specific device state representation is nothing 
>>>>> new. VFIO
>>>>> already does it.
>>>>
>>>> I think we've already had a lots of discussion for VFIO but without a
>>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure 
>>>> if it's
>>>> too late). But that's not related to virito and this thread.
>>>>
>>>> What you propose here is kind of in conflict with the efforts of
>>>> virtio. I think we all agree that we should define the state in the
>>>> spec. Assuming this is correct:
>>>>
>>>> 1) why do we still offer opaque migration state to userspace?
>>> See above. Stateful devices may require an implementation-defined 
>>> device
>>> state representation.
>>
>>
>> So my point still stands: it's not a standard device if we do this.
>>
>>
>>>
>>>> 2) how can it be integrated into the current VMM (Qemu) virtio 
>>>> devices'
>>>> migration bytes stream?
>>> Opaque data like D-Bus VMState:
>>> https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
>>
>>
>> Actually, I meant how to keep the opaque state compatible with all the 
>> existing devices that can do migration.
>>
>> E.g. we want to live migrate virtio-blk among arbitrary backends (from a 
>> hardware device to a software backend).
>
> I prefer we'll handle HW to SW migration in the future.


Yes, that's very important and one of the key advantages of virtio.


>
> We're still debating on other basic stuff.
>
>>
>>
>>>
>>>>>>> That's
>>>>>>> basically the difference between the vhost/vDPA's selective 
>>>>>>> passthrough
>>>>>>> approach and VFIO's full passthrough approach.
>>>>>> We can't do VFIO full passthrough for migration anyway, some kind 
>>>>>> of mdev is required but it's duplicated with the current vp_vdpa driver.
>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>> achieved without mdev:
>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to 
>>>>> implement
>>>>>      the migration interface in hardware instead of an mdev driver.
>>>>
>>>> So I think it still depend on the driver to implement migrate state 
>>>> which is
>>>> vendor specific.
>>> The current VFIO migration interface depends on a device-specific
>>> software mdev driver but here I'm showing that the physical device can
>>> implement the migration interface so that no device-specific driver 
>>> code
>>> is needed.
>>
>>
>> This is not what I read from the patch:
>>
>>  * device_state: (read/write)
>>  *      - The user application writes to this field to inform the 
>> vendor driver
>>  *        about the device state to be transitioned to.
>>  *      - The vendor driver should take the necessary actions to 
>> change the
>>  *        device state. After successful transition to a given state, 
>> the
>>  *        vendor driver should return success on write(device_state, 
>> state)
>>  *        system call. If the device state transition fails, the 
>> vendor driver
>>  *        should return an appropriate -errno for the fault condition.
>>
>> Vendor driver need to mediate between the uAPI and the actual device.
>
> We're building an infrastructure for VFIO PCI devices in the last few 
> months.
>
> It should be merged hopefully to kernel 5.15.


Ok.


>
>>
>>
>>>
>>>> Note that it's just an uAPI definition not something defined in the 
>>>> PCI
>>>> spec.
>>> Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI 
>>> into a
>>> standard PCI Capability to eliminate the need for device-specific
>>> drivers.
>>
>>
>> Ok.
>>
>>
>>>
>>>> Out of curiosity, the patch is merged without any real users in the 
>>>> Linux.
>>>> This is very bad since we lose the chance to audit the whole design.
>>> I agree. It would have helped to have a complete vision for how live
>>> migration should work along with demos. I don't see any migration code
>>> in samples/vfio-mdev/ :(.
>>
>>
>> Right.
>
> Creating a standard is not related to Linux nor VFIO.


I fully agree here.


>
> With the proposal that I've sent, we can develop a migration driver 
> and virtio device that will support it (NVIDIA virtio-blk SNAP device).
>
> And you can build live migration support in virtio_vdpa driver (if 
> VDPA migration protocol will be implemented).


Right, vp_vdpa fits naturally for this. But I don't see much value in a 
dedicated migration driver, do you?

Thanks


>
>
>>
>>
>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>      userspace or core VFIO PCI code advertises 
>>>>> VFIO_REGION_TYPE_MIGRATION
>>>>>      to userspace so migration can proceed in the same way as with
>>>>>      VFIO/mdev drivers.
>>>>> 3. The PCI Capability is not passed through to the guest.
>>>>
>>>> This brings troubles in the nested environment.
>>> It depends on the device splitting/management design. If L0 wishes to
>>> let L1 manage the VFs then it would need to expose a management device.
>>> Since the migration interface is generic (not device-specific) a 
>>> generic
>>> management device solves this for all devices.
>>
>>
>> Right, but it's a burden to expose the management device, or it may 
>> just not work.
>>
>> Thanks
>>
>>
>>>
>>> Stefan
>>
>



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21  2:29                                         ` Jason Wang
@ 2021-07-21 10:20                                           ` Stefan Hajnoczi
  2021-07-22  7:33                                             ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-21 10:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
> 
> 在 2021/7/20 下午4:50, Stefan Hajnoczi 写道:
> > On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
> > > 在 2021/7/19 下午8:45, Stefan Hajnoczi 写道:
> > > > On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/16 上午10:03, Jason Wang 写道:
> > > > > > 在 2021/7/15 下午6:01, Stefan Hajnoczi 写道:
> > > > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > > > 在 2021/7/14 下午11:07, Stefan Hajnoczi 写道:
> > > > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > > > 在 2021/7/14 下午5:53, Stefan Hajnoczi 写道:
> > > > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > > > 在 2021/7/13 下午6:00, Stefan Hajnoczi 写道:
> > > > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
> > > > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > 在 2021/7/11 上午4:36, Michael S. Tsirkin 写道:
> > > > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at
> > > > > > > > > > > > > > > > > 07:23:33PM +0200, Eugenio
> > > > > > > > > > > > > > > > > Perez Martin wrote:
> > > > > > > > > > > > > > > > > > > > If I understand correctly, this is all driven from the
> > > > > > > > > > > > > > > > > > > > driver inside the guest, so for this to work the guest
> > > > > > > > > > > > > > > > > > > > must be running and already have initialised the driver.
> > > > > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM
> > > > > > > > > > > > > > > > > > as long as it intercept the relevant configuration space
> > > > > > > > > > > > > > > > > > (PCI, MMIO, etc) from guest's reads and writes, and present
> > > > > > > > > > > > > > > > > > it as coherent and transparent for the guest. Some use cases
> > > > > > > > > > > > > > > > > > I can imagine with a physical device (or vp_vpda device)
> > > > > > > > > > > > > > > > > > with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest
> > > > > > > > > > > > > > > > > > cannot stop the device, so any write to this flag is an
> > > > > > > > > > > > > > > > > > error/undefined.
> > > > > > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop
> > > > > > > > > > > > > > > > > > the device.
> > > > > > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration,
> > > > > > > > > > > > > > > > > > and the guest does not write to STOP in any moment of the
> > > > > > > > > > > > > > > > > > LM. It resets the destination device with the state, and
> > > > > > > > > > > > > > > > > > then initializes the device.
> > > > > > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set,
> > > > > > > > > > > > > > > > > > the source VMM migrates the device status. The destination
> > > > > > > > > > > > > > > > > > VMM realizes the bit, so it sets the bit in the destination
> > > > > > > > > > > > > > > > > > too after device initialization.
> > > > > > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it
> > > > > > > > > > > > > > > > > > doesn't matter what bit has the HW, but the VM can be
> > > > > > > > > > > > > > > > > > migrated.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump
> > > > > > > > > > > > > > > > > through though. It's also not easy for devices to implement.
> > > > > > > > > > > > > > > > It just requires a new status bit. Anything that makes you
> > > > > > > > > > > > > > > > think it's hard to implement?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > E.g for networking device, it should be sufficient to use this
> > > > > > > > > > > > > > > > bit + the virtqueue state.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Why don't we design the
> > > > > > > > > > > > > > > > > feature in a way that is
> > > > > > > > > > > > > > > > > useable by VMMs
> > > > > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > > > > It use the common technology
> > > > > > > > > > > > > > > > like register shadowing without
> > > > > > > > > > > > > > > > any further
> > > > > > > > > > > > > > > > stuffs.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > (I think we all know migration
> > > > > > > > > > > > > > > > will be very hard if we simply
> > > > > > > > > > > > > > > > pass through
> > > > > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > > > > If an admin virtqueue is used
> > > > > > > > > > > > > > > instead of the STOP Device Status
> > > > > > > > > > > > > > > field
> > > > > > > > > > > > > > > bit then there's no need to re-read
> > > > > > > > > > > > > > > the Device Status field in a loop
> > > > > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > > > > Probably not. Let me to clarify several points:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > - This proposal has nothing to do with
> > > > > > > > > > > > > > admin virtqueue. Actually, admin
> > > > > > > > > > > > > > virtqueue could be used for carrying any
> > > > > > > > > > > > > > basic device facility like status
> > > > > > > > > > > > > > bit. E.g I'm going to post patches that
> > > > > > > > > > > > > > use admin virtqueue as a "transport"
> > > > > > > > > > > > > > for device slicing at virtio level.
> > > > > > > > > > > > > > - Even if we had introduced admin
> > > > > > > > > > > > > > virtqueue, we still need a per function
> > > > > > > > > > > > > > interface for this. This is a must for
> > > > > > > > > > > > > > nested virtualization, we can't
> > > > > > > > > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > > > > > > > > - According to the proposal, there's no
> > > > > > > > > > > > > > need for the device to complete all
> > > > > > > > > > > > > > the consumed buffers, device can choose
> > > > > > > > > > > > > > to expose those inflight descriptors
> > > > > > > > > > > > > > in a device specific way and set the
> > > > > > > > > > > > > > STOP bit. This means, if we have the
> > > > > > > > > > > > > > device specific in-flight descriptor
> > > > > > > > > > > > > > reporting facility, the device can
> > > > > > > > > > > > > > almost set the STOP bit immediately.
> > > > > > > > > > > > > > - If we don't go with the basic device
> > > > > > > > > > > > > > facility but using the admin
> > > > > > > > > > > > > > virtqueue specific method, we still need
> > > > > > > > > > > > > > to clarify how it works with the
> > > > > > > > > > > > > > device status state machine, it will be
> > > > > > > > > > > > > > some kind of sub-states which looks
> > > > > > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > When migrating a guest with many
> > > > > > > > > > > > > > > VIRTIO devices a busy waiting
> > > > > > > > > > > > > > > approach
> > > > > > > > > > > > > > > extends downtime if implemented
> > > > > > > > > > > > > > > sequentially (stopping one device at
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > time).
> > > > > > > > > > > > > > Well. You need some kinds of waiting for
> > > > > > > > > > > > > > sure, the device/DMA needs sometime
> > > > > > > > > > > > > > to be stopped. The downtime is determined by a specific virtio
> > > > > > > > > > > > > > implementation which is hard to be
> > > > > > > > > > > > > > restricted at the spec level. We can
> > > > > > > > > > > > > > clarify that the device must set the STOP bit in e.g 100ms.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >         It can be implemented
> > > > > > > > > > > > > > > concurrently (setting the STOP bit
> > > > > > > > > > > > > > > on all
> > > > > > > > > > > > > > > devices and then looping until all
> > > > > > > > > > > > > > > their Device Status fields have the
> > > > > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > > > > I still don't get what kind of complexity did you worry here.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > > > > waiting...
> > > > > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common
> > > > > > > > > > > > > > configuration structure layout
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > After writing 0 to device_status, the
> > > > > > > > > > > > > > driver MUST wait for a read of
> > > > > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Since it was required for at least one
> > > > > > > > > > > > > > transport. We need do something
> > > > > > > > > > > > > > similar to when introducing basic facility.
> > > > > > > > > > > > > Adding STOP as a Device Status bit is a small and clean VIRTIO spec
> > > > > > > > > > > > > change. I like that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > > > > unbounded. For example, software virtio-blk/scsi implementations cannot
> > > > > > > > > > > > > immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The natural interface for long-running
> > > > > > > > > > > > > operations is virtqueue requests.
> > > > > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > > > > instead of a Device Status bit.
> > > > > > > > > > > > So I'm not against the admin virtqueue. As said
> > > > > > > > > > > > before, admin virtqueue
> > > > > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > > > > 
> > > > > > > > > > > > Send a command to set STOP status bit to admin
> > > > > > > > > > > > virtqueue. Device will make
> > > > > > > > > > > > the command buffer used after it has
> > > > > > > > > > > > successfully stopped the device.
> > > > > > > > > > > > 
> > > > > > > > > > > > AFAIK, they are not mutually exclusive, since
> > > > > > > > > > > > they are trying to solve
> > > > > > > > > > > > different problems.
> > > > > > > > > > > > 
> > > > > > > > > > > > Device status - basic device facility
> > > > > > > > > > > > 
> > > > > > > > > > > > Admin virtqueue - transport/device specific way
> > > > > > > > > > > > to implement (part of) the
> > > > > > > > > > > > device facility
> > > > > > > > > > > > 
> > > > > > > > > > > > > Although you mentioned that the stopped
> > > > > > > > > > > > > state needs to be reflected in
> > > > > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > > > > migrated.
> > > > > > > > > > > > The guest won't see the real device status bit.
> > > > > > > > > > > > VMM will shadow the device
> > > > > > > > > > > > status bit in this case.
> > > > > > > > > > > > 
> > > > > > > > > > > > E.g with the current vhost-vDPA, vDPA behaves like a vhost device; the
> > > > > > > > > > > > guest is unaware of the migration.
> > > > > > > > > > > > 
> > > > > > > > > > > > STOP status bit is set by Qemu to real virtio
> > > > > > > > > > > > hardware. But guest will only
> > > > > > > > > > > > see the DRIVER_OK without STOP.
> > > > > > > > > > > > 
> > > > > > > > > > > > It's not hard to implement the nested case on top, see the discussion
> > > > > > > > > > > > initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for
> > > > > > > > > > > > nested live migration.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > >        In fact, the VMM would need to hide
> > > > > > > > > > > > > this bit and it's safer to
> > > > > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > > > > See above, VMM may choose to hide or expose the
> > > > > > > > > > > > capability. It's useful for
> > > > > > > > > > > > migrating a nested guest.
> > > > > > > > > > > > 
> > > > > > > > > > > > If we design an interface that can't be used in the nested environment,
> > > > > > > > > > > > it's not an ideal interface.
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > > In addition, stateful devices need to
> > > > > > > > > > > > > load/save non-trivial amounts of
> > > > > > > > > > > > > data. They need DMA to do this efficiently,
> > > > > > > > > > > > > so an admin virtqueue is a
> > > > > > > > > > > > > good fit again.
> > > > > > > > > > > > I don't get the point here. You still need to address exactly the same
> > > > > > > > > > > > issues for the admin virtqueue: the unbounded time in freezing the
> > > > > > > > > > > > device, and the interaction with the virtio device status state machine.
> > > > > > > > > > > Device state can be large so a register interface would be a
> > > > > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > > > > saving/loading device state.
> > > > > > > > > > So this patch doesn't mandate a register interface, does it?
> > > > > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> > > > > > > > > it as a register interface.
> > > > > > > > > 
> > > > > > > > > > And DMA doesn't mean a virtqueue; it could be a transport specific method.
> > > > > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > > > > > 
> > > > > > > > > > I think we need to start from defining the state of one
> > > > > > > > > > specific device and
> > > > > > > > > > see what is the best interface.
> > > > > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > > > > > > 
> > > > > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > > > > state of the device or just state that is unknown to the VMM.
> > > > > > > > I think we've discussed this in the past. It can't work since:
> > > > > > > > 
> > > > > > > > 1) The state and its format must be clearly defined in the spec
> > > > > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > > > > Some devices need implementation-specific state. They should still be
> > > > > > > able to live migrate even if it means cross-implementation migration and
> > > > > > > debug-ability is not possible.
> > > > > > I think we need to re-visit this conclusion. Migration compatibility is
> > > > > > pretty important, especially considering the software stack has spent a
> > > > > > huge amount of effort in maintaining it.
> > > > > >
> > > > > > Say a virtio hardware implementation breaks this; it means we will lose
> > > > > > all the advantages of being a standard device.
> > > > > >
> > > > > > If we can't do live migration among:
> > > > > >
> > > > > > 1) different backends, e.g migrating from virtio hardware to virtio
> > > > > > software
> > > > > > 2) different vendors
> > > > > >
> > > > > > we fail as a standard device and the customer is in fact locked to the
> > > > > > vendor implicitly.
> > > > > > 
> > > > > > 
> > > > > > > > 3) Not a proper uAPI design
> > > > > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > > > > opaque data from devices to userspace. Allowing an
> > > > > > > implementation-specific device state representation is nothing new. VFIO
> > > > > > > already does it.
> > > > > > I think we've already had a lot of discussion for VFIO but without a
> > > > > > conclusion. Maybe we need the verdict from Linus or Greg (not sure if
> > > > > > it's too late). But that's not related to virtio and this thread.
> > > > > >
> > > > > > What you propose here kind of conflicts with the efforts of virtio. I
> > > > > > think we all agree that we should define the state in the spec.
> > > > > > Assuming this is correct:
> > > > > >
> > > > > > 1) why do we still offer opaque migration state to userspace?
> > > > > > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > > > > > migration byte stream?
> > > > > >
> > > > > > We should standardize everything that is visible by the driver to be a
> > > > > > standard device. That's the power of virtio.
> > > > > > 
> > > > > > 
> > > > > > > > > That's
> > > > > > > > > basically the difference between the vhost/vDPA's selective
> > > > > > > > > passthrough
> > > > > > > > > approach and VFIO's full passthrough approach.
> > > > > > > > We can't do VFIO full passthrough for migration anyway; some kind of
> > > > > > > > mdev is required, but that duplicates the current vp_vdpa driver.
> > > > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > > > achieved without mdev:
> > > > > > > 1. Define a migration PCI Capability that indicates support for
> > > > > > >       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > > > > > >       the migration interface in hardware instead of an mdev driver.
> > > > > > So I think it still depends on the driver to implement migration state,
> > > > > > which is vendor specific.
> > > > > >
> > > > > > Note that it's just a uAPI definition, not something defined in the PCI
> > > > > > spec.
> > > > > >
> > > > > > Out of curiosity, the patch was merged without any real users in Linux.
> > > > > > This is very bad since we lose the chance to audit the whole design.
> > > > > > 
> > > > > > 
> > > > > > > 2. The VMM either uses the migration PCI Capability directly from
> > > > > > >       userspace or core VFIO PCI code advertises
> > > > > > > VFIO_REGION_TYPE_MIGRATION
> > > > > > >       to userspace so migration can proceed in the same way as with
> > > > > > >       VFIO/mdev drivers.
> > > > > > > 3. The PCI Capability is not passed through to the guest.
> > > > > > This brings trouble in the nested environment.
> > > > > > 
> > > > > > Thanks
> > > > > > 
> > > > > > 
> > > > > > > Changpeng Liu originally mentioned the idea of defining a migration PCI
> > > > > > > Capability.
> > > > > > > 
> > > > > > > > >      For example, some of the
> > > > > > > > > virtio-net state is available to the VMM with vhost/vDPA because it
> > > > > > > > > intercepts the virtio-net control virtqueue.
> > > > > > > > > 
> > > > > > > > > Also, we need to decide to what degree the device state representation
> > > > > > > > > is standardized in the VIRTIO specification.
> > > > > > > > I think all the states must be defined in the spec otherwise the device
> > > > > > > > can't claim it supports migration at virtio level.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > >      I think it makes sense to
> > > > > > > > > standardize it if it's possible to convey all necessary
> > > > > > > > > state and device
> > > > > > > > > implementors can easily implement this device state representation.
> > > > > > > > I suspect it's highly device specific. E.g can we standardize device
> > > > > > > > (GPU) memory?
> > > > > > > For devices that have little internal state it's possible to define a
> > > > > > > standard device state representation.
> > > > > > > 
> > > > > > > For other devices, like virtio-crypto, virtio-fs, etc it becomes
> > > > > > > difficult because the device implementation contains state that will be
> > > > > > > needed but is very specific to the implementation. These devices *are*
> > > > > > > migratable but they don't have standard state. Even here there is a
> > > > > > > spectrum:
> > > > > > > - Host OS-specific state (e.g. Linux struct file_handles)
> > > > > > > - Library-specific state (e.g. crypto library state)
> > > > > > > - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
> > > > > > > 
> > > > > > > This is why I think it's necessary to support both standard device state
> > > > > > > representations and implementation-specific device state
> > > > > > > representations.
> > > > > Having two ways will bring extra complexity. That's why I suggest:
> > > > >
> > > > > - to have a general facility for the virtqueue to be migrated
> > > > > - to leave the device specific state device specific, so the device can
> > > > > choose whatever way or interface is convenient.
> > > > I don't think we have a choice. For stateful devices it can be
> > > > impossible to define a standard device state representation.
> > > 
> > > Let me clarify, I agree we can't have a standard device state for all kinds
> > > of device.
> > > 
> > > That's why I tend to leave them device specific (but not implementation
> > > specific).
> > Unfortunately device state is sometimes implementation-specific. Not
> > because the device is proprietary, but because the actual state is
> > meaningless to other implementations.
> > 
> > I mentioned virtiofs as an example where file system backends can be
> > implemented in completely different ways so the device state cannot be
> > migrated between implementations.
> 
> 
> So let me clarify my understanding, we have two kinds of state:
> 
> 1) implementation specific state that is not noticeable by the driver
> 2) device specific state that is noticeable by the driver
> 
> We are not interested in 1).
> 
> For 2) it's what needs to be defined in the spec. If we fail to generalize
> the device specific state, it can't be used by a standard virtio driver. Or
> maybe you can give a concrete example of how virtio-fs fails in doing this?

2) is what I mean when I say a "stateful" device. I agree, 1) is not
relevant to this discussion because we don't need to migrate internal
device state that the driver cannot interact with.

The virtiofs device has an OPEN request for opening a file. Live
migration must transfer the list of open files from the source to the
destination device so the driver can continue accessing files it
previously had open.

However, the result of the OPEN request is a number similar to a POSIX
fd, not the full device-internal state associated with an open file.
After migration the driver expects to continue using the number to
operate on the file. We must transfer the open file state to the
destination device.

Different device implementations may have completely different concepts
of open file state:

- An in-memory file system. The list of open files is a list of
  in-memory inodes. We'll also need to transfer the entire contents of
  the files/directories since it's in-memory and not shared with the
  destination device.

- A passthrough Linux file system. We need to transfer the Linux file
  handles (see open_by_handle_at(2)) so the destination device can open
  the inodes on the underlying shared host file system.

- A distributed object storage API. We need to transfer the list of
  open object IDs so the destination device can perform I/O to the same
  objects.

- Anyone can create a custom virtiofs device implementation and it will
  rely on different open file state.

I imagine virtio-gpu and virtio-crypto might have similar situations
where an object created through a virtqueue request has device-internal
state associated with it that must be migrated.

> > > But we can generalize the virtqueue state for sure.
> > I agree and also that some device types can standardize their device
> > state representations. But I think it's a technical requirement to
> > support implementation-specific state for device types where
> > cross-implementation migration is not possible.
> 
> 
> A question here: if the driver depends on implementation specific state,
> how can we make sure that driver can work with other implementations? If
> we're sure that a single driver can work with all kinds of implementations,
> it means we have device specific state, not implementation specific state.

I think this is confusing stateless and stateful devices. You are
describing a stateless device here. I'll try to define a stateful
device:

A stateful device maintains state that the driver operates on indirectly
via standard requests. For example, the virtio-crypto device has
CREATE_SESSION requests and a session ID is returned to the driver so
further requests can be made on the session object. It may not be
possible to replay, reconnect, or restart the device without losing
state.

I hope that this description, together with the virtiofs specifics
above, make the problem clearer.

> > I'm not saying the implementation-specific state representation has to
> > be a binary blob. There could be an identifier registry to ensure live
> > migration compatibility checks can be performed. There could also be a
> > standard binary encoding for migration data.
> 
> 
> Yes, such requirements have been well studied in the past. There should be
> plenty of protocols to do this.
> 
> 
> >   But the contents will be
> > implementation-specific for some devices.
> 
> 
> If we allow this, it breaks the spec effort for having standard devices.
> And it will block real customers.

If we forbid this then devices for which migration is technically
possible will be unmigratable. Both users and device implementors will
find other solutions, like VFIO, so I don't think we can stop them even
if we tried.

I recognize that opaque device state poses a risk to migration
compatibility, because device implementors may arbitrarily use opaque
state when a standard is available.

However, the way to avoid this scenario is by:

1. Making the standard migration approach the easiest to implement
   because everything has been taken care of. It will save implementors
   the headache of defining and coding their own device state
   representations and versioning.

2. Educating users about migration compatibility so they can identify
   when implementors are locking in their users.

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21  2:52                                       ` Jason Wang
@ 2021-07-21 10:42                                         ` Stefan Hajnoczi
  2021-07-22  2:08                                           ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-21 10:42 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Wed, Jul 21, 2021 at 10:52:15AM +0800, Jason Wang wrote:
> 
> On 2021/7/20 6:19 PM, Stefan Hajnoczi wrote:
> > On Tue, Jul 20, 2021 at 11:02:42AM +0800, Jason Wang wrote:
> > > On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
> > > > On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
> > > > > On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> > > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > > 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
> > > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > > > > >           If I understand correctly, this is all
> > > > > > > > > > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > > > > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > > > > > It just requires a new status bit. Anything that makes you think it's hard
> > > > > > > > > > > > > > > to implement?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > > > > > > > > > > > virtqueue state.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > > > It use the common technology like register shadowing without any further
> > > > > > > > > > > > > > > stuffs.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > > > > > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > > > Probably not. Let me to clarify several points:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > - This proposal has nothing to do with admin virtqueue. Actually, admin
> > > > > > > > > > > > > virtqueue could be used for carrying any basic device facility like status
> > > > > > > > > > > > > bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
> > > > > > > > > > > > > for device slicing at virtio level.
> > > > > > > > > > > > > - Even if we had introduced admin virtqueue, we still need a per function
> > > > > > > > > > > > > interface for this. This is a must for nested virtualization, we can't
> > > > > > > > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > > > > > > > - According to the proposal, there's no need for the device to complete all
> > > > > > > > > > > > > the consumed buffers; the device can choose to expose the in-flight
> > > > > > > > > > > > > descriptors in a device-specific way and set the STOP bit. This means that,
> > > > > > > > > > > > > with a device-specific in-flight descriptor reporting facility, the device
> > > > > > > > > > > > > can set the STOP bit almost immediately.
> > > > > > > > > > > > > - If we don't go with the basic device facility but instead use an admin
> > > > > > > > > > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > > > > > > > > > device status state machine; it would introduce some kind of sub-states,
> > > > > > > > > > > > > which looks much more complicated than the current proposal.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > > > > > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > > > > > > > > > time).
> > > > > > > > > > > > > Well, you need some kind of waiting for sure; the device/DMA needs some
> > > > > > > > > > > > > time to be stopped. The downtime is determined by the specific virtio
> > > > > > > > > > > > > implementation, which is hard to restrict at the spec level. We could
> > > > > > > > > > > > > require that the device set the STOP bit within e.g. 100ms.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > >         It can be implemented concurrently (setting the STOP bit on all
> > > > > > > > > > > > > > devices and then looping until all their Device Status fields have the
> > > > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > > > I still don't get what kind of complexity you are worried about here.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > > > waiting...
> > > > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > > > > > > > > 
> > > > > > > > > > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Since this was already required for at least one transport, we need to do
> > > > > > > > > > > > > something similar when introducing this basic facility.
> > > > > > > > > > > > Adding STOP as a Device Status bit is a small and clean VIRTIO
> > > > > > > > > > > > spec change. I like that.
> > > > > > > > > > > > 
> > > > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > > > unbounded. For example, software virtio-blk/scsi implementations
> > > > > > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > > > 
> > > > > > > > > > > > The natural interface for long-running operations is virtqueue requests.
> > > > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > > > instead of a Device Status bit.
> > > > > > > > > > > So I'm not against the admin virtqueue. As said before, admin virtqueue
> > > > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > > > 
> > > > > > > > > > > Send a command to set the STOP status bit over the admin virtqueue. The
> > > > > > > > > > > device will mark the command buffer used after it has successfully stopped.
> > > > > > > > > > > 
> > > > > > > > > > > AFAIK, they are not mutually exclusive, since they are trying to solve
> > > > > > > > > > > different problems.
> > > > > > > > > > > 
> > > > > > > > > > > Device status - basic device facility
> > > > > > > > > > > 
> > > > > > > > > > > Admin virtqueue - transport/device specific way to implement (part of) the
> > > > > > > > > > > device facility
> > > > > > > > > > > 
> > > > > > > > > > > > Although you mentioned that the stopped state needs to be reflected in
> > > > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > > > migrated.
> > > > > > > > > > > The guest won't see the real device status bit. VMM will shadow the device
> > > > > > > > > > > status bit in this case.
> > > > > > > > > > > 
> > > > > > > > > > > E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost
> > > > > > > > > > > device; the guest is unaware of the migration.
> > > > > > > > > > > 
> > > > > > > > > > > The STOP status bit is set by QEMU on the real virtio hardware, but the
> > > > > > > > > > > guest will only see DRIVER_OK without STOP.
> > > > > > > > > > > 
> > > > > > > > > > > It's not hard to implement the nested case on top; see the discussion
> > > > > > > > > > > initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for
> > > > > > > > > > > nested live migration.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > >        In fact, the VMM would need to hide this bit and it's safer to
> > > > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > > > See above, VMM may choose to hide or expose the capability. It's useful for
> > > > > > > > > > > migrating a nested guest.
> > > > > > > > > > > 
> > > > > > > > > > > If we design an interface that can't be used in the nested environment,
> > > > > > > > > > > it's not an ideal interface.
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > In addition, stateful devices need to load/save non-trivial amounts of
> > > > > > > > > > > > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > > > > > > > > > > > good fit again.
> > > > > > > > > > > I don't get the point here. You still need to address very similar issues
> > > > > > > > > > > for the admin virtqueue: the unbounded time to freeze the device, and the
> > > > > > > > > > > interaction with the virtio device status state machine.
> > > > > > > > > > Device state can be large, so a register interface would be a
> > > > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > > > saving/loading device state.
> > > > > > > > > So this patch doesn't mandate a register interface, does it?
> > > > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> > > > > > > > it as a register interface.
> > > > > > > > 
> > > > > > > > > And DMA
> > > > > > > > > doesn't mean a virtqueue; it could be a transport-specific method.
> > > > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > > > > 
> > > > > > > > > I think we need to start by defining the state of one specific device and
> > > > > > > > > see what the best interface is.
> > > > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > > > > > 
> > > > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > > > state of the device or just state that is unknown to the VMM.
> > > > > > > I think we've discussed this in the past. It can't work since:
> > > > > > > 
> > > > > > > 1) The state and its format must be clearly defined in the spec
> > > > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > > > Some devices need implementation-specific state. They should still be
> > > > > > able to live migrate even if it means cross-implementation migration and
> > > > > > debug-ability is not possible.
> > > > > I think we need to re-visit this conclusion. Migration compatibility is
> > > > > pretty important, especially considering that the software stack has spent
> > > > > a huge amount of effort maintaining it.
> > > > > 
> > > > > If a virtio hardware implementation were to break this, we would lose all
> > > > > the advantages of being a standard device.
> > > > > 
> > > > > If we can't do live migration among:
> > > > > 
> > > > > 1) different backends, e.g. migrating from virtio hardware to virtio software
> > > > > 2) different vendors
> > > > > 
> > > > > we fail to stay a standard device, and the customer is in fact implicitly
> > > > > locked in by the vendor.
> > > > My virtiofs device implementation is backed by an in-memory file system.
> > > > The device state includes the contents of each file.
> > > > 
> > > > Your virtiofs device implementation uses Linux file handles to keep
> > > > track of open files. The device state includes Linux file handles (but
> > > > not the contents of each file) so the destination host can access the
> > > > same files on shared storage.
> > > > 
> > > > Cornelia's virtiofs device implementation is backed by an object storage
> > > > HTTP API. The device state includes API object IDs.
> > > > 
> > > > The device state is implementation-dependent. There is no standard
> > > > representation and it's not possible to migrate between device
> > > > implementations. How are they supposed to migrate?
> > > 
> > > So if I understand correctly, virtio-fs is not designed to be migratable?
> > > 
> > > (Having a look at the current virtio-fs support in QEMU, it looks to me
> > > like it has a migration blocker).
> > The code does not support live migration but it's on the roadmap. Max
> > Reitz added Linux file handle support to virtiofsd. That was the first
> > step towards being able to migrate the device's state.
> 
> 
> A dumb question: how does QEMU know it is connected to virtiofsd?

virtiofsd is a vhost-user-fs device. QEMU doesn't know if it's connected
to virtiofsd or another implementation.

> > > > This is why I think it's necessarily to allow implementation-specific
> > > > device state representations.
> > > 
> > > Or you probably mean you don't support cross-backend migration. This sounds
> > > like a drawback; it's then actually not a standard device but a
> > > vendor/implementation-specific device.
> > > 
> > > It would bring a lot of trouble, not only for the implementation but for
> > > the management layer. Maybe we can start by adding migration support for
> > > some specific backend and go from there.
> > Yes, it's complicated. Some implementations could be compatible, but
> > others can never be compatible because they have completely different
> > state.
> > 
> > The virtiofsd implementation is the main one for virtiofs and the device
> > state representation can be published, even standardized. Others can
> > implement it to achieve migration compatibility.
> > 
> > But it must be possible for implementations that have completely
> > different state to migrate too. virtiofsd isn't special.
> > 
> > > > > > > 3) Not a proper uAPI design
> > > > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > > > opaque data from devices to userspace. Allowing an
> > > > > > implementation-specific device state representation is nothing new. VFIO
> > > > > > already does it.
> > > > > I think we've already had a lot of discussion about VFIO but without a
> > > > > conclusion. Maybe we need a verdict from Linus or Greg (not sure if it's
> > > > > too late). But that's not related to virtio or this thread.
> > > > > 
> > > > > What you propose here somewhat conflicts with the efforts of virtio. I
> > > > > think we all agree that we should define the state in the spec. Assuming
> > > > > this is correct:
> > > > > 
> > > > > 1) why do we still offer opaque migration state to userspace?
> > > > See above. Stateful devices may require an implementation-defined device
> > > > state representation.
> > > 
> > > So my point still stands: it's not a standard device if we do this.
> > These "non-standard devices" still need to be able to migrate.
> 
> 
> See other thread, it breaks the effort of having a spec.
>
> >   How
> > should we do that?
> 
> 
> I think the main issue is that, to me, it's not a virtio device but a device
> that uses virtqueues with implementation-specific state. So it can't
> be migrated by the virtio subsystem, only through a vendor/implementation
> specific migration driver.

Okay. Are you thinking about a separate set of vDPA APIs and vhost
ioctls so the VMM can save/load implementation-specific device state?
These separate APIs just need to be called as part of the standard
VIRTIO stop and vq save/load lifecycle.
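
To make that concrete, such an API could look roughly like the sketch
below. Everything in it is hypothetical: the ioctl numbers, the struct,
and the names are invented for illustration and are not part of the
existing vhost uAPI.

```c
/*
 * Purely illustrative sketch -- these ioctls and this struct do NOT
 * exist in the vhost uAPI today. The idea: the VMM transfers an opaque,
 * implementation-specific device state blob around the standard
 * stop / vq save/load sequence.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct vhost_device_state {
	uint64_t len;  /* in: buffer capacity; out: bytes of opaque state */
	uint64_t addr; /* userspace buffer holding the opaque state blob */
};

/* 0xAF is the existing VHOST_VIRTIO ioctl type; 0x80/0x81 are made up. */
#define VHOST_GET_DEVICE_STATE _IOWR(0xAF, 0x80, struct vhost_device_state)
#define VHOST_SET_DEVICE_STATE _IOW(0xAF, 0x81, struct vhost_device_state)
```

The VMM would issue GET on the source after stopping the device and SET
on the destination before DRIVER_OK, treating the blob as opaque.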

> > 
> > > > > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > > > > migration bytes stream?
> > > > Opaque data like D-Bus VMState:
> > > > https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
> > > 
> > > Actually, I meant how to keep the opaque state compatible with all
> > > the existing devices that can do migration.
> > > 
> > > E.g. we want to live migrate virtio-blk among arbitrary backends (from a
> > > hardware device to a software backend).
> > There was a series of email threads last year where migration
> > compatibility was discussed:
> > 
> > https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg02620.html
> > 
> > I proposed an algorithm for checking migration compatibility between
> > devices. The source and destination device can have different
> > implementations (e.g. hardware, software, etc).
> > 
> > It involves picking an identifier like virtio-spec.org/pci/virtio-net
> > for the device state representation and device parameters for aspects of
> > the device that vary between instances (e.g. tso=on|off).
> > 
> > It's more complex than today's live migration approach in libvirt and
> > QEMU. Today libvirt configures the source and destination in a
> > compatible manner (thanks to knowledge of the device implementation) and
> > then QEMU transfers the device state.
> > 
> > Part of the point of defining a migration compatibility algorithm is
> > that it's possible to lift the assumptions out of libvirt so that
> > arbitrary device implementations can be supported (hardware, software,
> > etc) without putting knowledge about every device/VMM implementation
> > into libvirt.
> > 
> > (The other advantage is that this allows orchestration software to
> > determine migration compatibility before starting a migration.)
> 
> 
> This looks like another independent issue, and I fully agree we should have
> a better migration protocol. But using that means we break migration
> compatibility with the existing devices, which have been in use for more
> than 10 years. We still need to make migration from/to the existing virtio
> devices work.

I agree that migrating to/from existing devices needs to work. It should
be possible to transition without breaking migration.
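
The compatibility-check idea quoted above can be sketched roughly like
this; the identifier string, parameter names, and types are invented for
illustration, not an existing API:

```c
/*
 * Sketch of the migration-compatibility check discussed above: both
 * sides advertise a device-state identifier (e.g.
 * "virtio-spec.org/pci/virtio-net") plus instance parameters, and the
 * orchestrator compares them before starting a migration.
 */
#include <stdbool.h>
#include <string.h>

struct dev_param { const char *key, *value; };

struct dev_desc {
	const char *id;                 /* device-state representation */
	const struct dev_param *params; /* e.g. { "tso", "on" } */
	int nparams;
};

static const char *param_lookup(const struct dev_desc *d, const char *key)
{
	for (int i = 0; i < d->nparams; i++)
		if (strcmp(d->params[i].key, key) == 0)
			return d->params[i].value;
	return NULL;
}

/* Compatible iff same representation and every source parameter matches. */
static bool migration_compatible(const struct dev_desc *src,
				 const struct dev_desc *dst)
{
	if (strcmp(src->id, dst->id) != 0)
		return false;
	for (int i = 0; i < src->nparams; i++) {
		const char *v = param_lookup(dst, src->params[i].key);
		if (!v || strcmp(v, src->params[i].value) != 0)
			return false;
	}
	return true;
}
```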

> > > > > > > > That's
> > > > > > > > basically the difference between the vhost/vDPA's selective passthrough
> > > > > > > > approach and VFIO's full passthrough approach.
> > > > > > > We can't do VFIO full passthrough for migration anyway; some kind of mdev
> > > > > > > is required, but that duplicates the current vp_vdpa driver.
> > > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > > achieved without mdev:
> > > > > > 1. Define a migration PCI Capability that indicates support for
> > > > > >       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > > > > >       the migration interface in hardware instead of an mdev driver.
> > > > > So I think it still depends on the driver to implement the migration state,
> > > > > which is vendor specific.
> > > > The current VFIO migration interface depends on a device-specific
> > > > software mdev driver but here I'm showing that the physical device can
> > > > implement the migration interface so that no device-specific driver code
> > > > is needed.
> > > 
> > > This is not what I read from the patch:
> > > 
> > >   * device_state: (read/write)
> > >   *      - The user application writes to this field to inform the vendor
> > > driver
> > >   *        about the device state to be transitioned to.
> > >   *      - The vendor driver should take the necessary actions to change the
> > >   *        device state. After successful transition to a given state, the
> > >   *        vendor driver should return success on write(device_state, state)
> > >   *        system call. If the device state transition fails, the vendor
> > > driver
> > >   *        should return an appropriate -errno for the fault condition.
> > > 
> > > Vendor driver need to mediate between the uAPI and the actual device.
> > Yes, that's the current state of VFIO migration. If a hardware interface
> > (e.g. PCI Capability) is defined that maps to this API then no
> > device-specific drivers would be necessary because core VFIO PCI code
> > can implement the uAPI by talking to the hardware.
> 
> 
> As we discussed, it would be very hard. The device state is implementation
> specific and may not fit in a Capability. (PCIe already has VF
> migration state in the SR-IOV extended capability).
> 
> 
> > 
> > > > > > 2. The VMM either uses the migration PCI Capability directly from
> > > > > >       userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
> > > > > >       to userspace so migration can proceed in the same way as with
> > > > > >       VFIO/mdev drivers.
> > > > > > 3. The PCI Capability is not passed through to the guest.
> > > > > This brings troubles in the nested environment.
> > > > It depends on the device splitting/management design. If L0 wishes to
> > > > let L1 manage the VFs then it would need to expose a management device.
> > > > Since the migration interface is generic (not device-specific) a generic
> > > > management device solves this for all devices.
> > > 
> > > Right, but it's a burden to expose the management device, or it may just
> > > not work.
> > A single generic management device is not a huge burden and it may turn
> > out that keeping the management device out-of-band is actually a
> > desirable feature if the device owner does not wish to expose the
> > stop/save/load functionality for some reason.
> 
> 
> VMM are free to hide those features from guest. Management can just do
> -device virtio-pci,state=false
> 
> Having a management device works for L0 but is not suitable for L(x>0). A per
> function device interface is a must for nested virt to work in a simple
> and easy way.

You are right, a per function interface is simplest. I'm not experienced
enough with SR-IOV and nested virtualization to have a strong opinion in
this area.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-20 13:09                                         ` Max Gurtovoy
  2021-07-21  3:06                                           ` Jason Wang
@ 2021-07-21 10:48                                           ` Stefan Hajnoczi
  2021-07-21 11:37                                             ` Max Gurtovoy
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-21 10:48 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 4257 bytes --]

On Tue, Jul 20, 2021 at 04:09:27PM +0300, Max Gurtovoy wrote:
> 
> On 7/20/2021 3:57 PM, Stefan Hajnoczi wrote:
> > On Tue, Jul 20, 2021 at 03:27:00PM +0300, Max Gurtovoy wrote:
> > > On 7/20/2021 6:02 AM, Jason Wang wrote:
> > > > On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
> > > > > On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
> > > > > > On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
> > > > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > > > On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
> > > > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > > > On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > > > On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > On Fri, Jul 09, 2021
> > > > > > > > > > > > > > > > > at 07:23:33PM +0200,
> > > > > > > > > > > > > > > > > Eugenio Perez Martin
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > That's
> > > > > > > > > basically the difference between the vhost/vDPA's
> > > > > > > > > selective passthrough
> > > > > > > > > approach and VFIO's full passthrough approach.
> > > > > > > > We can't do VFIO full pasthrough for migration anyway,
> > > > > > > > some kind of mdev is
> > > > > > > > required but it's duplicated with the current vp_vdpa driver.
> > > > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > > > achieved without mdev:
> > > > > > > 1. Define a migration PCI Capability that indicates support for
> > > > > > >       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device
> > > > > > > to implement
> > > > > > >       the migration interface in hardware instead of an mdev driver.
> > > > > > So I think it still depend on the driver to implement migrate
> > > > > > state which is
> > > > > > vendor specific.
> > > > > The current VFIO migration interface depends on a device-specific
> > > > > software mdev driver but here I'm showing that the physical device can
> > > > > implement the migration interface so that no device-specific driver code
> > > > > is needed.
> > > > 
> > > > This is not what I read from the patch:
> > > > 
> > > >   * device_state: (read/write)
> > > >   *      - The user application writes to this field to inform the vendor
> > > > driver
> > > >   *        about the device state to be transitioned to.
> > > >   *      - The vendor driver should take the necessary actions to change
> > > > the
> > > >   *        device state. After successful transition to a given state, the
> > > >   *        vendor driver should return success on write(device_state,
> > > > state)
> > > >   *        system call. If the device state transition fails, the vendor
> > > > driver
> > > >   *        should return an appropriate -errno for the fault condition.
> > > > 
> > > > Vendor driver need to mediate between the uAPI and the actual device.
> > > We've been building an infrastructure for VFIO PCI devices over the last
> > > few months.
> > > 
> > > It should be merged hopefully to kernel 5.15.
> > Do you have links to patch series or a brief description of the VFIO API
> > features that are on the roadmap?
> 
> We divided it into a few patchsets.
> 
> The entire series can be found at:
> 
> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
> 
> We'll first add suspend/resume support for mlx5 devices (ConnectX-6 and
> above).
> 
> The driver is ready in the series above.

I looked briefly and it seems to implement the existing
VFIO_REGION_TYPE_MIGRATION API for mlx5 devices? I thought
"infrastructure for VFIO PCI devices" meant you were adding new
VFIO/mdev migration APIs.
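
For context, userspace drives that v1 migration region roughly as
sketched below. The device_state field sits at the start of the region
and the VFIO_DEVICE_STATE_* values follow the v1 uAPI as merged around
Linux 5.8; treat the helper itself as illustrative rather than a
complete client.

```c
/*
 * Sketch: driving the VFIO_REGION_TYPE_MIGRATION (v1) interface from
 * userspace. device_state is the first field of the migration region;
 * error handling and the data-section plumbing are elided.
 */
#include <stdint.h>
#include <unistd.h>

#define VFIO_DEVICE_STATE_STOP     (0u)
#define VFIO_DEVICE_STATE_RUNNING  (1u << 0)
#define VFIO_DEVICE_STATE_SAVING   (1u << 1)
#define VFIO_DEVICE_STATE_RESUMING (1u << 2)

/* device_state lives at offset 0 of the migration region. */
static int vfio_set_device_state(int device_fd, uint64_t region_offset,
				 uint32_t state)
{
	ssize_t n = pwrite(device_fd, &state, sizeof(state), region_offset);

	return n == (ssize_t)sizeof(state) ? 0 : -1;
}

/*
 * Pre-copy: keep the device running while saving (RUNNING | SAVING),
 * then drop RUNNING to stop it and drain the remaining state.
 */
```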

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21 10:48                                           ` Stefan Hajnoczi
@ 2021-07-21 11:37                                             ` Max Gurtovoy
  0 siblings, 0 replies; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-21 11:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin,
	Dr. David Alan Gilbert, virtio-comment, Virtio-Dev,
	Cornelia Huck, Oren Duer, Shahaf Shuler, Parav Pandit,
	Bodong Wang, Alexander Mikheev, Halil Pasic


On 7/21/2021 1:48 PM, Stefan Hajnoczi wrote:
> On Tue, Jul 20, 2021 at 04:09:27PM +0300, Max Gurtovoy wrote:
>> On 7/20/2021 3:57 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jul 20, 2021 at 03:27:00PM +0300, Max Gurtovoy wrote:
>>>> On 7/20/2021 6:02 AM, Jason Wang wrote:
>>>>> On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
>>>>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021
>>>>>>>>>>>>>>>>>> at 07:23:33PM +0200,
>>>>>>>>>>>>>>>>>> Eugenio Perez Martin
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>> That's
>>>>>>>>>> basically the difference between the vhost/vDPA's
>>>>>>>>>> selective passthrough
>>>>>>>>>> approach and VFIO's full passthrough approach.
>>>>>>>>> We can't do VFIO full pasthrough for migration anyway,
>>>>>>>>> some kind of mdev is
>>>>>>>>> required but it's duplicated with the current vp_vdpa driver.
>>>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>>>> achieved without mdev:
>>>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>>>        VFIO_REGION_TYPE_MIGRATION. This allows the PCI device
>>>>>>>> to implement
>>>>>>>>        the migration interface in hardware instead of an mdev driver.
>>>>>>> So I think it still depend on the driver to implement migrate
>>>>>>> state which is
>>>>>>> vendor specific.
>>>>>> The current VFIO migration interface depends on a device-specific
>>>>>> software mdev driver but here I'm showing that the physical device can
>>>>>> implement the migration interface so that no device-specific driver code
>>>>>> is needed.
>>>>> This is not what I read from the patch:
>>>>>
>>>>>    * device_state: (read/write)
>>>>>    *      - The user application writes to this field to inform the vendor
>>>>> driver
>>>>>    *        about the device state to be transitioned to.
>>>>>    *      - The vendor driver should take the necessary actions to change
>>>>> the
>>>>>    *        device state. After successful transition to a given state, the
>>>>>    *        vendor driver should return success on write(device_state,
>>>>> state)
>>>>>    *        system call. If the device state transition fails, the vendor
>>>>> driver
>>>>>    *        should return an appropriate -errno for the fault condition.
>>>>>
>>>>> Vendor driver need to mediate between the uAPI and the actual device.
>>>> We're building an infrastructure for VFIO PCI devices in the last few
>>>> months.
>>>>
>>>> It should be merged hopefully to kernel 5.15.
>>> Do you have links to patch series or a brief description of the VFIO API
>>> features that are on the roadmap?
>> we devided it to few patchsets .
>>
>> The entire series can be found at:
>>
>> https://github.com/jgunthorpe/linux/commits/mlx5_vfio_pci
>>
>> We'll first add support for mlx5 devices suspend/resume (ConnectX-6 and
>> above).
>>
>> The driver is ready in the series above.
> I looked briefly and it seems to implement the existing
> VFIO_REGION_TYPE_MIGRATION API for mlx5 devices? I thought
> "infrastructure for VFIO PCI devices" meant you were adding new
> VFIO/mdev migration APIs.

No, why do we need a new API?

We created an infrastructure for vendors to develop
vendor-specific/protocol-specific vfio_pci drivers.

These drivers can add support for migration.

The next driver to be developed in our context is virtio_vfio_pci.

And for that we probably need a standard. I'd prefer that we not develop an
mlx_virtio_vfio_pci driver for NVIDIA virtio PCI devices but instead have a
standard way to do migration.

> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21  3:09                                       ` Jason Wang
@ 2021-07-21 11:43                                         ` Max Gurtovoy
  2021-07-22  2:01                                           ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Max Gurtovoy @ 2021-07-21 11:43 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 7/21/2021 6:09 AM, Jason Wang wrote:
>
> On 2021/7/20 8:27 PM, Max Gurtovoy wrote:
>>
>> On 7/20/2021 6:02 AM, Jason Wang wrote:
>>>
>>> On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
>>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez 
>>>>>>>>>>>>>>>> Martin wrote:
>>>>>>>>>>>>>>>>>>> If I understand correctly, this is all
>>>>>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this 
>>>>>>>>>>>>>>>>>>> to work
>>>>>>>>>>>>>>>>>>> the guest must be running and already have 
>>>>>>>>>>>>>>>>>>> initialised the driver.
>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the 
>>>>>>>>>>>>>>>>> VMM as long as
>>>>>>>>>>>>>>>>> it intercept the relevant configuration space (PCI, 
>>>>>>>>>>>>>>>>> MMIO, etc) from
>>>>>>>>>>>>>>>>> guest's reads and writes, and present it as coherent 
>>>>>>>>>>>>>>>>> and transparent
>>>>>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a 
>>>>>>>>>>>>>>>>> physical device (or
>>>>>>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The 
>>>>>>>>>>>>>>>>> guest cannot stop
>>>>>>>>>>>>>>>>> the device, so any write to this flag is an 
>>>>>>>>>>>>>>>>> error/undefined.
>>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can 
>>>>>>>>>>>>>>>>> stop the device.
>>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live 
>>>>>>>>>>>>>>>>> migration, and the
>>>>>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. 
>>>>>>>>>>>>>>>>> It resets the
>>>>>>>>>>>>>>>>> destination device with the state, and then 
>>>>>>>>>>>>>>>>> initializes the device.
>>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is 
>>>>>>>>>>>>>>>>> set, the source
>>>>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM 
>>>>>>>>>>>>>>>>> realizes the bit,
>>>>>>>>>>>>>>>>> so it sets the bit in the destination too after device 
>>>>>>>>>>>>>>>>> initialization.
>>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest, so it
>>>>>>>>>>>>>>>>> doesn't matter
>>>>>>>>>>>>>>>>> what bit the HW has, but the VM can be migrated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump 
>>>>>>>>>>>>>>>> through though.
>>>>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes 
>>>>>>>>>>>>>>> you think it's hard
>>>>>>>>>>>>>>> to implement?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E.g. for a networking device, it should be sufficient to 
>>>>>>>>>>>>>>> use this bit + the
>>>>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Why don't we design the feature in a way that is 
>>>>>>>>>>>>>>>> useable by VMMs
>>>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>>>> It uses common techniques like register shadowing
>>>>>>>>>>>>>>> without any further machinery.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (I think we all know migration will be very hard if we 
>>>>>>>>>>>>>>> simply pass through
>>>>>>>>>>>>>>> those state registers).
>>>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device 
>>>>>>>>>>>>>> Status field
>>>>>>>>>>>>>> bit then there's no need to re-read the Device Status 
>>>>>>>>>>>>>> field in a loop
>>>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - This proposal has nothing to do with admin virtqueue. 
>>>>>>>>>>>>> Actually, admin
>>>>>>>>>>>>> virtqueue could be used for carrying any basic device 
>>>>>>>>>>>>> facility like status
>>>>>>>>>>>>> bit. E.g I'm going to post patches that use admin 
>>>>>>>>>>>>> virtqueue as a "transport"
>>>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>>>> - Even if we had introduced admin virtqueue, we still need 
>>>>>>>>>>>>> a per function
>>>>>>>>>>>>> interface for this. This is a must for nested
>>>>>>>>>>>>> virtualization; we can't always expect things like the PF
>>>>>>>>>>>>> to be assignable to the L1 guest.
>>>>>>>>>>>>> - According to the proposal, there's no need for the 
>>>>>>>>>>>>> device to complete all
>>>>>>>>>>>>> the consumed buffers, device can choose to expose those 
>>>>>>>>>>>>> inflight descriptors
>>>>>>>>>>>>> in a device specific way and set the STOP bit. This means, 
>>>>>>>>>>>>> if we have the
>>>>>>>>>>>>> device specific in-flight descriptor reporting facility, 
>>>>>>>>>>>>> the device can
>>>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>>>> - If we don't go with the basic device facility but using 
>>>>>>>>>>>>> the admin
>>>>>>>>>>>>> virtqueue specific method, we still need to clarify how it 
>>>>>>>>>>>>> works with the
>>>>>>>>>>>>> device status state machine, it will be some kind of 
>>>>>>>>>>>>> sub-states which looks
>>>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy 
>>>>>>>>>>>>>> waiting approach
>>>>>>>>>>>>>> extends downtime if implemented sequentially (stopping 
>>>>>>>>>>>>>> one device at a
>>>>>>>>>>>>>> time).
>>>>>>>>>>>>> Well, you need some kind of waiting for sure; the
>>>>>>>>>>>>> device/DMA needs some time to be stopped. The downtime is
>>>>>>>>>>>>> determined by the specific virtio implementation, which is
>>>>>>>>>>>>> hard to restrict at the spec level. We can clarify that
>>>>>>>>>>>>> the device must set the STOP bit within e.g. 100 ms.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>        It can be implemented concurrently (setting the 
>>>>>>>>>>>>>> STOP bit on all
>>>>>>>>>>>>>> devices and then looping until all their Device Status 
>>>>>>>>>>>>>> fields have the
>>>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires 
>>>>>>>>>>>>>> busy
>>>>>>>>>>>>>> waiting...
>>>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration 
>>>>>>>>>>>>> structure layout
>>>>>>>>>>>>>
>>>>>>>>>>>>> After writing 0 to device_status, the driver MUST wait for 
>>>>>>>>>>>>> a read of
>>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It was already required for at least one transport. We
>>>>>>>>>>>>> need to do something similar when introducing a basic
>>>>>>>>>>>>> facility.
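For what it's worth, the bounded polling the spec already requires can be written once and reused for a STOP bit; a sketch in C (the callback indirection and the STOP value of 32 are illustrative only, following this proposal, not the current spec):

```c
#include <stdint.h>
#include <stdbool.h>

#define VIRTIO_STATUS_STOP 32 /* bit value proposed in this series */

/*
 * Poll a device status register until the given bit is observed set,
 * giving up after max_polls reads.  The callback hides the transport
 * (PCI, MMIO, channel I/O), so the same loop serves reset and STOP.
 */
static bool wait_for_status_bit(uint8_t (*read_status)(void *dev),
                                void *dev, uint8_t bit,
                                unsigned int max_polls)
{
    while (max_polls--) {
        if (read_status(dev) & bit)
            return true;
        /* a real driver would cpu_relax() or sleep briefly here */
    }
    return false;
}
```

A VMM stopping many devices would set STOP on every device first and only then run this loop per device, so the individual waits overlap instead of adding up.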
>>>>>>>>>>>> Adding the STOP but as a Device Status bit is a small and 
>>>>>>>>>>>> clean VIRTIO
>>>>>>>>>>>> spec change. I like that.
>>>>>>>>>>>>
>>>>>>>>>>>> On the other hand, devices need time to stop and that time 
>>>>>>>>>>>> can be
>>>>>>>>>>>> unbounded. For example, software virtio-blk/scsi
>>>>>>>>>>>> implementations cannot immediately cancel in-flight I/O
>>>>>>>>>>>> requests on Linux hosts.
>>>>>>>>>>>>
>>>>>>>>>>>> The natural interface for long-running operations is 
>>>>>>>>>>>> virtqueue requests.
>>>>>>>>>>>> That's why I mentioned the alternative of using an admin 
>>>>>>>>>>>> virtqueue
>>>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>>>> So I'm not against the admin virtqueue. As said before, 
>>>>>>>>>>> admin virtqueue
>>>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>>>
>>>>>>>>>>> Send a command to set STOP status bit to admin virtqueue. 
>>>>>>>>>>> Device will make
>>>>>>>>>>> the command buffer used after it has successfully stopped 
>>>>>>>>>>> the device.
>>>>>>>>>>>
>>>>>>>>>>> AFAIK, they are not mutually exclusive, since they are 
>>>>>>>>>>> trying to solve
>>>>>>>>>>> different problems.
>>>>>>>>>>>
>>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>>
>>>>>>>>>>> Admin virtqueue - transport/device specific way to implement 
>>>>>>>>>>> (part of) the
>>>>>>>>>>> device facility
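To make the "carry the status bit over the admin virtqueue" idea concrete, a command could look roughly like the layout below; the opcode, field names, and sizes are entirely hypothetical, nothing like them exists in the spec:

```c
#include <stdint.h>

#define VIRTIO_ADMIN_CMD_SET_STATUS 0x1 /* hypothetical opcode */

/*
 * Hypothetical admin virtqueue command writing a device status value on
 * behalf of another function.  The device would only mark this buffer
 * used once the target function has actually stopped, so completion of
 * the buffer doubles as the "device has stopped" notification.
 */
struct virtio_admin_set_status {
    uint16_t command;     /* VIRTIO_ADMIN_CMD_SET_STATUS */
    uint16_t target_fn;   /* function whose status is written */
    uint8_t  status;      /* e.g. DRIVER_OK | STOP */
    uint8_t  reserved[3]; /* pad to 8 bytes */
};
```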
>>>>>>>>>>>
>>>>>>>>>>>> Although you mentioned that the stopped state needs to be 
>>>>>>>>>>>> reflected in
>>>>>>>>>>>> the Device Status field somehow, I'm not sure about that 
>>>>>>>>>>>> since the
>>>>>>>>>>>> driver typically doesn't need to know whether the device is 
>>>>>>>>>>>> being
>>>>>>>>>>>> migrated.
>>>>>>>>>>> The guest won't see the real device status bit. VMM will 
>>>>>>>>>>> shadow the device
>>>>>>>>>>> status bit in this case.
>>>>>>>>>>>
>>>>>>>>>>> E.g. with the current vhost-vDPA, vDPA behaves like a vhost
>>>>>>>>>>> device; the guest is unaware of the migration.
>>>>>>>>>>>
>>>>>>>>>>> STOP status bit is set by Qemu to real virtio hardware. But 
>>>>>>>>>>> guest will only
>>>>>>>>>>> see the DRIVER_OK without STOP.
>>>>>>>>>>>
>>>>>>>>>>> It's not hard to implement the nested case on top; see the
>>>>>>>>>>> discussion initiated by Eugenio about how to expose
>>>>>>>>>>> VIRTIO_F_STOP to the guest for nested live migration.
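The shadowing described here amounts to masking one bit on the guest-visible read path; a sketch (DRIVER_OK per the virtio spec, STOP = 32 as proposed in this series):

```c
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK 4
#define VIRTIO_STATUS_STOP      32 /* value proposed in this series */

/*
 * What the VMM returns for a guest read of device_status while it has
 * stopped the real device for migration: the STOP bit is masked out,
 * so the guest continues to see plain DRIVER_OK.
 */
static uint8_t shadow_status_read(uint8_t hw_status)
{
    return hw_status & (uint8_t)~VIRTIO_STATUS_STOP;
}
```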
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>       In fact, the VMM would need to hide this bit and it's 
>>>>>>>>>>>> safer to
>>>>>>>>>>>> keep it out-of-band instead of risking exposing it by 
>>>>>>>>>>>> accident.
>>>>>>>>>>> See above, VMM may choose to hide or expose the capability. 
>>>>>>>>>>> It's useful for
>>>>>>>>>>> migrating a nested guest.
>>>>>>>>>>>
>>>>>>>>>>> If we design an interface that can't be used in a nested
>>>>>>>>>>> environment, it's not an ideal interface.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> In addition, stateful devices need to load/save non-trivial 
>>>>>>>>>>>> amounts of
>>>>>>>>>>>> data. They need DMA to do this efficiently, so an admin 
>>>>>>>>>>>> virtqueue is a
>>>>>>>>>>>> good fit again.
>>>>>>>>>>> I don't get the point here. You still need to address
>>>>>>>>>>> exactly the same issues for an admin virtqueue: the
>>>>>>>>>>> unbounded time in freezing the 
>>>>>>>>>>> device, the
>>>>>>>>>>> interaction with the virtio device status state machine.
>>>>>>>>>> Device state can be large, so a register interface would 
>>>>>>>>>> be a
>>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>>> saving/loading device state.
>>>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>>>> You're right, not this patch. I mentioned it because your other 
>>>>>>>> patch
>>>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") 
>>>>>>>> implements
>>>>>>>> it as a register interface.
>>>>>>>>
>>>>>>>>> And DMA
>>>>>>>>> doesn't mean a virtqueue; it could be a transport-specific 
>>>>>>>>> method.
>>>>>>>> Yes, although virtqueues are a pretty good interface that works 
>>>>>>>> across
>>>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory 
>>>>>>>> layout.
>>>>>>>>
>>>>>>>>> I think we need to start from defining the state of one 
>>>>>>>>> specific device and
>>>>>>>>> see what is the best interface.
>>>>>>>> virtio-blk might be the simplest. I think virtio-net has more 
>>>>>>>> device
>>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>>
>>>>>>>> First we need agreement on whether "device state" encompasses 
>>>>>>>> the full
>>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>>
>>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>>> Some devices need implementation-specific state. They should 
>>>>>> still be
>>>>>> able to live migrate even if it means cross-implementation 
>>>>>> migration and
>>>>>> debug-ability is not possible.
>>>>>
>>>>> I think we need to re-visit this conclusion. Migration 
>>>>> compatibility is
>>>>> pretty important, especially considering that the software stack
>>>>> has spent a huge amount of effort maintaining it.
>>>>>
>>>>> If a virtio hardware implementation broke this, it would mean we
>>>>> lose all the advantages of being a standard device.
>>>>>
>>>>> If we can't do live migration among:
>>>>>
>>>>> 1) different backends, e.g. migrating from virtio hardware to virtio 
>>>>> software
>>>>> 2) different vendors
>>>>>
>>>>> We would fail to stay a standard device, and the customer would
>>>>> in fact be locked to the vendor implicitly.
>>>> My virtiofs device implementation is backed by an in-memory file 
>>>> system.
>>>> The device state includes the contents of each file.
>>>>
>>>> Your virtiofs device implementation uses Linux file handles to keep
>>>> track of open files. The device state includes Linux file handles (but
>>>> not the contents of each file) so the destination host can access the
>>>> same files on shared storage.
>>>>
>>>> Cornelia's virtiofs device implementation is backed by an object 
>>>> storage
>>>> HTTP API. The device state includes API object IDs.
>>>>
>>>> The device state is implementation-dependent. There is no standard
>>>> representation and it's not possible to migrate between device
>>>> implementations. How are they supposed to migrate?
>>>
>>>
>>> So if I understand correctly, virtio-fs is not designed to be
>>> migratable?
>>>
>>> (Having a look at the current virtio-fs support in qemu, it looks
>>> to me like it has a migration blocker).
>>>
>>>
>>>>
>>>> This is why I think it's necessarily to allow implementation-specific
>>>> device state representations.
>>>
>>>
>>> Or you probably mean you don't support cross backend migration. This 
>>> sounds like a drawback and it's actually not a standard device but a 
>>> vendor/implementation specific device.
>>>
>>> It would bring a lot of trouble, not only for the implementation 
>>> but for the management. Maybe we can start from adding the support 
>>> of migration for some specific backend and start from there.
>>>
>>>
>>>>
>>>>>>> 3) Not a proper uAPI design
>>>>>> I never understood this argument. The Linux uAPI passes through 
>>>>>> lots of
>>>>>> opaque data from devices to userspace. Allowing an
>>>>>> implementation-specific device state representation is nothing 
>>>>>> new. VFIO
>>>>>> already does it.
>>>>>
>>>>> I think we've already had a lot of discussion for VFIO but without a
>>>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure 
>>>>> if it's
>>>>> too late). But that's not related to virtio and this thread.
>>>>>
>>>>> What you propose here is in conflict with the efforts of virtio.
>>>>> I think we all agree that we should define the state in the spec. 
>>>>> Assuming
>>>>> this is correct:
>>>>>
>>>>> 1) why do we still offer opaque migration state to userspace?
>>>> See above. Stateful devices may require an implementation-defined 
>>>> device
>>>> state representation.
>>>
>>>
>>> So my point still stands: it's not a standard device if we do this.
>>>
>>>
>>>>
>>>>> 2) how can it be integrated into the current VMM (Qemu) virtio 
>>>>> devices'
>>>>> migration bytes stream?
>>>> Opaque data like D-Bus VMState:
>>>> https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
>>>>
>>>
>>>
>>> Actually, I meant how to keep the opaque state compatible with all
>>> the existing devices that can do migration.
>>>
>>> E.g. we want to live migrate virtio-blk among any backends (from a 
>>> hardware device to a software backend).
>>
>> I prefer we'll handle HW to SW migration in the future.
>
>
> Yes, that's very important and one of the key advantages of virtio.
>
>
>>
>> We're still debating on other basic stuff.
>>
>>>
>>>
>>>>
>>>>>>>> That's
>>>>>>>> basically the difference between the vhost/vDPA's selective 
>>>>>>>> passthrough
>>>>>>>> approach and VFIO's full passthrough approach.
>>>>>>> We can't do VFIO full passthrough for migration anyway; some
>>>>>>> kind of mdev is required, but it duplicates the current vp_vdpa
>>>>>>> driver.
>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>> achieved without mdev:
>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>      VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to 
>>>>>> implement
>>>>>>      the migration interface in hardware instead of an mdev driver.
>>>>>
>>>>> So I think it still depends on the driver to implement migration
>>>>> state, which is
>>>>> vendor specific.
>>>> The current VFIO migration interface depends on a device-specific
>>>> software mdev driver but here I'm showing that the physical device can
>>>> implement the migration interface so that no device-specific driver 
>>>> code
>>>> is needed.
>>>
>>>
>>> This is not what I read from the patch:
>>>
>>>  * device_state: (read/write)
>>>  *      - The user application writes to this field to inform the 
>>> vendor driver
>>>  *        about the device state to be transitioned to.
>>>  *      - The vendor driver should take the necessary actions to 
>>> change the
>>>  *        device state. After successful transition to a given 
>>> state, the
>>>  *        vendor driver should return success on write(device_state, 
>>> state)
>>>  *        system call. If the device state transition fails, the 
>>> vendor driver
>>>  *        should return an appropriate -errno for the fault condition.
>>>
>>> Vendor driver need to mediate between the uAPI and the actual device.
>>
>> We're building an infrastructure for VFIO PCI devices in the last few 
>> months.
>>
>> It should be merged hopefully to kernel 5.15.
>
>
> Ok.
>
>
>>
>>>
>>>
>>>>
>>>>> Note that it's just a uAPI definition, not something defined in 
>>>>> the PCI
>>>>> spec.
>>>> Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI 
>>>> into a
>>>> standard PCI Capability to eliminate the need for device-specific
>>>> drivers.
>>>
>>>
>>> Ok.
>>>
>>>
>>>>
>>>>> Out of curiosity, the patch was merged without any real users in
>>>>> Linux. This is very bad since we lose the chance to audit the
>>>>> whole design.
>>>> I agree. It would have helped to have a complete vision for how live
>>>> migration should work along with demos. I don't see any migration code
>>>> in samples/vfio-mdev/ :(.
>>>
>>>
>>> Right.
>>
>> Creating a standard is not related to Linux nor VFIO.
>
>
> I fully agree here.
>
>
>>
>> With the proposal that I've sent, we can develop a migration driver 
>> and virtio device that will support it (NVIDIA virtio-blk SNAP device).
>>
>> And you can build live migration support in virtio_vdpa driver (if 
>> VDPA migration protocol will be implemented).
>
>
> Right, vp_vdpa fits naturally for this. But I don't see much value 
> of a dedicated migration driver, do you?

I don't know how a vDPA device advertises migration capability to QEMU.


>
> Thanks
>
>
>>
>>
>>>
>>>
>>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>>      userspace or core VFIO PCI code advertises 
>>>>>> VFIO_REGION_TYPE_MIGRATION
>>>>>>      to userspace so migration can proceed in the same way as with
>>>>>>      VFIO/mdev drivers.
>>>>>> 3. The PCI Capability is not passed through to the guest.
>>>>>
>>>>> This brings troubles in the nested environment.
>>>> It depends on the device splitting/management design. If L0 wishes to
>>>> let L1 manage the VFs then it would need to expose a management 
>>>> device.
>>>> Since the migration interface is generic (not device-specific) a 
>>>> generic
>>>> management device solves this for all devices.
>>>
>>>
>>> Right, but it's a burden to expose the management device, or it
>>> may just not work.
>>>
>>> Thanks
>>>
>>>
>>>>
>>>> Stefan
>>>
>>
>


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21 11:43                                         ` Max Gurtovoy
@ 2021-07-22  2:01                                           ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-07-22  2:01 UTC (permalink / raw)
  To: Max Gurtovoy, Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic


On 2021/7/21 7:43 PM, Max Gurtovoy wrote:
>>
>>
>> Right, vp_vdpa fits naturally for this. But I don't see much value 
>> of a dedicated migration driver, do you?
>
> I don't know how a vDPA device advertises migration capability to QEMU.


This is done by vhost-vDPA via advertising the vhost feature 
VHOST_F_LOG_ALL; otherwise QEMU will block the migration.

vhost-vDPA inherits the vhost uAPI, so at the uAPI level we had:

1) VHOST_SET_VRING_BASE/VHOST_GET_VRING_BASE for syncing index
2) VHOST_SET_LOG_BASE for setting the address of the dirty page bitmap
3) VHOST_VDPA_SET_STATUS for setting the device status

Note that vhost-vDPA doesn't implement 2), since we're not sure the 
dirty page bitmap is the way to go for hardware; it would take just a 
few lines to introduce it at the vDPA bus level.
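Concretely, 1) is just the existing vhost ioctls; a minimal sketch of the avail-index round-trip (error handling trimmed; assumes an already-open and configured /dev/vhost-vdpa-N file descriptor):

```c
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Save the base (avail) index of virtqueue qindex on the source. */
static int save_vring_base(int fd, unsigned int qindex, unsigned int *base)
{
    struct vhost_vring_state s = { .index = qindex };

    if (ioctl(fd, VHOST_GET_VRING_BASE, &s) < 0)
        return -1;
    *base = s.num;
    return 0;
}

/* Restore it on the destination before DRIVER_OK. */
static int restore_vring_base(int fd, unsigned int qindex, unsigned int base)
{
    struct vhost_vring_state s = { .index = qindex, .num = base };

    return ioctl(fd, VHOST_SET_VRING_BASE, &s);
}
```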

So from the vDPA point of view, it's a vhost device, and the vDPA 
networking device is driven by the QEMU vhost-vDPA module.

Without VHOST_SET_LOG_BASE, Eugenio is working on the shadow virtqueue 
for tracking dirty pages in software to live migrate a vDPA device.

Assuming the VHOST_SET_LOG_BASE or other dirty logging mechanism is 
implemented, we can live migrate a networking vDPA device with the 
hardware assisted dirty page tracking.
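For reference, the vhost dirty log named in 2) is a plain bitmap with one bit per 4 KiB guest page; marking a page dirty reduces to the following (a sketch only: real implementations must use atomic bit operations, since the device and the VMM touch the log concurrently):

```c
#include <stdint.h>

#define VHOST_LOG_PAGE 4096 /* one log bit covers one 4 KiB page */

/* Set the bit for the page containing guest-physical address gpa. */
static void log_mark_dirty(uint8_t *log, uint64_t gpa)
{
    uint64_t page = gpa / VHOST_LOG_PAGE;

    log[page / 8] |= (uint8_t)(1u << (page % 8));
}
```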

For device state like filters and the MAC address, QEMU intercepts the 
control commands, so we don't need any API for querying those states 
from the device. The above three uAPIs are therefore sufficient for 
live migrating a virtio-net vDPA device, and migration compatibility 
is kept perfectly since this integrates seamlessly with the current 
vhost protocol.

When a device/virtio specific mechanism for device state 
synchronization is invented, we can introduce uAPIs for it.

Thanks


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21 10:42                                         ` Stefan Hajnoczi
@ 2021-07-22  2:08                                           ` Jason Wang
  2021-07-22 10:30                                             ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-22  2:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/21 6:42 PM, Stefan Hajnoczi wrote:
> On Wed, Jul 21, 2021 at 10:52:15AM +0800, Jason Wang wrote:
>> On 2021/7/20 6:19 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jul 20, 2021 at 11:02:42AM +0800, Jason Wang wrote:
>>>> On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>>>>>            If I understand correctly, this is all
>>>>>>>>>>>>>>>>>>>> driven from the driver inside the guest, so for this to work
>>>>>>>>>>>>>>>>>>>> the guest must be running and already have initialised the driver.
>>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long as
>>>>>>>>>>>>>>>>>> it intercepts the relevant configuration space (PCI, MMIO, etc) from
>>>>>>>>>>>>>>>>>> guest's reads and writes, and present it as coherent and transparent
>>>>>>>>>>>>>>>>>> for the guest. Some use cases I can imagine with a physical device (or
>>>>>>>>>>>>>>>>>> vp_vpda device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot stop
>>>>>>>>>>>>>>>>>> the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the device.
>>>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>>>>>>>>>> guest does not write to STOP in any moment of the LM. It resets the
>>>>>>>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the source
>>>>>>>>>>>>>>>>>> VMM migrates the device status. The destination VMM realizes the bit,
>>>>>>>>>>>>>>>>>> so it sets the bit in the destination too after device initialization.
>>>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it doesn't matter
>>>>>>>>>>>>>>>>>> what bit has the HW, but the VM can be migrated.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through though.
>>>>>>>>>>>>>>>>> It's also not easy for devices to implement.
>>>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes you think it's hard
>>>>>>>>>>>>>>>> to implement?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> E.g. for a networking device, it should be sufficient to use this bit + the
>>>>>>>>>>>>>>>> virtqueue state.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>>>>> It uses common techniques like register shadowing without any
>>>>>>>>>>>>>>>> further machinery.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (I think we all know migration will be very hard if we simply pass through
>>>>>>>>>>>>>>>> those state registers).
>>>>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>>>>>>>>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - This proposal has nothing to do with admin virtqueue. Actually, admin
>>>>>>>>>>>>>> virtqueue could be used for carrying any basic device facility like status
>>>>>>>>>>>>>> bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
>>>>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>>>>> - Even if we had introduced admin virtqueue, we still need a per function
>>>>>>>>>>>>>>> interface for this. This is a must for nested virtualization; we can't
>>>>>>>>>>>>>>> always expect things like the PF to be assignable to the L1 guest.
>>>>>>>>>>>>>> - According to the proposal, there's no need for the device to complete all
>>>>>>>>>>>>>> the consumed buffers, device can choose to expose those inflight descriptors
>>>>>>>>>>>>>> in a device specific way and set the STOP bit. This means, if we have the
>>>>>>>>>>>>>> device specific in-flight descriptor reporting facility, the device can
>>>>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>>>>> - If we don't go with the basic device facility but using the admin
>>>>>>>>>>>>>> virtqueue specific method, we still need to clarify how it works with the
>>>>>>>>>>>>>> device status state machine, it will be some kind of sub-states which looks
>>>>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When migrating a guest with many VIRTIO devices a busy waiting approach
>>>>>>>>>>>>>>> extends downtime if implemented sequentially (stopping one device at a
>>>>>>>>>>>>>>> time).
>>>>>>>>>>>>>> Well, you need some kind of waiting for sure; the device/DMA needs some time
>>>>>>>>>>>>>> to be stopped. The downtime is determined by the specific virtio
>>>>>>>>>>>>>> implementation, which is hard to restrict at the spec level. We can
>>>>>>>>>>>>>> clarify that the device must set the STOP bit within e.g. 100 ms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>          It can be implemented concurrently (setting the STOP bit on all
>>>>>>>>>>>>>>> devices and then looping until all their Device Status fields have the
>>>>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>>>>> waiting...
>>>>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> After writing 0 to device_status, the driver MUST wait for a read of
>>>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It was already required for at least one transport. We need to do
>>>>>>>>>>>>>> something similar when introducing a basic facility.
>>>>>>>>>>>>> Adding the STOP but as a Device Status bit is a small and clean VIRTIO
>>>>>>>>>>>>> spec change. I like that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>>>>>>> unbounded. For example, software virtio-blk/scsi implementations
>>>>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The natural interface for long-running operations is virtqueue requests.
>>>>>>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>>>>> So I'm not against the admin virtqueue. As said before, admin virtqueue
>>>>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>>>>
>>>>>>>>>>>> Send a command to set STOP status bit to admin virtqueue. Device will make
>>>>>>>>>>>> the command buffer used after it has successfully stopped the device.
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK, they are not mutually exclusive, since they are trying to solve
>>>>>>>>>>>> different problems.
>>>>>>>>>>>>
>>>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>>>
>>>>>>>>>>>> Admin virtqueue - transport/device specific way to implement (part of) the
>>>>>>>>>>>> device facility
>>>>>>>>>>>>
>>>>>>>>>>>>> Although you mentioned that the stopped state needs to be reflected in
>>>>>>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>>>>>>> migrated.
>>>>>>>>>>>> The guest won't see the real device status bit. VMM will shadow the device
>>>>>>>>>>>> status bit in this case.
>>>>>>>>>>>>
>>>>>>>>>>>> E.g. with the current vhost-vDPA, vDPA behaves like a vhost device; the
>>>>>>>>>>>> guest is unaware of the migration.
>>>>>>>>>>>>
>>>>>>>>>>>> STOP status bit is set by Qemu to real virtio hardware. But guest will only
>>>>>>>>>>>> see the DRIVER_OK without STOP.
>>>>>>>>>>>>
>>>>>>>>>>>> It's not hard to implement the nested case on top; see the discussion
>>>>>>>>>>>> initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for
>>>>>>>>>>>> nested live migration.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>         In fact, the VMM would need to hide this bit and it's safer to
>>>>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>>>>> See above, VMM may choose to hide or expose the capability. It's useful for
>>>>>>>>>>>> migrating a nested guest.
>>>>>>>>>>>>
>>>>>>>>>>>> If we design an interface that can't be used in a nested environment,
>>>>>>>>>>>> it's not an ideal interface.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> In addition, stateful devices need to load/save non-trivial amounts of
>>>>>>>>>>>>> data. They need DMA to do this efficiently, so an admin virtqueue is a
>>>>>>>>>>>>> good fit again.
>>>>>>>>>>>> I don't get the point here. You still need to address exactly the same
>>>>>>>>>>>> issues for an admin virtqueue: the unbounded time in freezing the device,
>>>>>>>>>>>> the interaction with the virtio device status state machine.
>>>>>>>>>>> Device state can be large, so a register interface would be a
>>>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>>>> saving/loading device state.
>>>>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>>>>> You're right, not this patch. I mentioned it because your other patch
>>>>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
>>>>>>>>> it as a register interface.
>>>>>>>>>
>>>>>>>>>> And DMA
>>>>>>>>>> doesn't mean a virtqueue; it could be a transport-specific method.
>>>>>>>>> Yes, although virtqueues are a pretty good interface that works across
>>>>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>>>>>>
>>>>>>>>>> I think we need to start from defining the state of one specific device and
>>>>>>>>>> see what is the best interface.
>>>>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>>>
>>>>>>>>> First we need agreement on whether "device state" encompasses the full
>>>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>>>
>>>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>>>> Some devices need implementation-specific state. They should still be
>>>>>>> able to live migrate even if it means cross-implementation migration and
>>>>>>> debug-ability is not possible.
>>>>>> I think we need to re-visit this conclusion. Migration compatibility is
>>>>>> pretty important, especially considering the software stack has spent a
>>>>>> huge amount of effort maintaining it.
>>>>>>
>>>>>> If virtio hardware were to break this, we would lose all the
>>>>>> advantages of being a standard device.
>>>>>>
>>>>>> If we can't do live migration among:
>>>>>>
>>>>>> 1) different backends, e.g. migrating from virtio hardware to a software
>>>>>> backend
>>>>>> 2) different vendors
>>>>>>
>>>>>> we fail to qualify as a standard device, and the customer is in fact
>>>>>> implicitly locked in by the vendor.
>>>>> My virtiofs device implementation is backed by an in-memory file system.
>>>>> The device state includes the contents of each file.
>>>>>
>>>>> Your virtiofs device implementation uses Linux file handles to keep
>>>>> track of open files. The device state includes Linux file handles (but
>>>>> not the contents of each file) so the destination host can access the
>>>>> same files on shared storage.
>>>>>
>>>>> Cornelia's virtiofs device implementation is backed by an object storage
>>>>> HTTP API. The device state includes API object IDs.
>>>>>
>>>>> The device state is implementation-dependent. There is no standard
>>>>> representation and it's not possible to migrate between device
>>>>> implementations. How are they supposed to migrate?
>>>> So if I understand correctly, virtio-fs is not designed to be migratable?
>>>>
>>>> (Checking the current virtio-fs support in QEMU, it looks to me like it
>>>> has a migration blocker.)
>>> The code does not support live migration but it's on the roadmap. Max
>>> Reitz added Linux file handle support to virtiofsd. That was the first
>>> step towards being able to migrate the device's state.
>>
>> A dumb question: how does QEMU know it is connected to virtiofsd?
> virtiofsd is a vhost-user-fs device. QEMU doesn't know if it's connected
> to virtiofsd or another implementation.


That's my understanding. So this basically answers my question: there
could be a common migration implementation for every virtio-fs device,
which implies that we only need to migrate the common device-specific
state, not the implementation-specific state.
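
To make the distinction concrete, here is a rough sketch (illustrative Python, not virtiofsd code; the backend class and its methods are made up for this example) of migrating only the driver-visible open-file table while each backend re-derives its own internal handle state:

```python
# Sketch: the driver-visible virtio-fs state is the table of file
# handles the guest holds; how a backend maps a handle back to an open
# file (Linux file handle, in-memory inode, object ID) stays internal.

class PassthroughBackend:
    """Hypothetical backend; open() stands in for open_by_handle_at()."""
    def __init__(self):
        self.open_files = {}            # fh -> backend-internal object

    def open(self, fh, path):
        self.open_files[fh] = ("linux-handle", path)

    def save_common_state(self):
        # Only the driver-visible part: which fh numbers map to which files.
        return {fh: path for fh, (_, path) in self.open_files.items()}

    def load_common_state(self, table):
        for fh, path in table.items():
            self.open(fh, path)         # backend re-derives internal state

src = PassthroughBackend()
src.open(1, "/shared/a.txt")
dst = PassthroughBackend()
dst.load_common_state(src.save_common_state())
```

A different backend could implement `load_common_state()` by looking up object IDs instead; the transferred table stays the same.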


>
>>>>> This is why I think it's necessary to allow implementation-specific
>>>>> device state representations.
>>>> Or perhaps you mean you don't support cross-backend migration. That sounds
>>>> like a drawback, and then it's actually not a standard device but a
>>>> vendor/implementation specific device.
>>>>
>>>> It would bring a lot of trouble, not only for the implementation but also
>>>> for management. Maybe we can add migration support for some specific
>>>> backend first and go from there.
>>> Yes, it's complicated. Some implementations could be compatible, but
>>> others can never be compatible because they have completely different
>>> state.
>>>
>>> The virtiofsd implementation is the main one for virtiofs and the device
>>> state representation can be published, even standardized. Others can
>>> implement it to achieve migration compatibility.
>>>
>>> But it must be possible for implementations that have completely
>>> different state to migrate too. virtiofsd isn't special.
>>>
>>>>>>>> 3) Not a proper uAPI design
>>>>>>> I never understood this argument. The Linux uAPI passes through lots of
>>>>>>> opaque data from devices to userspace. Allowing an
>>>>>>> implementation-specific device state representation is nothing new. VFIO
>>>>>>> already does it.
>>>>>> I think we've already had a lot of discussion about VFIO but without a
>>>>>> conclusion. Maybe we need a verdict from Linus or Greg (not sure if it's
>>>>>> too late). But that's not related to virtio or this thread.
>>>>>>
>>>>>> What you propose here conflicts with the efforts of virtio. I think we
>>>>>> all agree that we should define the state in the spec. Assuming this is
>>>>>> correct:
>>>>>>
>>>>>> 1) why do we still offer opaque migration state to userspace?
>>>>> See above. Stateful devices may require an implementation-defined device
>>>>> state representation.
>>>> So my point still stands: it's not a standard device if we do this.
>>> These "non-standard devices" still need to be able to migrate.
>>
>> See other thread, it breaks the effort of having a spec.
>>
>>>    How
>>> should we do that?
>>
>> I think the main issue is that, to me, it's not a virtio device but a
>> device that uses virtqueues with implementation-specific state. So it
>> can't be migrated by the virtio subsystem but only through a
>> vendor/implementation specific migration driver.
> Okay. Are you thinking about a separate set of vDPA APIs and vhost
> ioctls so the VMM can save/load implementation-specific device state?


Probably not. I think the question is: can we define the virtio-fs device
state in the spec? If yes (and I think the answer is yes), we're fine.
If not, it looks like we need to improve the spec or the design.
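
For the sake of argument, the driver-visible virtio-fs state one might try to enumerate in the spec could look roughly like this (a hypothetical sketch in Python; none of these field names come from the spec or this series, except the per-virtqueue indexes from patch 1):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class OpenFile:
    fh: int          # file handle returned to the driver by FUSE OPEN
    nodeid: int      # FUSE node ID the handle refers to
    flags: int       # open flags the driver requested

@dataclass
class VirtiofsDeviceState:
    """Driver-visible state only; backend internals are excluded."""
    negotiated_features: int = 0
    open_files: List[OpenFile] = field(default_factory=list)
    # per-virtqueue (avail_idx, used_idx), as in patch 1 of this series
    vq_state: Dict[int, Tuple[int, int]] = field(default_factory=dict)

state = VirtiofsDeviceState(negotiated_features=0x1)
state.open_files.append(OpenFile(fh=1, nodeid=42, flags=0))
state.vq_state[0] = (5, 5)
```

Whether a list like this can be made complete for every backend is exactly the open question in this thread.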


> These separate APIs just need to be called as part of the standard
> VIRTIO stop and vq save/load lifecycle.


Yes; if they are to be part of the virtio standard, we need to invent them.
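
To make that lifecycle concrete, here is a rough model (a minimal Python sketch, not driver code; the status bit values follow the spec, plus STOP=32 from this proposal, and everything else is illustrative):

```python
# Model of the save/restore flow proposed in this series: the source
# sets STOP, waits for the device to acknowledge it, then reads
# per-virtqueue state; the destination writes that state back after
# FEATURES_OK but before DRIVER_OK.

ACKNOWLEDGE, DRIVER, DRIVER_OK, FEATURES_OK, STOP = 1, 2, 4, 8, 32

class ModelDevice:
    def __init__(self):
        self.status = 0
        self.vq_state = {0: {"avail_idx": 0, "used_idx": 0}}

    def write_status(self, status):
        # A real device may need time to quiesce DMA before latching
        # STOP; this model stops instantly.
        self.status = status

    def read_status(self):
        return self.status

def save(dev):
    dev.write_status(dev.read_status() | STOP)
    while not dev.read_status() & STOP:   # the busy-wait debated above
        pass
    return dict(dev.vq_state)

def restore(dev, saved):
    dev.write_status(ACKNOWLEDGE | DRIVER | FEATURES_OK)
    dev.vq_state = dict(saved)            # only legal before DRIVER_OK
    dev.write_status(dev.read_status() | DRIVER_OK)

src, dst = ModelDevice(), ModelDevice()
src.vq_state[0]["avail_idx"] = 5
state = save(src)
restore(dst, state)
```

The loop in `save()` is where the bounded-stop-time question (e.g. "must set STOP within 100ms") bites.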


>
>>>>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>>>>> migration bytes stream?
>>>>> Opaque data like D-Bus VMState:
>>>>> https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
>>>> Actually, I meant how to keep the opaque state compatible with all
>>>> the existing devices that can do migration.
>>>>
>>>> E.g. we want to live migrate virtio-blk among arbitrary backends (from a
>>>> hardware device to a software backend).
>>> There was a series of email threads last year where migration
>>> compatibility was discussed:
>>>
>>> https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg02620.html
>>>
>>> I proposed an algorithm for checking migration compatibility between
>>> devices. The source and destination device can have different
>>> implementations (e.g. hardware, software, etc).
>>>
>>> It involves picking an identifier like virtio-spec.org/pci/virtio-net
>>> for the device state representation and device parameters for aspects of
>>> the device that vary between instances (e.g. tso=on|off).
>>>
>>> It's more complex than today's live migration approach in libvirt and
>>> QEMU. Today libvirt configures the source and destination in a
>>> compatible manner (thanks to knowledge of the device implementation) and
>>> then QEMU transfers the device state.
>>>
>>> Part of the point of defining a migration compatibility algorithm is
>>> that it's possible to lift the assumptions out of libvirt so that
>>> arbitrary device implementations can be supported (hardware, software,
>>> etc) without putting knowledge about every device/VMM implementation
>>> into libvirt.
>>>
>>> (The other advantage is that this allows orchestration software to
>>> determine migration compatibility before starting a migration.)
>>
>> This looks like another independent issue, and I fully agree we should have
>> a better migration protocol. But using that means we break migration
>> compatibility with existing devices, which have been in use for more than
>> 10 years. We still need to make migration from/to existing virtio devices
>> work.
> I agree that migrating to/from existing devices needs to work. It should
> be possible to transition without breaking migration.
>
>>>>>>>>> That's
>>>>>>>>> basically the difference between the vhost/vDPA's selective passthrough
>>>>>>>>> approach and VFIO's full passthrough approach.
>>>>>>>> We can't do VFIO full passthrough for migration anyway; some kind of
>>>>>>>> mdev is required, but that duplicates the current vp_vdpa driver.
>>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>>> achieved without mdev:
>>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>>        VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>>>>>        the migration interface in hardware instead of an mdev driver.
>>>>>> So I think it still depends on the driver to implement migration state,
>>>>>> which is vendor specific.
>>>>> The current VFIO migration interface depends on a device-specific
>>>>> software mdev driver but here I'm showing that the physical device can
>>>>> implement the migration interface so that no device-specific driver code
>>>>> is needed.
>>>> This is not what I read from the patch:
>>>>
   * device_state: (read/write)
   *      - The user application writes to this field to inform the vendor
   *        driver about the device state to be transitioned to.
   *      - The vendor driver should take the necessary actions to change the
   *        device state. After successful transition to a given state, the
   *        vendor driver should return success on write(device_state, state)
   *        system call. If the device state transition fails, the vendor
   *        driver should return an appropriate -errno for the fault condition.
>>>>
>>>> Vendor driver need to mediate between the uAPI and the actual device.
>>> Yes, that's the current state of VFIO migration. If a hardware interface
>>> (e.g. PCI Capability) is defined that maps to this API then no
>>> device-specific drivers would be necessary because core VFIO PCI code
>>> can implement the uAPI by talking to the hardware.
>>
>> As we discussed, it would be very hard. The device state is implementation
>> specific and may not fit in a Capability. (PCIe already has VF migration
>> state in the SR-IOV extended capability.)
>>
>>
>>>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>>>        userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
>>>>>>>        to userspace so migration can proceed in the same way as with
>>>>>>>        VFIO/mdev drivers.
>>>>>>> 3. The PCI Capability is not passed through to the guest.
>>>>>> This brings troubles in the nested environment.
>>>>> It depends on the device splitting/management design. If L0 wishes to
>>>>> let L1 manage the VFs then it would need to expose a management device.
>>>>> Since the migration interface is generic (not device-specific) a generic
>>>>> management device solves this for all devices.
>>>> Right, but it's a burden to expose the management device, or it may simply
>>>> not work.
>>> A single generic management device is not a huge burden and it may turn
>>> out that keeping the management device out-of-band is actually a
>>> desirable feature if the device owner does not wish to expose the
>>> stop/save/load functionality for some reason.
>>
>> VMMs are free to hide those features from the guest. Management can just do
>> -device virtio-pci,state=false
>>
>> Having a management device works for L0 but is not suitable for L(x>0). A
>> per-function device interface is a must for nested virt to work in a simple
>> and easy way.
> You are right, a per function interface is simplest. I'm not experienced
> enough with SR-IOV and nested virtualization to have a strong opinion in
> this area.


Yes, and they can co-exist: the admin virtqueue works for L0, but we need
to expose this via a per-function API for the nested case.

That's why I started by proposing the basic facility instead of an
actual transport (PCI or admin virtqueue) implementation.
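
The shadowing mentioned earlier (QEMU sets STOP on the real device while the guest keeps seeing DRIVER_OK without STOP) could look roughly like this (illustrative Python, not actual vDPA or QEMU code):

```python
DRIVER_OK = 4   # status bit value from the spec
STOP = 32       # status bit value from this proposal

class HardwareDevice:
    """Stands in for the real (e.g. vDPA) device's status register."""
    def __init__(self):
        self.status = 0

class ShadowingVMM:
    """VMM that hides the STOP bit from a guest that wasn't offered it."""
    def __init__(self, hw):
        self.hw = hw

    def guest_read_status(self):
        # Mask out STOP so the guest never observes the migration.
        return self.hw.status & ~STOP

    def start_migration(self):
        self.hw.status |= STOP

hw = HardwareDevice()
hw.status = DRIVER_OK
vmm = ShadowingVMM(hw)
vmm.start_migration()
```

For the nested case, the VMM would instead pass VIRTIO_F_STOP through and forward the bit rather than mask it.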

Thanks



>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-21 10:20                                           ` Stefan Hajnoczi
@ 2021-07-22  7:33                                             ` Jason Wang
  2021-07-22 10:24                                               ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-22  7:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/21 6:20 PM, Stefan Hajnoczi wrote:
> On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
>> On 2021/7/20 4:50 PM, Stefan Hajnoczi wrote:
>>> On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
>>>> On 2021/7/19 8:45 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/16 10:03 AM, Jason Wang wrote:
>>>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>>>>>> If I understand correctly, this is all driven from the driver
>>>>>>>>>>>>>>>>>>>>> inside the guest, so for this to work the guest must be running
>>>>>>>>>>>>>>>>>>>>> and already have initialised the driver.
>>>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long
>>>>>>>>>>>>>>>>>>> as it intercepts the relevant configuration space (PCI, MMIO, etc)
>>>>>>>>>>>>>>>>>>> from the guest's reads and writes, and presents it as coherent and
>>>>>>>>>>>>>>>>>>> transparent to the guest. Some use cases I can imagine with a
>>>>>>>>>>>>>>>>>>> physical device (or vp_vdpa device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot
>>>>>>>>>>>>>>>>>>> stop the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the
>>>>>>>>>>>>>>>>>>> device.
>>>>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>>>>>>>>>>> guest does not write to STOP at any moment of the LM. It resets the
>>>>>>>>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the
>>>>>>>>>>>>>>>>>>> source VMM migrates the device status. The destination VMM realizes
>>>>>>>>>>>>>>>>>>> the bit, so it sets the bit in the destination too after device
>>>>>>>>>>>>>>>>>>> initialization.
>>>>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest, so it doesn't
>>>>>>>>>>>>>>>>>>> matter what bit the HW has, but the VM can be migrated.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through
>>>>>>>>>>>>>>>>>> though. It's also not easy for devices to implement.
>>>>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes you think it's
>>>>>>>>>>>>>>>>> hard to implement?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> E.g. for a networking device, it should be sufficient to use this
>>>>>>>>>>>>>>>>> bit + the virtqueue state.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>>>>>> It uses common technology like register shadowing without any
>>>>>>>>>>>>>>>>> further stuff.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (I think we all know migration will be very hard if we simply pass
>>>>>>>>>>>>>>>>> through those state registers.)
>>>>>>>>>>>>>>>> If an admin virtqueue is used
>>>>>>>>>>>>>>>> instead of the STOP Device Status
>>>>>>>>>>>>>>>> field
>>>>>>>>>>>>>>>> bit then there's no need to re-read
>>>>>>>>>>>>>>>> the Device Status field in a loop
>>>>>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>>>>>> Probably not. Let me to clarify several points:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - This proposal has nothing to do with
>>>>>>>>>>>>>>> admin virtqueue. Actually, admin
>>>>>>>>>>>>>>> virtqueue could be used for carrying any
>>>>>>>>>>>>>>> basic device facility like status
>>>>>>>>>>>>>>> bit. E.g I'm going to post patches that
>>>>>>>>>>>>>>> use admin virtqueue as a "transport"
>>>>>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>>>>>> - Even if we had introduced admin
>>>>>>>>>>>>>>> virtqueue, we still need a per function
>>>>>>>>>>>>>>> interface for this. This is a must for
>>>>>>>>>>>>>>> nested virtualization, we can't
>>>>>>>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>>>>>>>> - According to the proposal, there's no
>>>>>>>>>>>>>>> need for the device to complete all
>>>>>>>>>>>>>>> the consumed buffers, device can choose
>>>>>>>>>>>>>>> to expose those inflight descriptors
>>>>>>>>>>>>>>> in a device specific way and set the
>>>>>>>>>>>>>>> STOP bit. This means, if we have the
>>>>>>>>>>>>>>> device specific in-flight descriptor
>>>>>>>>>>>>>>> reporting facility, the device can
>>>>>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>>>>>> - If we don't go with the basic device facility but use the admin
>>>>>>>>>>>>>>> virtqueue specific method, we still need to clarify how it works with
>>>>>>>>>>>>>>> the device status state machine; it will involve some kind of
>>>>>>>>>>>>>>> sub-states, which looks much more complicated than the current
>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When migrating a guest with many
>>>>>>>>>>>>>>>> VIRTIO devices a busy waiting
>>>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>>>> extends downtime if implemented
>>>>>>>>>>>>>>>> sequentially (stopping one device at
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> time).
>>>>>>>>>>>>>>> Well, you need some kind of waiting for sure; the device/DMA needs
>>>>>>>>>>>>>>> some time to be stopped. The downtime is determined by a specific
>>>>>>>>>>>>>>> virtio implementation, which is hard to restrict at the spec level.
>>>>>>>>>>>>>>> We can clarify that the device must set the STOP bit within e.g. 100ms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          It can be implemented
>>>>>>>>>>>>>>>> concurrently (setting the STOP bit
>>>>>>>>>>>>>>>> on all
>>>>>>>>>>>>>>>> devices and then looping until all
>>>>>>>>>>>>>>>> their Device Status fields have the
>>>>>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>>>>>> waiting...
>>>>>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common
>>>>>>>>>>>>>>> configuration structure layout
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After writing 0 to device_status, the
>>>>>>>>>>>>>>> driver MUST wait for a read of
>>>>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since it was required for at least one transport, we need to do
>>>>>>>>>>>>>>> something similar when introducing this basic facility.
>>>>>>>>>>>>>> Adding STOP as a Device Status bit is a small and clean VIRTIO spec
>>>>>>>>>>>>>> change. I like that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>>>>>>>> unbounded. For example, software virtio-blk/scsi implementations
>>>>>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The natural interface for long-running
>>>>>>>>>>>>>> operations is virtqueue requests.
>>>>>>>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>>>>>> So I'm not against the admin virtqueue. As said
>>>>>>>>>>>>> before, admin virtqueue
>>>>>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Send a command to set the STOP status bit via the admin virtqueue. The
>>>>>>>>>>>>> device will mark the command buffer used after it has successfully
>>>>>>>>>>>>> stopped the device.
>>>>>>>>>>>>>
>>>>>>>>>>>>> AFAIK, they are not mutually exclusive, since
>>>>>>>>>>>>> they are trying to solve
>>>>>>>>>>>>> different problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>>>>
>>>>>>>>>>>>> Admin virtqueue - transport/device specific way
>>>>>>>>>>>>> to implement (part of) the
>>>>>>>>>>>>> device facility
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Although you mentioned that the stopped
>>>>>>>>>>>>>> state needs to be reflected in
>>>>>>>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>>>>>>>> migrated.
>>>>>>>>>>>>> The guest won't see the real device status bit.
>>>>>>>>>>>>> VMM will shadow the device
>>>>>>>>>>>>> status bit in this case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost
>>>>>>>>>>>>> device; the guest is unaware of the migration.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The STOP status bit is set by QEMU on the real virtio hardware, but the
>>>>>>>>>>>>> guest will only see DRIVER_OK without STOP.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's not hard to implement the nested case on top; see the discussion
>>>>>>>>>>>>> initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for
>>>>>>>>>>>>> nested live migration.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>         In fact, the VMM would need to hide
>>>>>>>>>>>>>> this bit and it's safer to
>>>>>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>>>>>> See above, VMM may choose to hide or expose the
>>>>>>>>>>>>> capability. It's useful for
>>>>>>>>>>>>> migrating a nested guest.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we design an interface that can't be used in the nested
>>>>>>>>>>>>> environment, it's not an ideal interface.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition, stateful devices need to
>>>>>>>>>>>>>> load/save non-trivial amounts of
>>>>>>>>>>>>>> data. They need DMA to do this efficiently,
>>>>>>>>>>>>>> so an admin virtqueue is a
>>>>>>>>>>>>>> good fit again.
>>>>>>>>>>>>> I don't get the point here. You still need to address very similar
>>>>>>>>>>>>> issues for the admin virtqueue: the unbounded time in freezing the
>>>>>>>>>>>>> device, and the interaction with the virtio device status state
>>>>>>>>>>>>> machine.
>>>>>>>>>>>> Device state can be large, so a register interface would be a
>>>>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>>>>> saving/loading device state.
>>>>>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>>>>>> You're right, not this patch. I mentioned it because your other patch
>>>>>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
>>>>>>>>>> implements it as a register interface.
>>>>>>>>>>
>>>>>>>>>>> And DMA doesn't mean a virtqueue; it could be a transport specific
>>>>>>>>>>> method.
>>>>>>>>>> Yes, although virtqueues are a pretty good interface that works across
>>>>>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>>>>>>>
>>>>>>>>>>> I think we need to start from defining the state of one
>>>>>>>>>>> specific device and
>>>>>>>>>>> see what is the best interface.
>>>>>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>>>>
>>>>>>>>>> First we need agreement on whether "device state" encompasses the full
>>>>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>>>>
>>>>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>>>>> Some devices need implementation-specific state. They should still be
>>>>>>>> able to live migrate even if it means cross-implementation migration and
>>>>>>>> debug-ability is not possible.
>>>>>>> I think we need to re-visit this conclusion. Migration compatibility is
>>>>>>> pretty important, especially considering the software stack has spent a
>>>>>>> huge amount of effort maintaining it.
>>>>>>>
>>>>>>> If virtio hardware were to break this, we would lose all the advantages
>>>>>>> of being a standard device.
>>>>>>>
>>>>>>> If we can't do live migration among:
>>>>>>>
>>>>>>> 1) different backends, e.g. migrating from virtio hardware to a software
>>>>>>> backend
>>>>>>> 2) different vendors
>>>>>>>
>>>>>>> we fail to qualify as a standard device, and the customer is in fact
>>>>>>> implicitly locked in by the vendor.
>>>>>>>
>>>>>>>
>>>>>>>>> 3) Not a proper uAPI design
>>>>>>>> I never understood this argument. The Linux uAPI passes through lots of
>>>>>>>> opaque data from devices to userspace. Allowing an
>>>>>>>> implementation-specific device state representation is nothing new. VFIO
>>>>>>>> already does it.
>>>>>>> I think we've already had a lot of discussion about VFIO but without a
>>>>>>> conclusion. Maybe we need a verdict from Linus or Greg (not sure if
>>>>>>> it's too late). But that's not related to virtio or this thread.
>>>>>>>
>>>>>>> What you propose here conflicts with the efforts of virtio. I think we
>>>>>>> all agree that we should define the state in the spec. Assuming this is
>>>>>>> correct:
>>>>>>>
>>>>>>> 1) why do we still offer opaque migration state to userspace?
>>>>>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>>>>>> migration bytes stream?
>>>>>>>
>>>>>>> We should standardize everything that is visible to the driver, to be a
>>>>>>> standard device. That's the power of virtio.
>>>>>>>
>>>>>>>
>>>>>>>>>> That's
>>>>>>>>>> basically the difference between the vhost/vDPA's selective
>>>>>>>>>> passthrough
>>>>>>>>>> approach and VFIO's full passthrough approach.
>>>>>>>>> We can't do VFIO full passthrough for migration anyway; some kind of
>>>>>>>>> mdev is required, but that duplicates the current vp_vdpa driver.
>>>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>>>> achieved without mdev:
>>>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>>>        VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>>>>>>        the migration interface in hardware instead of an mdev driver.
>>>>>>> So I think it still depends on the driver to implement migration state,
>>>>>>> which is vendor specific.
>>>>>>>
>>>>>>> Note that it's just an uAPI definition not something defined in the PCI
>>>>>>> spec.
>>>>>>>
>>>>>>> Out of curiosity, the patch was merged without any real users in Linux.
>>>>>>> This is very bad since we lose the chance to audit the whole design.
>>>>>>>
>>>>>>>
>>>>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>>>>        userspace or core VFIO PCI code advertises
>>>>>>>> VFIO_REGION_TYPE_MIGRATION
>>>>>>>>        to userspace so migration can proceed in the same way as with
>>>>>>>>        VFIO/mdev drivers.
>>>>>>>> 3. The PCI Capability is not passed through to the guest.
>>>>>>> This brings troubles in the nested environment.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> Changpeng Liu originally mentioned the idea of defining a migration PCI
>>>>>>>> Capability.
>>>>>>>>
>>>>>>>>>>       For example, some of the
>>>>>>>>>> virtio-net state is available to the VMM with vhost/vDPA because it
>>>>>>>>>> intercepts the virtio-net control virtqueue.
>>>>>>>>>>
>>>>>>>>>> Also, we need to decide to what degree the device state representation
>>>>>>>>>> is standardized in the VIRTIO specification.
>>>>>>>>> I think all the states must be defined in the spec otherwise the device
>>>>>>>>> can't claim it supports migration at virtio level.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>       I think it makes sense to
>>>>>>>>>> standardize it if it's possible to convey all necessary
>>>>>>>>>> state and device
>>>>>>>>>> implementors can easily implement this device state representation.
>>>>>>>>> I doubt it; it's highly device specific. E.g. can we standardize
>>>>>>>>> device (GPU) memory?
>>>>>>>> For devices that have little internal state it's possible to define a
>>>>>>>> standard device state representation.
>>>>>>>>
>>>>>>>> For other devices, like virtio-crypto, virtio-fs, etc it becomes
>>>>>>>> difficult because the device implementation contains state that will be
>>>>>>>> needed but is very specific to the implementation. These devices *are*
>>>>>>>> migratable but they don't have standard state. Even here there is a
>>>>>>>> spectrum:
>>>>>>>> - Host OS-specific state (e.g. Linux struct file_handles)
>>>>>>>> - Library-specific state (e.g. crypto library state)
>>>>>>>> - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
>>>>>>>>
>>>>>>>> This is why I think it's necessary to support both standard device state
>>>>>>>> representations and implementation-specific device state
>>>>>>>> representations.
>>>>>> Having two ways will bring extra complexity. That's why I suggest:
>>>>>>
>>>>>> - to have a general facility for the virtqueue to be migrated
>>>>>> - to leave the device specific state to be device specific, so the device
>>>>>> can choose whatever way or interface is convenient.
>>>>> I don't think we have a choice. For stateful devices it can be
>>>>> impossible to define a standard device state representation.
>>>> Let me clarify: I agree we can't have a standard device state for all
>>>> kinds of devices.
>>>>
>>>> That's why I tend to leave them to be device specific (but not
>>>> implementation specific).
>>> Unfortunately device state is sometimes implementation-specific. Not
>>> because the device is proprietary, but because the actual state is
>>> meaningless to other implementations.
>>>
>>> I mentioned virtiofs as an example where file system backends can be
>>> implemented in completely different ways so the device state cannot be
>>> migrated between implementations.
>>
>> So let me clarify my understanding, we have two kinds of states:
>>
>> 1) implementation specific state that is not noticeable by the driver
>> 2) device specific state that is noticeable by the driver
>>
>> We are not interested in 1).
>>
>> For 2) it's what needs to be defined in the spec. If we fail to generalize
>> the device specific state, it can't be used by a standard virtio driver. Or
>> maybe you can give a concrete example of how virtio-fs fails in doing this?
> 2) is what I mean when I say a "stateful" device. I agree, 1) is not
> relevant to this discussion because we don't need to migrate internal
> device state that the driver cannot interact with.
>
> The virtiofs device has an OPEN request for opening a file. Live
> migration must transfer the list of open files from the source to the
> destination device so the driver can continue accessing files it
> previously had open.
>
> However, the result of the OPEN request is a number similar to a POSIX
> fd, not the full device-internal state associated with an open file.
> After migration the driver expects to continue using the number to
> operate on the file. We must transfer the open file state to the
> destination device.
>
> Different device implementations may have completely different concepts
> of open file state:
>
> - An in-memory file system. The list of open files is a list of
>    in-memory inodes. We'll also need to transfer the entire contents of
>    the files/directories since it's in-memory and not shared with the
>    destination device.
>
> - A passthrough Linux file system. We need to transfer the Linux file
>    handles (see open_by_handle_at(2)) so the destination device can open
>    the inodes on the underlying shared host file system.
>
> - A distributed object storage API. We need to transfer the list of
>    open object IDs so the destination device can perform I/O to the same
>    objects.
>
> - Anyone can create a custom virtiofs device implementation and it will
>    rely on different open file state.


So it looks to me that you want to propose migration drivers for different 
implementations. I think it's better to go with something simpler:

1) Have a common driver-visible state defined in the spec, and use it 
for migration.
2) It's the job of the device or backend to "map" the driver-visible 
state to its implementation-specific state.

If 1) is insufficient, we should extend it until it satisfies 2).

For virtio-fs, it looks like the issue is that the implementation needs 
to associate objects in different namespaces (guest vs host).

For the above cases:

1) For the in-memory file system, if I understand correctly, it can be 
accessed directly by the guest via a transport-specific way (e.g. a BAR). 
In this case it's driver-noticeable state, so all the memory must be 
migrated to the destination.
2) For the passthrough case, it's the implementation's job to do the 
mapping. I think it doesn't differ much from the current migration among 
shared storage for block devices. (Note that open_by_handle_at() requires 
CAP_DAC_READ_SEARCH, which I'm not sure can be used.)
3) For distributed storage, the implementation should implement the 
association between the driver object and the distributed storage object 
(I think each such API should have something like a UUID for a global 
namespace) and provide a way to do the reverse lookup.
4) For other implementations, it should be the same as 3).

It's up to the management layer to decide whether we can do 
cross-implementation migration, but QEMU should migrate the common 
driver-visible state instead of implementation-specific state.

We can have a dedicated feature flag for this and block migration for 
devices without this feature.

I think we don't want to end up with several migration drivers for 
virtio-fs.
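To make the "map" idea above concrete, here is a minimal sketch in C. All names here (fs_saved_state, backend_reopen, etc.) are hypothetical and not from the spec or any virtio-fs implementation: the driver-visible part of the state is just the set of open node IDs the driver holds, and each backend keeps a private table mapping a node ID to whatever implementation handle it needs, rebuilding that table on the destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_OPEN 16

/* Spec-definable, driver-visible state: the IDs the driver holds. */
struct fs_saved_state {
    uint64_t open_nodes[MAX_OPEN];
    size_t   nopen;
};

/* Implementation-specific side: one backend handle per open node.
 * For a passthrough backend this could wrap a Linux file handle;
 * for an object store, an object ID. Opaque to the driver. */
struct fs_backend {
    uint64_t node_id[MAX_OPEN];
    int      impl_handle[MAX_OPEN];   /* implementation specific */
    size_t   nopen;
};

/* Hypothetical backend hook: reopen a node by its driver-visible ID. */
static int backend_reopen(uint64_t node_id)
{
    return (int)(node_id + 1000);     /* stand-in for a real open */
}

/* Save: emit only the driver-visible state. */
static void fs_save(const struct fs_backend *b, struct fs_saved_state *s)
{
    s->nopen = b->nopen;
    memcpy(s->open_nodes, b->node_id, b->nopen * sizeof(uint64_t));
}

/* Restore: the destination backend "maps" each driver-visible ID
 * back to its own implementation handle. */
static void fs_restore(struct fs_backend *b, const struct fs_saved_state *s)
{
    b->nopen = s->nopen;
    for (size_t i = 0; i < s->nopen; i++) {
        b->node_id[i] = s->open_nodes[i];
        b->impl_handle[i] = backend_reopen(s->open_nodes[i]);
    }
}
```

The point of the sketch is that fs_saved_state is the only thing that crosses implementations; impl_handle never leaves the device.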


> I imagine virtio-gpu and virtio-crypto might have similar situations
> where an object created through a virtqueue request has device-internal
> state associated with it that must be migrated.


So the point still stands. The device-internal state should be restored 
from the device-specific state defined in the spec. The spec would 
guarantee the minimal set of device-specific state, and implementations 
should use it to restore their implementation-specific state.
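As an illustration of such minimal spec-defined state, the virtqueue state in this series boils down to two indices for a split virtqueue. A rough sketch, with field and function names that are illustrative rather than the spec's:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal driver-visible virtqueue state for a split virtqueue:
 * the available index, plus used_idx, which is needed separately
 * since the used ring is read-only for the driver. */
struct vq_state {
    uint16_t avail_idx; /* next available index the device will read */
    uint16_t used_idx;  /* next used index the device will write */
};

/* A toy stand-in for the device-side virtqueue bookkeeping. */
struct virtqueue {
    uint16_t last_avail_idx;
    uint16_t used_idx;
    int      stopped;
};

/* Valid only once the device has set the STOP status bit. */
static struct vq_state vq_save(const struct virtqueue *vq)
{
    struct vq_state s = { vq->last_avail_idx, vq->used_idx };
    return s;
}

/* Applied on the destination after FEATURES_OK but before DRIVER_OK. */
static void vq_restore(struct virtqueue *vq, struct vq_state s)
{
    vq->last_avail_idx = s.avail_idx;
    vq->used_idx       = s.used_idx;
}
```

Everything beyond these indices (e.g. in-flight descriptors) is the device-specific part that the spec would have to define per device type.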


>
>>>> But we can generalize the virtqueue state for sure.
>>> I agree and also that some device types can standardize their device
>>> state representations. But I think it's a technical requirement to
>>> support implementation-specific state for device types where
>>> cross-implementation migration is not possible.
>>
>> A question here: if the driver depends on implementation-specific state,
>> how can we make sure that driver works with other implementations? If we're
>> sure that a single driver can work for all kinds of implementations, it
>> means we have device-specific state, not implementation state.
> I think this is confusing stateless and stateful devices. You are
> describing a stateless device here. I'll try to define a stateful
> device:
>
> A stateful device maintains state that the driver operates on indirectly
> via standard requests. For example, the virtio-crypto device has
> CREATE_SESSION requests and a session ID is returned to the driver so
> further requests can be made on the session object. It may not be
> possible to replay, reconnect, or restart the device without losing
> state.


If it's impossible to do any of these, that state must be noticeable 
by the driver, so it is not implementation specific but device 
specific, and it can be defined in the spec.
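Taking the virtio-crypto example above: the driver-noticeable part of a session is the session ID plus the parameters the driver supplied at CREATE_SESSION. A sketch of restoring implementation sessions from that spec-definable state (crypto_session_state, impl_create_ctx, etc. are hypothetical names, not from the spec):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SESSIONS 8

/* Driver-noticeable session state that could be defined in the spec:
 * the ID the driver holds plus the parameters it supplied. */
struct crypto_session_state {
    uint64_t session_id;
    uint32_t algo;
    uint8_t  key[32];
    uint32_t keylen;
};

struct crypto_dev {
    struct crypto_session_state sess[MAX_SESSIONS];
    void  *impl_ctx[MAX_SESSIONS];  /* implementation-specific context */
    size_t nsess;
};

/* Hypothetical library hook recreating internal context from the
 * spec-defined parameters; a real backend would call its crypto
 * library here. */
static void *impl_create_ctx(uint32_t algo)
{
    (void)algo;
    static int placeholder;
    return &placeholder;
}

/* Restore: recreate each implementation context and reinstall the
 * same session IDs so the driver's handles remain valid. */
static void crypto_restore(struct crypto_dev *d,
                           const struct crypto_session_state *s, size_t n)
{
    d->nsess = n;
    for (size_t i = 0; i < n; i++) {
        d->sess[i] = s[i];
        d->impl_ctx[i] = impl_create_ctx(s[i].algo);
    }
}
```

Here the session table is device-specific state that a standard driver relies on, while impl_ctx stays private to the implementation.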


>
> I hope that this description, together with the virtiofs specifics
> above, make the problem clearer.


Yes.


>
>>> I'm not saying the implementation-specific state representation has to
>>> be a binary blob. There could be an identifier registry to ensure live
>>> migration compatibility checks can be performed. There could also be a
>>> standard binary encoding for migration data.
>>
>> Yes, such requirements have been well studied in the past. There should be
>> plenty of protocols to do this.
>>
>>
>>>    But the contents will be
>>> implementation-specific for some devices.
>>
>> If we allow this, it breaks the spec effort of having standard devices.
>> And it will block real customers.
> If we forbid this then devices for which migration is technically
> possible will be unmigratable. Both users and device implementors will
> find other solutions, like VFIO, so I don't think we can stop them even
> if we tried.


The difference is standard device vs vendor-specific device. We can't 
avoid implementation-specific state for vendor-specific hardware.

But for standard devices, we don't want to end up with migration drivers. 
(Do we want vendors to ship vendor-specific migration drivers for NVMe 
devices?)


>
> I recognize that opaque device state poses a risk to migration
> compatibility, because device implementors may arbitrarily use opaque
> state when a standard is available.
>
> However, the way to avoid this scenario is by:
>
> 1. Making the standard migration approach the easiest to implement
>     because everything has been taken care of. It will save implementors
>     the headache of defining and coding their own device state
>     representations and versioning.
>
> 2. Educate users about migration compatibility so they can identify
>     implementors are locking in their users.


For vendor-specific devices, this may work. But for standard devices like 
virtio, we should go further.

The device states should be defined clearly in the spec. We should 
re-visit the design if those states contain anything that is 
implementation specific.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-22  7:33                                             ` Jason Wang
@ 2021-07-22 10:24                                               ` Stefan Hajnoczi
  2021-07-22 13:08                                                 ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-22 10:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

[-- Attachment #1: Type: text/plain, Size: 35384 bytes --]

On Thu, Jul 22, 2021 at 03:33:10PM +0800, Jason Wang wrote:
> 
> 在 2021/7/21 下午6:20, Stefan Hajnoczi 写道:
> > On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
> > > 在 2021/7/20 下午4:50, Stefan Hajnoczi 写道:
> > > > On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
> > > > > 在 2021/7/19 下午8:45, Stefan Hajnoczi 写道:
> > > > > > On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/16 上午10:03, Jason Wang 写道:
> > > > > > > > 在 2021/7/15 下午6:01, Stefan Hajnoczi 写道:
> > > > > > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > > > > > 在 2021/7/14 下午11:07, Stefan Hajnoczi 写道:
> > > > > > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > > > > > 在 2021/7/14 下午5:53, Stefan Hajnoczi 写道:
> > > > > > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > 在 2021/7/13 下午6:00, Stefan Hajnoczi 写道:
> > > > > > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > 在 2021/7/12 下午5:57, Stefan Hajnoczi 写道:
> > > > > > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > > > 在 2021/7/11 上午4:36, Michael S. Tsirkin 写道:
> > > > > > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > > > > > > > >            If I understand correctly, this is all driven from the
> > > > > > > > > > > > > > > > > > > > > >            driver inside the guest, so for this to work the guest
> > > > > > > > > > > > > > > > > > > > > >            must be running and already have initialised the driver.
> > > > > > > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > > > > > > > > > > it intercept the relevant configuration space (PCI, MMIO, etc) from
> > > > > > > > > > > > > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > > > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device
> > > > > > > > > > > > > > > > > > > > (or vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot
> > > > > > > > > > > > > > > > > > > > stop the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the
> > > > > > > > > > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > > > > > > > > > > VMM migrates the device status. The destination VMM realizes the bit,
> > > > > > > > > > > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > > > > > > > > It just requires a new status
> > > > > > > > > > > > > > > > > > bit. Anything that makes you
> > > > > > > > > > > > > > > > > > think it's hard
> > > > > > > > > > > > > > > > > > to implement?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > E.g for networking device, it
> > > > > > > > > > > > > > > > > > should be sufficient to use this
> > > > > > > > > > > > > > > > > > bit + the
> > > > > > > > > > > > > > > > > > virtqueue state.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Why don't we design the
> > > > > > > > > > > > > > > > > > > feature in a way that is
> > > > > > > > > > > > > > > > > > > useable by VMMs
> > > > > > > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > > > > > > It uses common technology
> > > > > > > > > > > > > > > > > > like register shadowing without
> > > > > > > > > > > > > > > > > > any further
> > > > > > > > > > > > > > > > > > stuffs.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > (I think we all know migration
> > > > > > > > > > > > > > > > > > will be very hard if we simply
> > > > > > > > > > > > > > > > > > pass through
> > > > > > > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > > > > > > If an admin virtqueue is used
> > > > > > > > > > > > > > > > > instead of the STOP Device Status
> > > > > > > > > > > > > > > > > field
> > > > > > > > > > > > > > > > > bit then there's no need to re-read
> > > > > > > > > > > > > > > > > the Device Status field in a loop
> > > > > > > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > > > > > > Probably not. Let me to clarify several points:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > - This proposal has nothing to do with
> > > > > > > > > > > > > > > > admin virtqueue. Actually, admin
> > > > > > > > > > > > > > > > virtqueue could be used for carrying any
> > > > > > > > > > > > > > > > basic device facility like status
> > > > > > > > > > > > > > > > bit. E.g I'm going to post patches that
> > > > > > > > > > > > > > > > use admin virtqueue as a "transport"
> > > > > > > > > > > > > > > > for device slicing at virtio level.
> > > > > > > > > > > > > > > > - Even if we had introduced admin
> > > > > > > > > > > > > > > > virtqueue, we still need a per function
> > > > > > > > > > > > > > > > interface for this. This is a must for
> > > > > > > > > > > > > > > > nested virtualization, we can't
> > > > > > > > > > > > > > > > always expect things like PF can be assigned to L1 guest.
> > > > > > > > > > > > > > > > - According to the proposal, there's no
> > > > > > > > > > > > > > > > need for the device to complete all
> > > > > > > > > > > > > > > > the consumed buffers, device can choose
> > > > > > > > > > > > > > > > to expose those inflight descriptors
> > > > > > > > > > > > > > > > in a device specific way and set the
> > > > > > > > > > > > > > > > STOP bit. This means, if we have the
> > > > > > > > > > > > > > > > device specific in-flight descriptor
> > > > > > > > > > > > > > > > reporting facility, the device can
> > > > > > > > > > > > > > > > almost set the STOP bit immediately.
> > > > > > > > > > > > > > > > - If we don't go with the basic device
> > > > > > > > > > > > > > > > facility but using the admin
> > > > > > > > > > > > > > > > virtqueue specific method, we still need
> > > > > > > > > > > > > > > > to clarify how it works with the
> > > > > > > > > > > > > > > > device status state machine, it will be
> > > > > > > > > > > > > > > > some kind of sub-states which looks
> > > > > > > > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > When migrating a guest with many
> > > > > > > > > > > > > > > > > VIRTIO devices a busy waiting
> > > > > > > > > > > > > > > > > approach
> > > > > > > > > > > > > > > > > extends downtime if implemented
> > > > > > > > > > > > > > > > > sequentially (stopping one device at
> > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > time).
> > > > > > > > > > > > > > > > Well. You need some kinds of waiting for
> > > > > > > > > > > > > > > > sure, the device/DMA needs sometime
> > > > > > > > > > > > > > > > to be stopped. The downtime is determined by a specific virtio
> > > > > > > > > > > > > > > > implementation which is hard to be
> > > > > > > > > > > > > > > > restricted at the spec level. We can
> > > > > > > > > > > > > > > > clarify that the device must set the STOP bit in e.g 100ms.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > >          It can be implemented
> > > > > > > > > > > > > > > > > concurrently (setting the STOP bit
> > > > > > > > > > > > > > > > > on all
> > > > > > > > > > > > > > > > > devices and then looping until all
> > > > > > > > > > > > > > > > > their Device Status fields have the
> > > > > > > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > > > > > > I still don't get what kind of complexity did you worry here.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > > > > > > waiting...
> > > > > > > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common
> > > > > > > > > > > > > > > > configuration structure layout
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > After writing 0 to device_status, the
> > > > > > > > > > > > > > > > driver MUST wait for a read of
> > > > > > > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Since it was required for at least one
> > > > > > > > > > > > > > > > transport. We need do something
> > > > > > > > > > > > > > > > similar to when introducing basic facility.
> > > > > > > > > > > > > > > Adding the STOP but as a Device Status bit
> > > > > > > > > > > > > > > is a small and clean VIRTIO
> > > > > > > > > > > > > > > spec change. I like that.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > > > > > > unbounded. For example, software
> > > > > > > > > > > > > > > virtio-blk/scsi implementations since
> > > > > > > > > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The natural interface for long-running
> > > > > > > > > > > > > > > operations is virtqueue requests.
> > > > > > > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > > > > > > instead of a Device Status bit.
> > > > > > > > > > > > > > So I'm not against the admin virtqueue. As said
> > > > > > > > > > > > > > before, admin virtqueue
> > > > > > > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Send a command to set STOP status bit to admin
> > > > > > > > > > > > > > virtqueue. Device will make
> > > > > > > > > > > > > > the command buffer used after it has
> > > > > > > > > > > > > > successfully stopped the device.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > AFAIK, they are not mutually exclusive, since
> > > > > > > > > > > > > > they are trying to solve
> > > > > > > > > > > > > > different problems.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Device status - basic device facility
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Admin virtqueue - transport/device specific way
> > > > > > > > > > > > > > to implement (part of) the
> > > > > > > > > > > > > > device facility
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Although you mentioned that the stopped
> > > > > > > > > > > > > > > state needs to be reflected in
> > > > > > > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > > > > > > migrated.
> > > > > > > > > > > > > > The guest won't see the real device status bit.
> > > > > > > > > > > > > > VMM will shadow the device
> > > > > > > > > > > > > > status bit in this case.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > E.g with the current vhost-vDPA, vDPA behave
> > > > > > > > > > > > > > like a vhost device, guest is
> > > > > > > > > > > > > > unaware of the migration.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > STOP status bit is set by Qemu to real virtio
> > > > > > > > > > > > > > hardware. But guest will only
> > > > > > > > > > > > > > see the DRIVER_OK without STOP.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > It's not hard to implement the nested on top,
> > > > > > > > > > > > > > see the discussion initiated
> > > > > > > > > > > > > > by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
> > > > > > > > > > > > > > migration.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > >         In fact, the VMM would need to hide
> > > > > > > > > > > > > > > this bit and it's safer to
> > > > > > > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > > > > > > See above, VMM may choose to hide or expose the
> > > > > > > > > > > > > > capability. It's useful for
> > > > > > > > > > > > > > migrating a nested guest.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If we design an interface that can be used in
> > > > > > > > > > > > > > the nested environment, it's
> > > > > > > > > > > > > > not an ideal interface.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In addition, stateful devices need to
> > > > > > > > > > > > > > > load/save non-trivial amounts of
> > > > > > > > > > > > > > > data. They need DMA to do this efficiently,
> > > > > > > > > > > > > > > so an admin virtqueue is a
> > > > > > > > > > > > > > > good fit again.
> > > > > > > > > > > > > > I don't get the point here. You still need to
> > > > > > > > > > > > > > address the exact the similar
> > > > > > > > > > > > > > issues for admin virtqueue: the unbound time in
> > > > > > > > > > > > > > freezing the device, the
> > > > > > > > > > > > > > interaction with the virtio device status state machine.
> > > > > > > > > > > > > Device state state can be large so a register interface would be a
> > > > > > > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > > > > > > saving/loading device state.
> > > > > > > > > > > > So this patch doesn't mandate a register interface, isn't it?
> > > > > > > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
> > > > > > > > > > > implements
> > > > > > > > > > > it a register interface.
> > > > > > > > > > > 
> > > > > > > > > > > > And DMA
> > > > > > > > > > > > doesn't means a virtqueue, it could be a transport specific method.
> > > > > > > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > > > > > > > 
> > > > > > > > > > > > I think we need to start from defining the state of one
> > > > > > > > > > > > specific device and
> > > > > > > > > > > > see what is the best interface.
> > > > > > > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > > > > > > state and virtio-scsi is definitely more complext than virtio-blk.
> > > > > > > > > > > 
> > > > > > > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > > > > > > state of the device or just state that is unknown to the VMM.
> > > > > > > > > > I think we've discussed this in the past. It can't work since:
> > > > > > > > > > 
> > > > > > > > > > 1) The state and its format must be clearly defined in the spec
> > > > > > > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > > > > > > Some devices need implementation-specific state. They should still be
> > > > > > > > > able to live migrate even if it means cross-implementation migration and
> > > > > > > > > debug-ability is not possible.
> > > > > > > > I think we need to re-visit this conclusion. Migration compatibility is
> > > > > > > > pretty important, especially consider the software stack has spent a
> > > > > > > > huge amount of effort in maintaining them.
> > > > > > > > 
> > > > > > > > Say a virtio hardware would break this, this mean we will lose all the
> > > > > > > > advantages of being a standard device.
> > > > > > > > 
> > > > > > > > If we can't do live migration among:
> > > > > > > > 
> > > > > > > > 1) different backends, e.g migrate from virtio hardware to migrate
> > > > > > > > software
> > > > > > > > 2) different vendors
> > > > > > > > 
> > > > > > > > We failed to say as a standard device and the customer is in fact locked
> > > > > > > > by the vendor implicitly.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > 3) Not a proper uAPI design
> > > > > > > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > > > > > > opaque data from devices to userspace. Allowing an
> > > > > > > > > implementation-specific device state representation is nothing new. VFIO
> > > > > > > > > already does it.
> > > > > > > > I think we've already had a lots of discussion for VFIO but without a
> > > > > > > > conclusion. Maybe we need the verdict from Linus or Greg (not sure if
> > > > > > > > it's too late). But that's not related to virtio and this thread.
> > > > > > > > 
> > > > > > > > What you propose here is kind of conflict with the efforts of virtio. I
> > > > > > > > think we all agree that we should define the state in the spec.
> > > > > > > > Assuming this is correct:
> > > > > > > > 
> > > > > > > > 1) why do we still offer opaque migration state to userspace?
> > > > > > > > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > > > > > > > migration bytes stream?
> > > > > > > > 
> > > > > > > > We should standardize everything that is visible by the driver to be a
> > > > > > > > standard device. That's the power of virtio.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > > That's
> > > > > > > > > > > basically the difference between the vhost/vDPA's selective
> > > > > > > > > > > passthrough
> > > > > > > > > > > approach and VFIO's full passthrough approach.
> > > > > > > > > > We can't do VFIO full passthrough for migration anyway, some kind
> > > > > > > > > > of mdev is
> > > > > > > > > > required but it's duplicated with the current vp_vdpa driver.
> > > > > > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > > > > > achieved without mdev:
> > > > > > > > > 1. Define a migration PCI Capability that indicates support for
> > > > > > > > >        VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > > > > > > > >        the migration interface in hardware instead of an mdev driver.
> > > > > > > > So I think it still depends on the driver to implement migration state
> > > > > > > > which is vendor specific.
> > > > > > > > 
> > > > > > > > Note that it's just an uAPI definition not something defined in the PCI
> > > > > > > > spec.
> > > > > > > > 
> > > > > > > > Out of curiosity, the patch is merged without any real users in the
> > > > > > > > Linux. This is very bad since we lose the chance to audit the whole
> > > > > > > > design.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 2. The VMM either uses the migration PCI Capability directly from
> > > > > > > > >        userspace or core VFIO PCI code advertises
> > > > > > > > > VFIO_REGION_TYPE_MIGRATION
> > > > > > > > >        to userspace so migration can proceed in the same way as with
> > > > > > > > >        VFIO/mdev drivers.
> > > > > > > > > 3. The PCI Capability is not passed through to the guest.
> > > > > > > > This brings troubles in the nested environment.
> > > > > > > > 
> > > > > > > > Thanks
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Changpeng Liu originally mentioned the idea of defining a migration PCI
> > > > > > > > > Capability.
> > > > > > > > > 
> > > > > > > > > > >       For example, some of the
> > > > > > > > > > > virtio-net state is available to the VMM with vhost/vDPA because it
> > > > > > > > > > > intercepts the virtio-net control virtqueue.
> > > > > > > > > > > 
> > > > > > > > > > > Also, we need to decide to what degree the device state representation
> > > > > > > > > > > is standardized in the VIRTIO specification.
> > > > > > > > > > I think all the states must be defined in the spec otherwise the device
> > > > > > > > > > can't claim it supports migration at virtio level.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > >       I think it makes sense to
> > > > > > > > > > > standardize it if it's possible to convey all necessary
> > > > > > > > > > > state and device
> > > > > > > > > > > implementors can easily implement this device state representation.
> > > > > > > > > > I doubt it; it's highly device specific. E.g. can we standardize device (GPU)
> > > > > > > > > > memory?
> > > > > > > > > For devices that have little internal state it's possible to define a
> > > > > > > > > standard device state representation.
> > > > > > > > > 
> > > > > > > > > For other devices, like virtio-crypto, virtio-fs, etc it becomes
> > > > > > > > > difficult because the device implementation contains state that will be
> > > > > > > > > needed but is very specific to the implementation. These devices *are*
> > > > > > > > > migratable but they don't have standard state. Even here there is a
> > > > > > > > > spectrum:
> > > > > > > > > - Host OS-specific state (e.g. Linux struct file_handles)
> > > > > > > > > - Library-specific state (e.g. crypto library state)
> > > > > > > > > - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
> > > > > > > > > 
> > > > > > > > > This is why I think it's necessary to support both standard device state
> > > > > > > > > representations and implementation-specific device state
> > > > > > > > > representations.
> > > > > > > Having two ways will bring extra complexity. That's why I suggest:
> > > > > > > 
> > > > > > > - to have a general facility for the virtqueue to be migrated
> > > > > > > - to leave the device specific state device specific, so the device can
> > > > > > > choose whatever way or interface is convenient.
> > > > > > I don't think we have a choice. For stateful devices it can be
> > > > > > impossible to define a standard device state representation.
> > > > > Let me clarify, I agree we can't have a standard device state for all kinds
> > > > > of device.
> > > > > 
> > > > > That's why I tend to leave them device specific (but not
> > > > > implementation specific).
> > > > Unfortunately device state is sometimes implementation-specific. Not
> > > > because the device is proprietary, but because the actual state is
> > > > meaningless to other implementations.
> > > > 
> > > > I mentioned virtiofs as an example where file system backends can be
> > > > implemented in completely different ways so the device state cannot be
> > > > migrated between implementations.
> > > 
> > > So let me clarify my understanding: we have two kinds of state:
> > > 
> > > 1) implementation specific state that is not noticeable by the driver
> > > 2) device specific state that is noticeable by the driver
> > > 
> > > We have no interest in 1).
> > > 
> > > 2) is what needs to be defined in the spec. If we fail to generalize
> > > the device specific state, it can't be used by a standard virtio driver. Or
> > > maybe you can give a concrete example of how virtio-fs fails at this?
> > 2) is what I mean when I say a "stateful" device. I agree, 1) is not
> > relevant to this discussion because we don't need to migrate internal
> > device state that the driver cannot interact with.
> > 
> > The virtiofs device has an OPEN request for opening a file. Live
> > migration must transfer the list of open files from the source to the
> > destination device so the driver can continue accessing files it
> > previously had open.
> > 
> > However, the result of the OPEN request is a number similar to a POSIX
> > fd, not the full device-internal state associated with an open file.
> > After migration the driver expects to continue using the number to
> > operate on the file. We must transfer the open file state to the
> > destination device.
> > 
> > Different device implementations may have completely different concepts
> > of open file state:
> > 
> > - An in-memory file system. The list of open files is a list of
> >    in-memory inodes. We'll also need to transfer the entire contents of
> >    the files/directories since it's in-memory and not shared with the
> >    destination device.
> > 
> > - A passthrough Linux file system. We need to transfer the Linux file
> >    handles (see open_by_handle_at(2)) so the destination device can open
> >    the inodes on the underlying shared host file system.
> > 
> > - A distributed object storage API. We need to transfer the list of
> >    open object IDs so the destination device can perform I/O to the same
> >    objects.
> > 
> > - Anyone can create a custom virtiofs device implementation and it will
> >    rely on different open file state.
> 
> 
> So it looks to me you want to propose migration drivers for different
> implementations. It looks to me it's better to go with something
> easier:
> 
> 1) Have a common driver visible state defined in the spec, and use it
> for migration
> 2) It's the responsibility of the device or backend to "map" the driver visible
> state to its implementation specific state.
> 
> If 1) is insufficient, we should extend it until it satisfies 2)
> 
> For virtio-fs, it looks like the issue is that the implementation needs to
> associate objects in different namespaces (guest vs host).
> 
> For the above cases:
> 
> 1) For the in-memory file system, if I understand correctly, it can be
> accessed directly by the guest in a transport specific way (e.g. a BAR). In this
> case, it's driver noticeable state, so all the memory must be migrated to the
> destination.

The virtiofs DAX Window feature is optional. Also, the DAX Window BAR
only exposes file contents, not file system metadata. The guest driver
or the BAR contents are not enough to migrate the state of the device.

> 2) For the passthrough case, it's the implementation's responsibility to do the
> mapping. I think it doesn't differ a lot from the current migration of block
> devices on shared storage. (Note that open_by_handle_at() requires
> CAP_DAC_READ_SEARCH, which I'm not sure can be used.)

It's different from block devices because the FUSE protocol
is stateful. Commands like LOOKUP, OPEN, OPENDIR create objects that are
referred to by a temporary number (similar to POSIX fds). Due to POSIX
file system semantics there's no way to recreate the mapping correctly
without Linux file handles (because files can be deleted, renamed, moved
while they are still open or their inode is known). Shared block devices
do not have this issue.
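To make the file-handle approach concrete, here is a minimal sketch (not taken from virtiofsd) of how a source device could capture per-open-file state for a migration stream using the Linux name_to_handle_at(2) API; a destination on shared storage would later call open_by_handle_at(2), which does require CAP_DAC_READ_SEARCH. The struct layout and the `fuse_nodeid` field name are illustrative assumptions, not part of any spec.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

#define MAX_HANDLE_BYTES 128

/* Illustrative serialized open-file state for a migration stream:
 * the guest-visible FUSE node ID plus the Linux file handle that lets
 * the destination reopen the same inode on shared storage. */
struct open_file_state {
    uint64_t fuse_nodeid;   /* the number the driver keeps using */
    int mount_id;           /* filled in by name_to_handle_at() */
    unsigned char handle[sizeof(struct file_handle) + MAX_HANDLE_BYTES];
};

/* Source side: capture a file handle for a path the guest has open.
 * Returns 0 on success, -1 on error (e.g. a filesystem without file
 * handle support). No privileges are needed on this side; only
 * open_by_handle_at() on the destination needs CAP_DAC_READ_SEARCH. */
int save_open_file(const char *path, uint64_t nodeid,
                   struct open_file_state *out)
{
    struct file_handle *fh = (struct file_handle *)out->handle;

    fh->handle_bytes = MAX_HANDLE_BYTES;
    if (name_to_handle_at(AT_FDCWD, path, fh, &out->mount_id, 0) < 0)
        return -1;
    out->fuse_nodeid = nodeid;
    return 0;
}
```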

> 3) For distributed storage, the implementation should implement the
> association between driver objects and distributed storage objects (I think
> each such API should have something like a UUID for a global namespace) and
> provide a way for reverse lookup.
> 4) For other implementations, it should be the same as 3)
> 
> It's up to the management layer to decide whether we can do cross implementation
> migration, but QEMU should migrate the common driver visible state instead
> of implementation specific state.
> 
> We can have a dedicated feature flag for this and block the migration for
> the devices without this feature.
> 
> I think we don't want to end up with several migration drivers for
> virtio-fs.
>
> > I imagine virtio-gpu and virtio-crypto might have similar situations
> > where an object created through a virtqueue request has device-internal
> > state associated with it that must be migrated.
> 
> 
> So the point still stands. The device internal state should be restored from
> the device specific state defined in the spec. The spec would guarantee the
> minimal device specific state; the implementation should use that
> to restore its implementation specific state.
> 
> 
> > 
> > > > > But we can generalize the virtqueue state for sure.
> > > > I agree and also that some device types can standardize their device
> > > > state representations. But I think it's a technical requirement to
> > > > support implementation-specific state for device types where
> > > > cross-implementation migration is not possible.
> > > 
> > > A question here: if the driver depends on implementation specific state,
> > > how can we make sure that the driver works with other implementations? If we're
> > > sure that a single driver can work with all kinds of implementations, it
> > > means we have device specific state, not implementation specific state.
> > I think this is confusing stateless and stateful devices. You are
> > describing a stateless device here. I'll try to define a stateful
> > device:
> > 
> > A stateful device maintains state that the driver operates on indirectly
> > via standard requests. For example, the virtio-crypto device has
> > CREATE_SESSION requests and a session ID is returned to the driver so
> > further requests can be made on the session object. It may not be
> > possible to replay, reconnect, or restart the device without losing
> > state.
> 
> 
> If it's impossible to do any of these, the state should be noticeable by
> the driver, and it is not implementation specific but device specific,
> which could be defined in the spec.
> 
> 
> > 
> > I hope that this description, together with the virtiofs specifics
> > above, make the problem clearer.
> 
> 
> Yes.
> 
> 
> > 
> > > > I'm not saying the implementation-specific state representation has to
> > > > be a binary blob. There could be an identifier registry to ensure live
> > > > migration compatibility checks can be performed. There could also be a
> > > > standard binary encoding for migration data.
> > > 
> > > Yes, such requirements have been well studied in the past. There should be
> > > plenty of protocols to do this.
> > > 
> > > 
> > > >    But the contents will be
> > > > implementation-specific for some devices.
> > > 
> > > If we allow this, it breaks the spec's effort at having standard devices.
> > > And it will block real customers.
> > If we forbid this then devices for which migration is technically
> > possible will be unmigratable. Both users and device implementors will
> > find other solutions, like VFIO, so I don't think we can stop them even
> > if we tried.
> 
> 
> The difference is standard device vs vendor specific device. We can't avoid
> the implementation specific state for vendor specific hardware.
> 
> But for standard devices, we don't want to end up with migration drivers. (Do
> we want vendors to ship vendor specific migration drivers for NVMe
> devices?)
> 
> 
> > 
> > I recognize that opaque device state poses a risk to migration
> > compatibility, because device implementors may arbitrarily use opaque
> > state when a standard is available.
> > 
> > However, the way to avoid this scenario is by:
> > 
> > 1. Making the standard migration approach the easiest to implement
> >     because everything has been taken care of. It will save implementors
> >     the headache of defining and coding their own device state
> >     representations and versioning.
> > 
> > 2. Educate users about migration compatibility so they can identify
> >     when implementors are locking in their users.
> 
> 
> For vendor specific devices, this may work. But for standard devices like
> virtio, we should go further.
> 
> The device states should be defined in the spec clearly. We should re-visit
> the design if those states contain anything that is implementation
> specific.

Can you describe how migrating virtiofs devices should work? I think
that might be quicker than if I reply to each of your points because our
views are still quite far apart.

Stefan


* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-22  2:08                                           ` Jason Wang
@ 2021-07-22 10:30                                             ` Stefan Hajnoczi
  0 siblings, 0 replies; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-22 10:30 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On Thu, Jul 22, 2021 at 10:08:51AM +0800, Jason Wang wrote:
> 
> On 2021/7/21 at 6:42 PM, Stefan Hajnoczi wrote:
> > On Wed, Jul 21, 2021 at 10:52:15AM +0800, Jason Wang wrote:
> > > On 2021/7/20 at 6:19 PM, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 20, 2021 at 11:02:42AM +0800, Jason Wang wrote:
> > > > > On 2021/7/19 at 8:43 PM, Stefan Hajnoczi wrote:
> > > > > > On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/15 at 6:01 PM, Stefan Hajnoczi wrote:
> > > > > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/14 at 11:07 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2021/7/14 at 5:53 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > On 2021/7/13 at 6:00 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > On 2021/7/12 at 5:57 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > > On 2021/7/11 at 4:36 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > > > > > > >            If I understand correctly, this is all
> > > > > > > > > > > > > > > > > > > > > driven from the driver inside the guest, so for this to work
> > > > > > > > > > > > > > > > > > > > > the guest must be running and already have initialised the driver.
> > > > > > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > As I see it, the feature can be driven entirely by the VMM as long as
> > > > > > > > > > > > > > > > > > > it intercepts the relevant configuration space (PCI, MMIO, etc.) from
> > > > > > > > > > > > > > > > > > > guest's reads and writes, and present it as coherent and transparent
> > > > > > > > > > > > > > > > > > > for the guest. Some use cases I can imagine with a physical device (or
> > > > > > > > > > > > > > > > > > > vp_vpda device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 1) The VMM chooses not to pass the feature flag. The guest cannot stop
> > > > > > > > > > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > > > > > > > 2) The VMM passes the flag to the guest. The guest can stop the device.
> > > > > > > > > > > > > > > > > > > 2.1) The VMM stops the device to perform a live migration, and the
> > > > > > > > > > > > > > > > > > > guest does not write to STOP in any moment of the LM. It resets the
> > > > > > > > > > > > > > > > > > > destination device with the state, and then initializes the device.
> > > > > > > > > > > > > > > > > > > 2.2) The guest stops the device and, when STOP(32) is set, the source
> > > > > > > > > > > > > > > > > > > VMM migrates the device status. The destination VMM notices the bit,
> > > > > > > > > > > > > > > > > > > so it sets the bit in the destination too after device initialization.
> > > > > > > > > > > > > > > > > > > 2.3) The device is not initialized by the guest so it doesn't matter
> > > > > > > > > > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > > > > It's doable like this. It's all a lot of hoops to jump through though.
> > > > > > > > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > > > > > > > It just requires a new status bit. Is there anything that makes you think
> > > > > > > > > > > > > > > > > it's hard to implement?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > E.g for networking device, it should be sufficient to use this bit + the
> > > > > > > > > > > > > > > > > virtqueue state.
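For a split virtqueue, the "virtqueue state" referred to here is essentially two free-running indices, matching the cover letter's split into available state and used state (used_idx was added precisely because the used ring is read-only for the driver). A sketch with illustrative field names, not taken from the spec:

```c
#include <stdint.h>

/* Hypothetical per-virtqueue migration state, following the cover
 * letter's split into available state and used state. For a split
 * virtqueue, the available state is the next avail ring index the
 * device will read and the used state is the next used ring index the
 * device will write. */
struct vq_split_state {
    uint16_t last_avail_idx; /* available state */
    uint16_t used_idx;       /* used state */
};

/* Number of descriptors the device has fetched but not yet marked
 * used. The indices are free-running, so unsigned 16-bit subtraction
 * handles wraparound naturally. */
static inline uint16_t vq_inflight(const struct vq_split_state *s)
{
    return (uint16_t)(s->last_avail_idx - s->used_idx);
}
```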
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Why don't we design the feature in a way that is useable by VMMs
> > > > > > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > > > > > It uses common techniques like register shadowing, without any further
> > > > > > > > > > > > > > > > > machinery.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > (I think we all know migration will be very hard if we simply pass through
> > > > > > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > > > > > If an admin virtqueue is used instead of the STOP Device Status field
> > > > > > > > > > > > > > > > bit then there's no need to re-read the Device Status field in a loop
> > > > > > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > > > > > Probably not. Let me clarify several points:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > - This proposal has nothing to do with the admin virtqueue. Actually, the admin
> > > > > > > > > > > > > > > virtqueue could be used to carry any basic device facility, like the status
> > > > > > > > > > > > > > > bit. E.g. I'm going to post patches that use the admin virtqueue as a "transport"
> > > > > > > > > > > > > > > for device slicing at the virtio level.
> > > > > > > > > > > > > > > - Even if we had introduced the admin virtqueue, we still need a per function
> > > > > > > > > > > > > > > interface for this. This is a must for nested virtualization; we can't
> > > > > > > > > > > > > > > always expect things like the PF to be assigned to the L1 guest.
> > > > > > > > > > > > > > > - According to the proposal, there's no need for the device to complete all
> > > > > > > > > > > > > > > the consumed buffers; the device can choose to expose those in-flight descriptors
> > > > > > > > > > > > > > > in a device specific way and set the STOP bit. This means that if we have a
> > > > > > > > > > > > > > > device specific in-flight descriptor reporting facility, the device can
> > > > > > > > > > > > > > > set the STOP bit almost immediately.
> > > > > > > > > > > > > > > - If we don't go with the basic device facility but instead use the admin
> > > > > > > > > > > > > > > virtqueue specific method, we still need to clarify how it works with the
> > > > > > > > > > > > > > > device status state machine; it will be some kind of sub-state, which looks
> > > > > > > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > When migrating a guest with many VIRTIO devices a busy waiting approach
> > > > > > > > > > > > > > > > extends downtime if implemented sequentially (stopping one device at a
> > > > > > > > > > > > > > > > time).
> > > > > > > > > > > > > > > Well, you need some kind of waiting for sure; the device/DMA needs some time
> > > > > > > > > > > > > > > to be stopped. The downtime is determined by the specific virtio
> > > > > > > > > > > > > > > implementation, which is hard to restrict at the spec level. We can
> > > > > > > > > > > > > > > clarify that the device must set the STOP bit within e.g. 100ms.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > >          It can be implemented concurrently (setting the STOP bit on all
> > > > > > > > > > > > > > > > devices and then looping until all their Device Status fields have the
> > > > > > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > > > > > I still don't get what kind of complexity you are worried about here.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > > > > > waiting...
> > > > > > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common configuration structure layout
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > After writing 0 to device_status, the driver MUST wait for a read of
> > > > > > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Since it was required for at least one transport, we need to do something
> > > > > > > > > > > > > > > similar when introducing a basic facility.
> > > > > > > > > > > > > > Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
> > > > > > > > > > > > > > spec change. I like that.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > > > > > unbounded. For example, software virtio-blk/scsi implementations
> > > > > > > > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The natural interface for long-running operations is virtqueue requests.
> > > > > > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > > > > > instead of a Device Status bit.
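For reference, the busy-wait pattern under discussion here is only a few lines of driver code; the open question is bounding the wait. A sketch, assuming a memory-mapped Device Status byte and using the STOP(32) bit value mentioned earlier in the thread (DRIVER_OK is 4 in the spec):

```c
#include <stdint.h>

#define VIRTIO_CONFIG_S_STOP 32 /* STOP(32), per the discussion above */

/* Poll the Device Status field until the device reports STOP or the
 * polling budget runs out. Returns 0 on success, -1 on timeout. To
 * stop many devices concurrently, a VMM would set the STOP bit on all
 * of them first and only then run this loop for each device, rather
 * than stopping them one at a time. */
int wait_for_stop(const volatile uint8_t *status, unsigned int max_polls)
{
    while (max_polls--) {
        if (*status & VIRTIO_CONFIG_S_STOP)
            return 0;
        /* a real driver would sleep or yield between reads */
    }
    return -1;
}
```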
> > > > > > > > > > > > > So I'm not against the admin virtqueue. As said before, admin virtqueue
> > > > > > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Send a command to the admin virtqueue to set the STOP status bit. The device
> > > > > > > > > > > > > will mark the command buffer used after it has successfully stopped.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > AFAIK, they are not mutually exclusive, since they are trying to solve
> > > > > > > > > > > > > different problems.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Device status - basic device facility
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Admin virtqueue - transport/device specific way to implement (part of) the
> > > > > > > > > > > > > device facility
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Although you mentioned that the stopped state needs to be reflected in
> > > > > > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > > > > > migrated.
> > > > > > > > > > > > > The guest won't see the real device status bit. VMM will shadow the device
> > > > > > > > > > > > > status bit in this case.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > E.g. with the current vhost-vDPA, vDPA behaves like a vhost device; the guest is
> > > > > > > > > > > > > unaware of the migration.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The STOP status bit is set by QEMU on the real virtio hardware, but the guest
> > > > > > > > > > > > > will only see DRIVER_OK without STOP.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It's not hard to implement the nested case on top; see the discussion initiated
> > > > > > > > > > > > > by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live
> > > > > > > > > > > > > migration.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > >         In fact, the VMM would need to hide this bit and it's safer to
> > > > > > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > > > > > See above, VMM may choose to hide or expose the capability. It's useful for
> > > > > > > > > > > > > migrating a nested guest.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > If we design an interface that can't be used in the nested environment, it's
> > > > > > > > > > > > > not an ideal interface.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > In addition, stateful devices need to load/save non-trivial amounts of
> > > > > > > > > > > > > > data. They need DMA to do this efficiently, so an admin virtqueue is a
> > > > > > > > > > > > > > good fit again.
> > > > > > > > > > > > > I don't get the point here. You still need to address exactly the same
> > > > > > > > > > > > > issues for the admin virtqueue: the unbounded time in freezing the device, the
> > > > > > > > > > > > > interaction with the virtio device status state machine.
> > > > > > > > > > > > Device state can be large so a register interface would be a
> > > > > > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > > > > > saving/loading device state.
> > > > > > > > > > > So this patch doesn't mandate a register interface, does it?
> > > > > > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
> > > > > > > > > > it as a register interface.
> > > > > > > > > > 
> > > > > > > > > > > And DMA
> > > > > > > > > > > doesn't mean a virtqueue; it could be a transport specific method.
> > > > > > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
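The standard split-ring memory layout referred to here is defined in the virtio spec and is identical across transports, which is why virtqueues make a good transport-independent interface. Roughly:

```c
#include <stdint.h>

/* The standard split virtqueue memory layout from the virtio spec,
 * shared by all transports (PCI, MMIO, channel I/O). */
struct vring_desc {
    uint64_t addr;   /* guest-physical buffer address */
    uint32_t len;
    uint16_t flags;  /* NEXT, WRITE, INDIRECT */
    uint16_t next;   /* chaining via descriptor index */
};

struct vring_avail {
    uint16_t flags;
    uint16_t idx;    /* driver increments after adding entries */
    uint16_t ring[]; /* descriptor chain heads */
};

struct vring_used_elem {
    uint32_t id;     /* head of the completed descriptor chain */
    uint32_t len;    /* bytes written by the device */
};

struct vring_used {
    uint16_t flags;
    uint16_t idx;    /* device increments after adding entries */
    struct vring_used_elem ring[];
};
```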
> > > > > > > > > > 
> > > > > > > > > > > I think we need to start from defining the state of one specific device and
> > > > > > > > > > > see what is the best interface.
> > > > > > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > > > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > > > > > > > 
> > > > > > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > > > > > state of the device or just state that is unknown to the VMM.
> > > > > > > > > I think we've discussed this in the past. It can't work since:
> > > > > > > > > 
> > > > > > > > > 1) The state and its format must be clearly defined in the spec
> > > > > > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > > > > > Some devices need implementation-specific state. They should still be
> > > > > > > > able to live migrate even if it means cross-implementation migration and
> > > > > > > > debug-ability is not possible.
> > > > > > > I think we need to revisit this conclusion. Migration compatibility is
> > > > > > > pretty important, especially considering the software stack has spent a huge
> > > > > > > amount of effort maintaining it.
> > > > > > > 
> > > > > > > If a virtio hardware device breaks this, it means we lose all the
> > > > > > > advantages of being a standard device.
> > > > > > > 
> > > > > > > If we can't do live migration among:
> > > > > > > 
> > > > > > > 1) different backends, e.g. migrating from virtio hardware to software
> > > > > > > 2) different vendors
> > > > > > > 
> > > > > > > then we have failed as a standard device, and the customer is in fact
> > > > > > > locked in by the vendor implicitly.
> > > > > > My virtiofs device implementation is backed by an in-memory file system.
> > > > > > The device state includes the contents of each file.
> > > > > > 
> > > > > > Your virtiofs device implementation uses Linux file handles to keep
> > > > > > track of open files. The device state includes Linux file handles (but
> > > > > > not the contents of each file) so the destination host can access the
> > > > > > same files on shared storage.
> > > > > > 
> > > > > > Cornelia's virtiofs device implementation is backed by an object storage
> > > > > > HTTP API. The device state includes API object IDs.
> > > > > > 
> > > > > > The device state is implementation-dependent. There is no standard
> > > > > > representation and it's not possible to migrate between device
> > > > > > implementations. How are they supposed to migrate?
> > > > > So if I understand correctly, virtio-fs is not designed to be migratable?
> > > > > 
> > > > > (Checking the current virtio-fs support in QEMU, it looks to me like it
> > > > > has a migration blocker.)
> > > > The code does not support live migration but it's on the roadmap. Max
> > > > Reitz added Linux file handle support to virtiofsd. That was the first
> > > > step towards being able to migrate the device's state.
> > > 
> > > A dumb question: how does QEMU know it is connected to virtiofsd?
> > virtiofsd is a vhost-user-fs device. QEMU doesn't know if it's connected
> > to virtiofsd or another implementation.
> 
> 
> That's my understanding. So this basically answers my question: there could
> be a common migration implementation for each virtio-fs device, which implies
> that we only need to migrate the common device specific state but not
> implementation specific state.
> 
> 
> > 
> > > > > > This is why I think it's necessarily to allow implementation-specific
> > > > > > device state representations.
> > > > > Or you probably mean you don't support cross backend migration. This sounds
> > > > > like a drawback and it's actually not a standard device but a
> > > > > vendor/implementation specific device.
> > > > > 
> > > > > It would bring a lot of trouble, not only for the implementation but for
> > > > > the management. Maybe we can begin by adding migration support for
> > > > > some specific backend and go from there.
> > > > Yes, it's complicated. Some implementations could be compatible, but
> > > > others can never be compatible because they have completely different
> > > > state.
> > > > 
> > > > The virtiofsd implementation is the main one for virtiofs and the device
> > > > state representation can be published, even standardized. Others can
> > > > implement it to achieve migration compatibility.
> > > > 
> > > > But it must be possible for implementations that have completely
> > > > different state to migrate too. virtiofsd isn't special.
> > > > 
> > > > > > 3) Not a proper uAPI design
> > > > > > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > > > > > opaque data from devices to userspace. Allowing an
> > > > > > > > implementation-specific device state representation is nothing new. VFIO
> > > > > > > > already does it.
> > > > > > > I think we've already had a lot of discussion about VFIO but without a
> > > > > > > conclusion. Maybe we need a verdict from Linus or Greg (not sure if it's
> > > > > > > too late). But that's not related to virtio or this thread.
> > > > > > > 
> > > > > > > What you propose here is kind of in conflict with the efforts of virtio. I
> > > > > > > think we all agree that we should define the state in the spec. Assuming
> > > > > > > this is correct:
> > > > > > > 
> > > > > > > 1) why do we still offer opaque migration state to userspace?
> > > > > > See above. Stateful devices may require an implementation-defined device
> > > > > > state representation.
> > > > > So my point still stands: it's not a standard device if we do this.
> > > > These "non-standard devices" still need to be able to migrate.
> > > 
> > > See the other thread; it breaks the effort of having a spec.
> > > 
> > > >    How
> > > > should we do that?
> > > 
> > > I think the main issue is that, to me, it's not a virtio device but a device
> > > that uses virtqueues with implementation specific state. So it can't
> > > be migrated by the virtio subsystem, only through a vendor/implementation
> > > specific migration driver.
> > Okay. Are you thinking about a separate set of vDPA APIs and vhost
> > ioctls so the VMM can save/load implementation-specific device state?
> 
> 
> Probably not, I think the question is can we define the virtio-fs device
> state in the spec? If yes (and I think the answer is yes), we're fine. If
> not, it looks like we need to improve the spec or design.

There isn't one device state that contains all the information needed to
migrate virtiofs devices - unless implementation-specific device state
is allowed.

The implementation-specific device state could be added to the spec, but
it doesn't help: each implementation will use its subset of the device
state from the spec and cross-implementation migration still won't be
possible.

Stefan
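[Editorial note: one way to picture the compromise under discussion — spec-defined state plus an implementation-specific section — is a save-format envelope. The sketch below is purely illustrative; the field names, the JSON encoding, and the implementation-identifier idea (echoing the "identifier registry" mentioned elsewhere in the thread) are assumptions, not anything defined by the VIRTIO spec.]

```python
import json

# Hypothetical save format: a standardized envelope carrying both the
# spec-defined state and an opaque, implementation-specific blob.

def save_device_state(virtqueue_states, impl_id, impl_blob):
    return json.dumps({
        "standard": {"virtqueues": virtqueue_states},  # spec-defined part
        "impl_id": impl_id,             # names the opaque format's owner
        "impl_state": impl_blob.hex(),  # opaque to everyone but impl_id
    }).encode()

def load_device_state(data, expected_impl_id):
    state = json.loads(data)
    std = state["standard"]
    # Any implementation can restore the standard part, but the opaque
    # part is only usable by a matching implementation -- which is the
    # cross-implementation migration limitation Stefan describes above.
    if state["impl_id"] != expected_impl_id:
        return std, None
    return std, bytes.fromhex(state["impl_state"])

blob = save_device_state([{"avail_idx": 5, "used_idx": 5}], "virtiofsd", b"\x01\x02")
std, impl = load_device_state(blob, "virtiofsd")
assert std["virtqueues"][0]["avail_idx"] == 5
assert impl == b"\x01\x02"
_, impl2 = load_device_state(blob, "other-impl")
assert impl2 is None
```

A destination implementation with a different `impl_id` still gets the standard section, but cannot resume the opaque state — making the compatibility check explicit rather than a silent failure.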


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-22 10:24                                               ` Stefan Hajnoczi
@ 2021-07-22 13:08                                                 ` Jason Wang
  2021-07-26 15:07                                                   ` Stefan Hajnoczi
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-07-22 13:08 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic


On 2021/7/22 6:24 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 22, 2021 at 03:33:10PM +0800, Jason Wang wrote:
>> On 2021/7/21 6:20 PM, Stefan Hajnoczi wrote:
>>> On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
>>>> On 2021/7/20 4:50 PM, Stefan Hajnoczi wrote:
>>>>> On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/19 8:45 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/16 10:03 AM, Jason Wang wrote:
>>>>>>>>> On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
>>>>>>>>>> On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
>>>>>>>>>>> On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>> On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
>>>>>>>>>>>>> On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>> On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>>>> On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
>>>>>>>>>>>>>>>>>> On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
>>>>>>>>>>>>>>>>>>> On 2021/7/11 4:36 AM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>>>>>>>> On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>>>>>>>>>>>>>> If I understand correctly, this is all driven from the driver
>>>>>>>>>>>>>>>>>>>>>>> inside the guest, so for this to work the guest must be running
>>>>>>>>>>>>>>>>>>>>>>> and already have initialised the driver.
>>>>>>>>>>>>>>>>>>>>>> Yes.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As I see it, the feature can be driven entirely by the VMM as long
>>>>>>>>>>>>>>>>>>>>> as it intercepts the relevant configuration space (PCI, MMIO, etc)
>>>>>>>>>>>>>>>>>>>>> from the guest's reads and writes, and presents it as coherent and
>>>>>>>>>>>>>>>>>>>>> transparent for the guest. Some use cases I can imagine with a
>>>>>>>>>>>>>>>>>>>>> physical device (or vp_vdpa device) with VIRTIO_F_STOP:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 1) The VMM chooses not to pass the feature flag. The guest cannot
>>>>>>>>>>>>>>>>>>>>> stop the device, so any write to this flag is an error/undefined.
>>>>>>>>>>>>>>>>>>>>> 2) The VMM passes the flag to the guest. The guest can stop the
>>>>>>>>>>>>>>>>>>>>> device.
>>>>>>>>>>>>>>>>>>>>> 2.1) The VMM stops the device to perform a live migration, and the
>>>>>>>>>>>>>>>>>>>>> guest does not write to STOP at any moment of the LM. It resets the
>>>>>>>>>>>>>>>>>>>>> destination device with the state, and then initializes the device.
>>>>>>>>>>>>>>>>>>>>> 2.2) The guest stops the device and, when STOP(32) is set, the
>>>>>>>>>>>>>>>>>>>>> source VMM migrates the device status. The destination VMM realizes
>>>>>>>>>>>>>>>>>>>>> the bit, so it sets the bit in the destination too after device
>>>>>>>>>>>>>>>>>>>>> initialization.
>>>>>>>>>>>>>>>>>>>>> 2.3) The device is not initialized by the guest so it doesn't
>>>>>>>>>>>>>>>>>>>>> matter what bit the HW has, but the VM can be migrated.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Am I missing something?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>> It's doable like this. It's all a lot of hoops to jump through
>>>>>>>>>>>>>>>>>>>> though. It's also not easy for devices to implement.
>>>>>>>>>>>>>>>>>>> It just requires a new status bit. Anything that makes you think
>>>>>>>>>>>>>>>>>>> it's hard to implement?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> E.g. for a networking device, it should be sufficient to use this
>>>>>>>>>>>>>>>>>>> bit + the virtqueue state.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Why don't we design the feature in a way that is useable by VMMs
>>>>>>>>>>>>>>>>>>>> and implementable by devices in a simple way?
>>>>>>>>>>>>>>>>>>> It uses common techniques like register shadowing without any
>>>>>>>>>>>>>>>>>>> further machinery.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Or do you have any other ideas?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> (I think we all know migration will be very hard if we simply pass
>>>>>>>>>>>>>>>>>>> through those state registers.)
>>>>>>>>>>>>>>>>>> If an admin virtqueue is used instead of the STOP Device Status field
>>>>>>>>>>>>>>>>>> bit then there's no need to re-read the Device Status field in a loop
>>>>>>>>>>>>>>>>>> until the device has stopped.
>>>>>>>>>>>>>>>>> Probably not. Let me clarify several points:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - This proposal has nothing to do with
>>>>>>>>>>>>>>>>> admin virtqueue. Actually, admin
>>>>>>>>>>>>>>>>> virtqueue could be used for carrying any
>>>>>>>>>>>>>>>>> basic device facility like status
>>>>>>>>>>>>>>>>> bit. E.g I'm going to post patches that
>>>>>>>>>>>>>>>>> use admin virtqueue as a "transport"
>>>>>>>>>>>>>>>>> for device slicing at virtio level.
>>>>>>>>>>>>>>>>> - Even if we had introduced admin
>>>>>>>>>>>>>>>>> virtqueue, we still need a per function
>>>>>>>>>>>>>>>>> interface for this. This is a must for
>>>>>>>>>>>>>>>>> nested virtualization, we can't
>>>>>>>>>>>>>>>>> always expect things like PF can be assigned to L1 guest.
>>>>>>>>>>>>>>>>> - According to the proposal, there's no
>>>>>>>>>>>>>>>>> need for the device to complete all
>>>>>>>>>>>>>>>>> the consumed buffers, device can choose
>>>>>>>>>>>>>>>>> to expose those inflight descriptors
>>>>>>>>>>>>>>>>> in a device specific way and set the
>>>>>>>>>>>>>>>>> STOP bit. This means, if we have the
>>>>>>>>>>>>>>>>> device specific in-flight descriptor
>>>>>>>>>>>>>>>>> reporting facility, the device can
>>>>>>>>>>>>>>>>> almost set the STOP bit immediately.
>>>>>>>>>>>>>>>>> - If we don't go with the basic device
>>>>>>>>>>>>>>>>> facility but using the admin
>>>>>>>>>>>>>>>>> virtqueue specific method, we still need
>>>>>>>>>>>>>>>>> to clarify how it works with the
>>>>>>>>>>>>>>>>> device status state machine, it will be
>>>>>>>>>>>>>>>>> some kind of sub-states which looks
>>>>>>>>>>>>>>>>> much more complicated than the current proposal.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When migrating a guest with many
>>>>>>>>>>>>>>>>>> VIRTIO devices a busy waiting
>>>>>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>>>>>> extends downtime if implemented
>>>>>>>>>>>>>>>>>> sequentially (stopping one device at
>>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>>> time).
>>>>>>>>>>>>>>>>> Well, you need some kind of waiting for sure; the
>>>>>>>>>>>>>>>>> device/DMA needs some time to be stopped. The downtime is
>>>>>>>>>>>>>>>>> determined by a specific virtio implementation, which is hard
>>>>>>>>>>>>>>>>> to restrict at the spec level. We can clarify that the device
>>>>>>>>>>>>>>>>> must set the STOP bit within e.g. 100ms.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>           It can be implemented
>>>>>>>>>>>>>>>>>> concurrently (setting the STOP bit
>>>>>>>>>>>>>>>>>> on all
>>>>>>>>>>>>>>>>>> devices and then looping until all
>>>>>>>>>>>>>>>>>> their Device Status fields have the
>>>>>>>>>>>>>>>>>> bit set), but this becomes more complex to implement.
>>>>>>>>>>>>>>>>> I still don't get what kind of complexity you are worried about here.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I'm a little worried about adding a new bit that requires busy
>>>>>>>>>>>>>>>>>> waiting...
>>>>>>>>>>>>>>>>> Busy wait is not something that is introduced in this patch:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 4.1.4.3.2 Driver Requirements: Common
>>>>>>>>>>>>>>>>> configuration structure layout
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> After writing 0 to device_status, the
>>>>>>>>>>>>>>>>> driver MUST wait for a read of
>>>>>>>>>>>>>>>>> device_status to return 0 before reinitializing the device.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since it was already required for at least one
>>>>>>>>>>>>>>>>> transport, we need to do something
>>>>>>>>>>>>>>>>> similar when introducing a basic facility.
>>>>>>>>>>>>>>>> Adding STOP as a Device Status bit
>>>>>>>>>>>>>>>> is a small and clean VIRTIO
>>>>>>>>>>>>>>>> spec change. I like that.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On the other hand, devices need time to stop and that time can be
>>>>>>>>>>>>>>>> unbounded. For example, software
>>>>>>>>>>>>>>>> virtio-blk/scsi implementations
>>>>>>>>>>>>>>>> cannot immediately cancel in-flight I/O requests on Linux hosts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The natural interface for long-running
>>>>>>>>>>>>>>>> operations is virtqueue requests.
>>>>>>>>>>>>>>>> That's why I mentioned the alternative of using an admin virtqueue
>>>>>>>>>>>>>>>> instead of a Device Status bit.
>>>>>>>>>>>>>>> So I'm not against the admin virtqueue. As said
>>>>>>>>>>>>>>> before, admin virtqueue
>>>>>>>>>>>>>>> could be used for carrying the device status bit.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Send a command to set the STOP status bit on the admin
>>>>>>>>>>>>>>> virtqueue. The device will mark
>>>>>>>>>>>>>>> the command buffer used after it has
>>>>>>>>>>>>>>> successfully stopped the device.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> AFAIK, they are not mutually exclusive, since
>>>>>>>>>>>>>>> they are trying to solve
>>>>>>>>>>>>>>> different problems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Device status - basic device facility
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Admin virtqueue - transport/device specific way
>>>>>>>>>>>>>>> to implement (part of) the
>>>>>>>>>>>>>>> device facility
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Although you mentioned that the stopped
>>>>>>>>>>>>>>>> state needs to be reflected in
>>>>>>>>>>>>>>>> the Device Status field somehow, I'm not sure about that since the
>>>>>>>>>>>>>>>> driver typically doesn't need to know whether the device is being
>>>>>>>>>>>>>>>> migrated.
>>>>>>>>>>>>>>> The guest won't see the real device status bit.
>>>>>>>>>>>>>>> VMM will shadow the device
>>>>>>>>>>>>>>> status bit in this case.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> E.g. with the current vhost-vDPA, vDPA behaves
>>>>>>>>>>>>>>> like a vhost device; the guest is
>>>>>>>>>>>>>>> unaware of the migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> STOP status bit is set by Qemu to real virtio
>>>>>>>>>>>>>>> hardware. But guest will only
>>>>>>>>>>>>>>> see the DRIVER_OK without STOP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's not hard to implement the nested case on top;
>>>>>>>>>>>>>>> see the discussion initiated
>>>>>>>>>>>>>>> by Eugenio about how to expose VIRTIO_F_STOP to the guest for
>>>>>>>>>>>>>>> nested live migration.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>          In fact, the VMM would need to hide
>>>>>>>>>>>>>>>> this bit and it's safer to
>>>>>>>>>>>>>>>> keep it out-of-band instead of risking exposing it by accident.
>>>>>>>>>>>>>>> See above, VMM may choose to hide or expose the
>>>>>>>>>>>>>>> capability. It's useful for
>>>>>>>>>>>>>>> migrating a nested guest.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If we design an interface that can't be used in
>>>>>>>>>>>>>>> the nested environment, it's
>>>>>>>>>>>>>>> not an ideal interface.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In addition, stateful devices need to
>>>>>>>>>>>>>>>> load/save non-trivial amounts of
>>>>>>>>>>>>>>>> data. They need DMA to do this efficiently,
>>>>>>>>>>>>>>>> so an admin virtqueue is a
>>>>>>>>>>>>>>>> good fit again.
>>>>>>>>>>>>>>> I don't get the point here. You still need to
>>>>>>>>>>>>>>> address exactly the same
>>>>>>>>>>>>>>> issues for the admin virtqueue: the unbounded time in
>>>>>>>>>>>>>>> freezing the device, and the
>>>>>>>>>>>>>>> interaction with the virtio device status state machine.
>>>>>>>>>>>>>> Device state can be large so a register interface would be a
>>>>>>>>>>>>>> bottleneck. DMA is needed. I think a virtqueue is a good fit for
>>>>>>>>>>>>>> saving/loading device state.
>>>>>>>>>>>>> So this patch doesn't mandate a register interface, does it?
>>>>>>>>>>>> You're right, not this patch. I mentioned it because your other patch
>>>>>>>>>>>> series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
>>>>>>>>>>>> implements
>>>>>>>>>>>> it as a register interface.
>>>>>>>>>>>>
>>>>>>>>>>>>> And DMA
>>>>>>>>>>>>> doesn't mean a virtqueue; it could be a transport-specific method.
>>>>>>>>>>>> Yes, although virtqueues are a pretty good interface that works across
>>>>>>>>>>>> transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
>>>>>>>>>>>>
>>>>>>>>>>>>> I think we need to start from defining the state of one
>>>>>>>>>>>>> specific device and
>>>>>>>>>>>>> see what is the best interface.
>>>>>>>>>>>> virtio-blk might be the simplest. I think virtio-net has more device
>>>>>>>>>>>> state and virtio-scsi is definitely more complex than virtio-blk.
>>>>>>>>>>>>
>>>>>>>>>>>> First we need agreement on whether "device state" encompasses the full
>>>>>>>>>>>> state of the device or just state that is unknown to the VMM.
>>>>>>>>>>> I think we've discussed this in the past. It can't work since:
>>>>>>>>>>>
>>>>>>>>>>> 1) The state and its format must be clearly defined in the spec
>>>>>>>>>>> 2) We need to maintain migration compatibility and debug-ability
>>>>>>>>>> Some devices need implementation-specific state. They should still be
>>>>>>>>>> able to live migrate even if it means cross-implementation migration and
>>>>>>>>>> debug-ability is not possible.
>>>>>>>>> I think we need to re-visit this conclusion. Migration compatibility is
>>>>>>>>> pretty important, especially considering the software stack has spent a
>>>>>>>>> huge amount of effort in maintaining it.
>>>>>>>>>
>>>>>>>>> If a virtio hardware device breaks this, it means we lose all the
>>>>>>>>> advantages of being a standard device.
>>>>>>>>>
>>>>>>>>> If we can't do live migration among:
>>>>>>>>>
>>>>>>>>> 1) different backends, e.g. migrating from virtio hardware to virtio
>>>>>>>>> software
>>>>>>>>> 2) different vendors
>>>>>>>>>
>>>>>>>>> We fail to qualify as a standard device and the customer is in fact
>>>>>>>>> implicitly locked in by the vendor.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> 3) Not a proper uAPI design
>>>>>>>>>> I never understood this argument. The Linux uAPI passes through lots of
>>>>>>>>>> opaque data from devices to userspace. Allowing an
>>>>>>>>>> implementation-specific device state representation is nothing new. VFIO
>>>>>>>>>> already does it.
>>>>>>>>> I think we've already had a lot of discussion about VFIO but without a
>>>>>>>>> conclusion. Maybe we need the verdict from Linus or Greg (not sure if
>>>>>>>>> it's too late). But that's not related to virtio and this thread.
>>>>>>>>>
>>>>>>>>> What you propose here is kind of in conflict with the efforts of virtio.
>>>>>>>>> I think we all agree that we should define the state in the spec.
>>>>>>>>> Assuming this is correct:
>>>>>>>>>
>>>>>>>>> 1) why do we still offer opaque migration state to userspace?
>>>>>>>>> 2) how can it be integrated into the current VMM (Qemu) virtio devices'
>>>>>>>>> migration bytes stream?
>>>>>>>>>
>>>>>>>>> We should standardize everything that is visible to the driver to be a
>>>>>>>>> standard device. That's the power of virtio.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> That's
>>>>>>>>>>>> basically the difference between the vhost/vDPA's selective
>>>>>>>>>>>> passthrough
>>>>>>>>>>>> approach and VFIO's full passthrough approach.
>>>>>>>>>>> We can't do VFIO full passthrough for migration anyway, some kind
>>>>>>>>>>> of mdev is
>>>>>>>>>>> required but it's duplicated with the current vp_vdpa driver.
>>>>>>>>>> I'm not sure that's true. Generic VFIO PCI migration can probably be
>>>>>>>>>> achieved without mdev:
>>>>>>>>>> 1. Define a migration PCI Capability that indicates support for
>>>>>>>>>>         VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
>>>>>>>>>>         the migration interface in hardware instead of an mdev driver.
>>>>>>>>> So I think it still depends on the driver to implement migration state,
>>>>>>>>> which is vendor-specific.
>>>>>>>>>
>>>>>>>>> Note that it's just an uAPI definition not something defined in the PCI
>>>>>>>>> spec.
>>>>>>>>>
>>>>>>>>> Out of curiosity, the patch is merged without any real users in the
>>>>>>>>> Linux. This is very bad since we lose the chance to audit the whole
>>>>>>>>> design.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2. The VMM either uses the migration PCI Capability directly from
>>>>>>>>>>         userspace or core VFIO PCI code advertises
>>>>>>>>>> VFIO_REGION_TYPE_MIGRATION
>>>>>>>>>>         to userspace so migration can proceed in the same way as with
>>>>>>>>>>         VFIO/mdev drivers.
>>>>>>>>>> 3. The PCI Capability is not passed through to the guest.
>>>>>>>>> This brings trouble in nested environments.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Changpeng Liu originally mentioned the idea of defining a migration PCI
>>>>>>>>>> Capability.
>>>>>>>>>>
>>>>>>>>>>>>        For example, some of the
>>>>>>>>>>>> virtio-net state is available to the VMM with vhost/vDPA because it
>>>>>>>>>>>> intercepts the virtio-net control virtqueue.
>>>>>>>>>>>>
>>>>>>>>>>>> Also, we need to decide to what degree the device state representation
>>>>>>>>>>>> is standardized in the VIRTIO specification.
>>>>>>>>>>> I think all the states must be defined in the spec otherwise the device
>>>>>>>>>>> can't claim it supports migration at virtio level.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>        I think it makes sense to
>>>>>>>>>>>> standardize it if it's possible to convey all necessary
>>>>>>>>>>>> state and device
>>>>>>>>>>>> implementors can easily implement this device state representation.
>>>>>>>>>>> I doubt it; it's highly device-specific. E.g. can we standardize device (GPU)
>>>>>>>>>>> memory?
>>>>>>>>>> For devices that have little internal state it's possible to define a
>>>>>>>>>> standard device state representation.
>>>>>>>>>>
>>>>>>>>>> For other devices, like virtio-crypto, virtio-fs, etc it becomes
>>>>>>>>>> difficult because the device implementation contains state that will be
>>>>>>>>>> needed but is very specific to the implementation. These devices *are*
>>>>>>>>>> migratable but they don't have standard state. Even here there is a
>>>>>>>>>> spectrum:
>>>>>>>>>> - Host OS-specific state (e.g. Linux struct file_handles)
>>>>>>>>>> - Library-specific state (e.g. crypto library state)
>>>>>>>>>> - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
>>>>>>>>>>
>>>>>>>>>> This is why I think it's necessary to support both standard device state
>>>>>>>>>> representations and implementation-specific device state
>>>>>>>>>> representations.
>>>>>>>> Having two ways will bring extra complexity. That's why I suggest:
>>>>>>>>
>>>>>>>> - having a general facility for the virtqueue to be migrated
>>>>>>>> - leaving the device-specific state device-specific, so the device can
>>>>>>>> choose whatever way or interface is convenient.
>>>>>>> I don't think we have a choice. For stateful devices it can be
>>>>>>> impossible to define a standard device state representation.
>>>>>> Let me clarify, I agree we can't have a standard device state for all kinds
>>>>>> of device.
>>>>>>
>>>>>> That's why I tend to leave them device-specific (but not
>>>>>> implementation-specific).
>>>>> Unfortunately device state is sometimes implementation-specific. Not
>>>>> because the device is proprietary, but because the actual state is
>>>>> meaningless to other implementations.
>>>>>
>>>>> I mentioned virtiofs as an example where file system backends can be
>>>>> implemented in completely different ways so the device state cannot be
>>>>> migrated between implementations.
>>>> So let me clarify my understanding: we have two kinds of states:
>>>>
>>>> 1) implementation specific state that is not noticeable by the driver
>>>> 2) device specific state that is noticeable by the driver
>>>>
>>>> We have no interest in 1).
>>>>
>>>> For 2), it's what needs to be defined in the spec. If we fail to generalize
>>>> the device-specific state, it can't be used by a standard virtio driver. Or
>>>> maybe you can give a concrete example of how virtio-fs fails in doing this?
>>> 2) is what I mean when I say a "stateful" device. I agree, 1) is not
>>> relevant to this discussion because we don't need to migrate internal
>>> device state that the driver cannot interact with.
>>>
>>> The virtiofs device has an OPEN request for opening a file. Live
>>> migration must transfer the list of open files from the source to the
>>> destination device so the driver can continue accessing files it
>>> previously had open.
>>>
>>> However, the result of the OPEN request is a number similar to a POSIX
>>> fd, not the full device-internal state associated with an open file.
>>> After migration the driver expects to continue using the number to
>>> operate on the file. We must transfer the open file state to the
>>> destination device.
>>>
>>> Different device implementations may have completely different concepts
>>> of open file state:
>>>
>>> - An in-memory file system. The list of open files is a list of
>>>     in-memory inodes. We'll also need to transfer the entire contents of
>>>     the files/directories since it's in-memory and not shared with the
>>>     destination device.
>>>
>>> - A passthrough Linux file system. We need to transfer the Linux file
>>>     handles (see open_by_handle_at(2)) so the destination device can open
>>>     the inodes on the underlying shared host file system.
>>>
>>> - A distributed object storage API. We need to transfer the list of
>>>     open object IDs so the destination device can perform I/O to the same
>>>     objects.
>>>
>>> - Anyone can create a custom virtiofs device implementation and it will
>>>     rely on different open file state.
>>
>> So it looks to me you want to propose migration drivers for different
>> implementations. It looks to me it's better to go with something
>> easier:
>>
>> 1) Have common driver-visible state defined in the spec, and use it
>> for migration
>> 2) It's the job of the device or backend to "map" the driver-visible
>> state to its implementation-specific state.
>>
>> If 1) is insufficient, we should extend it until it satisfies 2).
>>
>> For virtio-fs, it looks like the issue is that the implementation needs to
>> associate objects in different namespaces (guest vs host).
>>
>> For the above cases:
>>
>> 1) For the in-memory file system, if I understand correctly, it can be
>> accessed directly by the guest via a transport-specific way (e.g. a BAR). In
>> this case, it's driver-noticeable state, so all the memory must be migrated
>> to the destination.
> The virtiofs DAX Window feature is optional. Also, the DAX Window BAR
> only exposes file contents, not file system metadata. The guest driver
> or the BAR contents are not enough to migrate the state of the device.


Yes, so the point is that if the state is noticeable, it needs to be migrated.

And another thing that may cause confusion is that we probably need to
clarify what DAX is in the spec or avoid using Linux-specific
terminology, as virtio can live without it.


>
>> 2) For the passthrough case, it's the implementation's job to do the mapping.
>> I think it doesn't differ a lot from the current migration among shared
>> storage of block devices. (Note that open_by_handle_at() requires
>> CAP_DAC_READ_SEARCH, which I'm not sure can be used.)
> It's different from block devices because the FUSE protocol
> is stateful. Commands like LOOKUP, OPEN, OPENDIR create objects that are
> referred to by a temporary number (similar to POSIX fds). Due to POSIX
> file system semantics there's no way to recreate the mapping correctly
> without Linux file handles (because files can be deleted, renamed, moved
> while they are still open or their inode is known). Shared block devices
> do not have this issue.


See below; I would like to have a look at how Linux file handles are
expected to work.
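[Editorial note: Stefan's statefulness point can be demonstrated on the host side. A plain pathname is not enough to rebuild an open-file table, because POSIX allows the guest to unlink or rename a file while the device still holds it open. In the sketch below, ordinary fds stand in for the device's open-file objects; a real implementation would serialize Linux file handles via name_to_handle_at()/open_by_handle_at() instead, with the CAP_DAC_READ_SEARCH caveat noted above.]

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "f")
with open(path, "w") as f:
    f.write("data")

fd = os.open(path, os.O_RDONLY)  # device state: one entry in the open-file table
os.unlink(path)                  # guest removes the name; the node stays open

# POSIX keeps the open fd valid even though the name is gone:
assert os.read(fd, 4) == b"data"

# But a path-based reconstruction on the destination device would fail:
try:
    os.open(path, os.O_RDONLY)
    reconstructed = True
except FileNotFoundError:
    reconstructed = False
assert not reconstructed
os.close(fd)
```

This is why migrating the FUSE open-file state needs something stronger than pathnames: either shared-storage file handles or a full transfer of the underlying objects.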


>
>> 3) For distributed storage, the implementation should implement the
>> association between driver objects and distributed storage objects (I think
>> each such API should have something like a UUID for a global namespace) and
>> provide a way for reverse lookup.
>> 4) For other implementation, it should be the same as 3)
>>
>> It's the management layer to decide whether we can do cross implementation
>> migration, but qemu should migrate the common driver visible state instead
>> of implementation specific state.
>>
>> We can have a dedicated feature flag for this and block the migration for
>> the devices without this feature.
>>
>> I think we don't want to end up with several migration drivers for
>> virtio-fs.
>>
>>> I imagine virtio-gpu and virtio-crypto might have similar situations
>>> where an object created through a virtqueue request has device-internal
>>> state associated with it that must be migrated.
>>
>> So the point still stands. The device-internal state should be restored from
>> the device-specific state defined in the spec. The spec would guarantee the
>> minimal set of device-specific state; implementations should use that to
>> restore implementation-specific state.
>>
>>
>>>>>> But we can generalize the virtqueue state for sure.
>>>>> I agree and also that some device types can standardize their device
>>>>> state representations. But I think it's a technical requirement to
>>>>> support implementation-specific state for device types where
>>>>> cross-implementation migration is not possible.
>>>> A question here: if the driver depends on implementation-specific state,
>>>> how can we make sure that driver can work with other implementations? If
>>>> we're sure that a single driver can work for all kinds of implementations,
>>>> it means we have device-specific state, not implementation-specific state.
>>> I think this is confusing stateless and stateful devices. You are
>>> describing a stateless device here. I'll try to define a stateful
>>> device:
>>>
>>> A stateful device maintains state that the driver operates on indirectly
>>> via standard requests. For example, the virtio-crypto device has
>>> CREATE_SESSION requests and a session ID is returned to the driver so
>>> further requests can be made on the session object. It may not be
>>> possible to replay, reconnect, or restart the device without losing
>>> state.
>>
>> If it's impossible to do all of these, the state should be noticeable by
>> the driver, and it is not implementation-specific but device-specific, which
>> could be defined in the spec.
>>
>>
>>> I hope that this description, together with the virtiofs specifics
>>> above, make the problem clearer.
>>
>> Yes.
>>
>>
>>>>> I'm not saying the implementation-specific state representation has to
>>>>> be a binary blob. There could be an identifier registry to ensure live
>>>>> migration compatibility checks can be performed. There could also be a
>>>>> standard binary encoding for migration data.
>>>> Yes, such requirements have been well studied in the past. There should be
>>>> plenty of protocols to do this.
>>>>
>>>>
>>>>>     But the contents will be
>>>>> implementation-specific for some devices.
>>>> If we allow this, it breaks the spec's effort of having standard devices.
>>>> And it will block real customers.
>>> If we forbid this then devices for which migration is technically
>>> possible will be unmigratable. Both users and device implementors will
>>> find other solutions, like VFIO, so I don't think we can stop them even
>>> if we tried.
>>
>> The difference is standard device vs vendor specific device. We can't avoid
>> the implementation specific state for vendor specific hardware.
>>
>> But for standard devices, we don't want to end up with migration drivers. (Do
>> we want vendors to ship vendor-specific migration drivers for NVMe
>> devices?)
>>
>>
>>> I recognize that opaque device state poses a risk to migration
>>> compatibility, because device implementors may arbitrarily use opaque
>>> state when a standard is available.
>>>
>>> However, the way to avoid this scenario is by:
>>>
>>> 1. Making the standard migration approach the easiest to implement
>>>      because everything has been taken care of. It will save implementors
>>>      the headache of defining and coding their own device state
>>>      representations and versioning.
>>>
>>> 2. Educate users about migration compatibility so they can identify
>>>      when implementors are locking in their users.
>>
>> For vendor specific device, this may work. But for standard devices like
>> virtio, we should go further.
>>
>> The device states should be defined in the spec clearly. We should re-visit
>> the design if those states contain anything that is implementation
>> specific.
> Can you describe how migrating virtiofs devices should work?


I need to learn more about virtio-fs before answering this question.

Actually, it would be faster if I could see a prototype of the migration
support for virtio-fs and start from there (as I've suggested in
another thread).


>   I think
> that might be quicker than if I reply to each of your points because our
> views are still quite far apart.


Yes, it would be quicker if we could start from a prototype.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-22 13:08                                                 ` Jason Wang
@ 2021-07-26 15:07                                                   ` Stefan Hajnoczi
  2021-07-27  7:43                                                     ` Max Reitz
  2021-08-03  6:33                                                     ` Jason Wang
  0 siblings, 2 replies; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-07-26 15:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz

[-- Attachment #1: Type: text/plain, Size: 4122 bytes --]

On Thu, Jul 22, 2021 at 09:08:58PM +0800, Jason Wang wrote:
> 
> On 2021/7/22 6:24 PM, Stefan Hajnoczi wrote:
> > On Thu, Jul 22, 2021 at 03:33:10PM +0800, Jason Wang wrote:
> > > On 2021/7/21 6:20 PM, Stefan Hajnoczi wrote:
> > > > On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
> > > > > On 2021/7/20 4:50 PM, Stefan Hajnoczi wrote:
> > > > I recognize that opaque device state poses a risk to migration
> > > > compatibility, because device implementors may arbitrarily use opaque
> > > > state when a standard is available.
> > > > 
> > > > However, the way to avoid this scenario is by:
> > > > 
> > > > 1. Making the standard migration approach the easiest to implement
> > > >      because everything has been taken care of. It will save implementors
> > > >      the headache of defining and coding their own device state
> > > >      representations and versioning.
> > > > 
> > > > 2. Educate users about migration compatibility so they can identify
> > > >      implementors are locking in their users.
> > > 
> > > For vendor specific device, this may work. But for standard devices like
> > > virtio, we should go further.
> > > 
> > > The device states should be defined in the spec clearly. We should re-visit
> > > the design if those states contains anything that is implementation
> > > specific.
> > Can you describe how migrating virtiofs devices should work?
> 
> 
> I need to learn more virtio-fs before answering this question.
> 
> Actually, it would be faster if I can see a prototype of the migration
> support for virtio-fs and start from there (as I've suggested this in
> another thread).
> 
> 
> >   I think
> > that might be quicker than if I reply to each of your points because our
> > views are still quite far apart.
> 
> 
> Yes, it would be quicker if we can start from a prototype.

I have CCed Max Reitz to check whether a prototype of virtiofs migration
might be available soon?

But I can describe the key state that needs to be migrated:

- FUSE nodeid -> host inode mappings. The driver uses nodeid numbers in
  the FUSE protocol and the device maps them to actual inodes on the
  passthrough file system.
- FUSE fh -> open fd mappings. The driver uses fh numbers in the FUSE
  protocol and the device maps them to actual file descriptors on the
  host.
- FUSE fh -> open dir fd mappings. The driver uses fh numbers in the
  FUSE protocol and the device maps them to actual O_DIRECTORY file
  descriptors on the host.

The driver expects to be able to continue using nodeid and fh numbers
across migration. Let's look at just the open fds for a moment:

The OPEN command opens the file for a given nodeid and returns its fh.
Due to POSIX file system semantics there is no reliable way to reopen
the same file from just the filename. The problem is that a file can be
renamed or deleted (but still accessible until the last fd is closed).

Linux file handles (open_by_handle_at(2) and name_to_handle_at(2)) make
it possible to reopen the exact same file using a struct file_handle
instead of a filename. So the virtiofs device could transfer the Linux
file handles to the destination, where the fh -> open fd mappings can be 
restored.

The problem is that Linux file handles are an implementation-specific
solution to this problem. On non-Linux hosts there may be other
solutions that userspace file systems use to solve this problem. Or a
virtiofs device may not implement a passthrough host file system and
have a completely different concept of what an inode is.

This means only a subset of virtiofs implementations can use Linux file
handles as part of their device state. There is no way for the driver or
device to recreate or restore the necessary information without
implementation-specific device state like Linux file handles, though.

I guess this is just a summary of what we've already discussed and not
new information. I think an implementation today would use DBus VMState
to transfer implementation-specific device state (an opaque blob).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-26 15:07                                                   ` Stefan Hajnoczi
@ 2021-07-27  7:43                                                     ` Max Reitz
  2021-08-03  6:33                                                     ` Jason Wang
  1 sibling, 0 replies; 115+ messages in thread
From: Max Reitz @ 2021-07-27  7:43 UTC (permalink / raw)
  To: Stefan Hajnoczi, Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic

On 26.07.21 17:07, Stefan Hajnoczi wrote:
> On Thu, Jul 22, 2021 at 09:08:58PM +0800, Jason Wang wrote:
>> On 2021/7/22 6:24 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jul 22, 2021 at 03:33:10PM +0800, Jason Wang wrote:
>>>> On 2021/7/21 6:20 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/20 4:50 PM, Stefan Hajnoczi wrote:
>>>>> I recognize that opaque device state poses a risk to migration
>>>>> compatibility, because device implementors may arbitrarily use opaque
>>>>> state when a standard is available.
>>>>>
>>>>> However, the way to avoid this scenario is by:
>>>>>
>>>>> 1. Making the standard migration approach the easiest to implement
>>>>>       because everything has been taken care of. It will save implementors
>>>>>       the headache of defining and coding their own device state
>>>>>       representations and versioning.
>>>>>
>>>>> 2. Educate users about migration compatibility so they can identify
>>>>>       implementors are locking in their users.
>>>> For vendor specific device, this may work. But for standard devices like
>>>> virtio, we should go further.
>>>>
>>>> The device states should be defined in the spec clearly. We should re-visit
>>>> the design if those states contains anything that is implementation
>>>> specific.
>>> Can you describe how migrating virtiofs devices should work?
>>
>> I need to learn more virtio-fs before answering this question.
>>
>> Actually, it would be faster if I can see a prototype of the migration
>> support for virtio-fs and start from there (as I've suggested this in
>> another thread).
>>
>>
>>>    I think
>>> that might be quicker than if I reply to each of your points because our
>>> views are still quite far apart.
>>
>> Yes, it would be quicker if we can start from a prototype.
> I have CCed Max Reitz to check whether a prototype of virtiofs migration
> might be available soon?

No, I don’t think so.

I hope I can start looking into it soon (where "soon" already means 
one or two months), but I can make absolutely no predictions on when 
something usable even as just a prototype might come out of it.  (And 
that involves so many conditionals that I don’t think there’s going to 
be one soon.)

> But I can describe the key state that needs to be migrated:
>
> - FUSE nodeid -> host inode mappings. The driver uses nodeid numbers in
>    the FUSE protocol and the device maps them to actual inodes on the
>    passthrough file system.
> - FUSE fh -> open fd mappings. The driver uses fh numbers in the FUSE
>    protocol and the device maps them to actual file descriptors on the
>    host.
> - FUSE fh -> open dir fd mappings. The driver uses fh numbers in the
>    FUSE protocol and the device maps them to actual O_DIRECTORY file
>    descriptors on the host.
>
> The driver expects to be able to continue using nodeid and fh numbers
> across migration. Let's look at just the open fds for a moment:
>
> The OPEN command opens the file for a given nodeid and returns its fh.
> Due to POSIX file system semantics there is no reliable way to reopen
> the same file from just the filename. The problem is that a file can be
> renamed or deleted (but still accessible until the last fd is closed).
>
> Linux file handles (open_by_handle_at(2) and name_to_handle_at(2)) make
> it possible to reopen the exact same file using a struct file_handle
> instead of a filename. So the virtiofs device could transfer the Linux
> file handles to the destination where the fd -> open fd mappings can be
> restored.
>
> The problem is that Linux file handles are an implementation-specific
> solution to this problem. On non-Linux hosts there may be other
> solutions that userspace file systems use to solve this problem. Or a
> virtiofs device may not implement a passthrough host file system and
> have a completely different concept of what an inode is.
>
> This means only a subset of virtiofs implementations can use Linux file
> handles as part of their device state. There is no way for the driver or
> device to recreate or restore the necessary information without
> implementation-specific device state like Linux file handles, though.

(For the record, I agree with this summary.)

Max

> I guess this is just a summary of what we've already discussed and not
> new information. I think an implementation today would use DBus VMState
> to transfer implementation-specific device state (an opaque blob).
>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-07-26 15:07                                                   ` Stefan Hajnoczi
  2021-07-27  7:43                                                     ` Max Reitz
@ 2021-08-03  6:33                                                     ` Jason Wang
  2021-08-03 10:37                                                       ` Stefan Hajnoczi
  1 sibling, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-08-03  6:33 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


On 2021/7/26 11:07 PM, Stefan Hajnoczi wrote:
> On Thu, Jul 22, 2021 at 09:08:58PM +0800, Jason Wang wrote:
>> On 2021/7/22 6:24 PM, Stefan Hajnoczi wrote:
>>> On Thu, Jul 22, 2021 at 03:33:10PM +0800, Jason Wang wrote:
>>>> On 2021/7/21 6:20 PM, Stefan Hajnoczi wrote:
>>>>> On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/20 4:50 PM, Stefan Hajnoczi wrote:
>>>>> I recognize that opaque device state poses a risk to migration
>>>>> compatibility, because device implementors may arbitrarily use opaque
>>>>> state when a standard is available.
>>>>>
>>>>> However, the way to avoid this scenario is by:
>>>>>
>>>>> 1. Making the standard migration approach the easiest to implement
>>>>>       because everything has been taken care of. It will save implementors
>>>>>       the headache of defining and coding their own device state
>>>>>       representations and versioning.
>>>>>
>>>>> 2. Educate users about migration compatibility so they can identify
>>>>>       implementors are locking in their users.
>>>> For vendor specific device, this may work. But for standard devices like
>>>> virtio, we should go further.
>>>>
>>>> The device states should be defined in the spec clearly. We should re-visit
>>>> the design if those states contains anything that is implementation
>>>> specific.
>>> Can you describe how migrating virtiofs devices should work?
>>
>> I need to learn more virtio-fs before answering this question.
>>
>> Actually, it would be faster if I can see a prototype of the migration
>> support for virtio-fs and start from there (as I've suggested this in
>> another thread).
>>
>>
>>>    I think
>>> that might be quicker than if I reply to each of your points because our
>>> views are still quite far apart.
>>
>> Yes, it would be quicker if we can start from a prototype.
> I have CCed Max Reitz to check whether a prototype of virtiofs migration
> might be available soon?
>
> But I can describe the key state that needs to be migrated:
>
> - FUSE nodeid -> host inode mappings. The driver uses nodeid numbers in
>    the FUSE protocol and the device maps them to actual inodes on the
>    passthrough file system.
> - FUSE fh -> open fd mappings. The driver uses fh numbers in the FUSE
>    protocol and the device maps them to actual file descriptors on the
>    host.
> - FUSE fh -> open dir fd mappings. The driver uses fh numbers in the
>    FUSE protocol and the device maps them to actual O_DIRECTORY file
>    descriptors on the host.
>
> The driver expects to be able to continue using nodeid and fh numbers
> across migration. Let's look at just the open fds for a moment:
>
> The OPEN command opens the file for a given nodeid and returns its fh.
> Due to POSIX file system semantics there is no reliable way to reopen
> the same file from just the filename. The problem is that a file can be
> renamed or deleted (but still accessible until the last fd is closed).
>
> Linux file handles (open_by_handle_at(2) and name_to_handle_at(2)) make
> it possible to reopen the exact same file using a struct file_handle
> instead of a filename. So the virtiofs device could transfer the Linux
> file handles to the destination where the fd -> open fd mappings can be
> restored.
>
> The problem is that Linux file handles are an implementation-specific
> solution to this problem.


Yes, according to the manpage it is not part of the uABI, so it's not 
guaranteed to work on the destination, if I understand correctly.



> On non-Linux hosts there may be other
> solutions that userspace file systems use to solve this problem. Or a
> virtiofs device may not implement a passthrough host file system and
> have a completely different concept of what an inode is.


The situation is somewhat similar to device pass-through, which makes it 
very hard to have a general way to migrate.


>
> This means only a subset of virtiofs implementations can use Linux file
> handles as part of their device state. There is no way for the driver or
> device to recreate or restore the necessary information without
> implementation-specific device state like Linux file handles, though.


So my understanding is that even the Linux file handle is not a general 
solution:

- it is not part of the uABI (not guaranteed to work on the destination)
- it depends on the kernel version and a specific Kconfig option (CONFIG_FHANDLE)


>
> I guess this is just a summary of what we've already discussed and not
> new information. I think an implementation today would use DBus VMState
> to transfer implementation-specific device state (an opaque blob).


Instead of trying to migrate that opaque state, which is kind of 
tricky, I wonder if we can avoid it by recording the mapping in the 
shared filesystem itself.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-03  6:33                                                     ` Jason Wang
@ 2021-08-03 10:37                                                       ` Stefan Hajnoczi
  2021-08-03 11:42                                                         ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-08-03 10:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz

[-- Attachment #1: Type: text/plain, Size: 1598 bytes --]

On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> On 2021/7/26 11:07 PM, Stefan Hajnoczi wrote:
> > I guess this is just a summary of what we've already discussed and not
> > new information. I think an implementation today would use DBus VMState
> > to transfer implementation-specific device state (an opaque blob).
> 
> 
> Instead of trying to migrate those opaque stuffs which is kind of tricky, I
> wonder if we can avoid them by recording the mapping in the shared
> filesystem itself.

The problem is that virtiofsd has no way of reopening the exact same
files without Linux file handles. So they need to be transferred to the
destination (or stored on a shared file system as you suggested),
regardless of whether they are part of the VIRTIO spec's device state or
not.

Implementation-specific state can be considered outside the scope of the
VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
device state that save/load operate on. This does not solve the problem,
it just shifts the responsibility to the virtualization stack to migrate
this state.

The Linux file handles or other virtiofsd implementation-specific state
would be migrated separately (e.g. using DBus VMstate) so that by the
time the destination device does a VIRTIO load operation, it has the
necessary implementation-specific state ready.

I prefer to support in-band migration of implementation-specific state
because it's less complex to have a single device state instead of
splitting it.

Is this the direction you were thinking in?

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-03 10:37                                                       ` Stefan Hajnoczi
@ 2021-08-03 11:42                                                         ` Jason Wang
  2021-08-03 12:22                                                           ` Dr. David Alan Gilbert
  2021-08-04  8:36                                                           ` Stefan Hajnoczi
  0 siblings, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-08-03 11:42 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


On 2021/8/3 6:37 PM, Stefan Hajnoczi wrote:
> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>> On 2021/7/26 11:07 PM, Stefan Hajnoczi wrote:
>>> I guess this is just a summary of what we've already discussed and not
>>> new information. I think an implementation today would use DBus VMState
>>> to transfer implementation-specific device state (an opaque blob).
>>
>> Instead of trying to migrate those opaque stuffs which is kind of tricky, I
>> wonder if we can avoid them by recording the mapping in the shared
>> filesystem itself.
> The problem is that virtiofsd has no way of reopening the exact same
> files without Linux file handles.


I believe that if we want to support live migration of the passthrough 
filesystem, the filesystem itself must be shared (like NFS)?

Assuming this is true, can we store those mappings (e.g. FUSE inode -> 
host inode) in a known path/file in the passthrough filesystem itself 
and hide that file from the guest?

The destination can simply open this hidden file, look up the mapping, 
and reopen the files if necessary.

Then we don't need the Linux file handle.


>   So they need to be transferred to the
> destination (or stored on a shared file system as you suggested),
> regardless of whether they are part of the VIRTIO spec's device state or
> not.
>
> Implementation-specific state can be considered outside the scope of the
> VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
> device state that save/load operate on. This does not solve the problem,
> it just shifts the responsibility to the virtualization stack to migrate
> this state.
>
> The Linux file handles or other virtiofsd implementation-specific state
> would be migrated separately (e.g. using DBus VMstate) so that by the
> time the destination device does a VIRTIO load operation, it has the
> necessary implementation-specific state ready.


That may work, but I want to get rid of implementation-specific state 
like Linux file handles completely.


>
> I prefer to support in-band migration of implementation-specific state
> because it's less complex to have a single device state instead of
> splitting it.


I wonder how to deal with migration compatibility in this case.


>
> Is this the direction you were thinking in?


Somehow.

Thanks


>
> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-03 11:42                                                         ` Jason Wang
@ 2021-08-03 12:22                                                           ` Dr. David Alan Gilbert
  2021-08-04  1:42                                                             ` Jason Wang
  2021-08-04  8:38                                                             ` Stefan Hajnoczi
  2021-08-04  8:36                                                           ` Stefan Hajnoczi
  1 sibling, 2 replies; 115+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-03 12:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz

* Jason Wang (jasowang@redhat.com) wrote:
> 
> On 2021/8/3 6:37 PM, Stefan Hajnoczi wrote:
> > On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> > > On 2021/7/26 11:07 PM, Stefan Hajnoczi wrote:
> > > > I guess this is just a summary of what we've already discussed and not
> > > > new information. I think an implementation today would use DBus VMState
> > > > to transfer implementation-specific device state (an opaque blob).
> > > 
> > > Instead of trying to migrate those opaque stuffs which is kind of tricky, I
> > > wonder if we can avoid them by recording the mapping in the shared
> > > filesystem itself.
> > The problem is that virtiofsd has no way of reopening the exact same
> > files without Linux file handles.
> 
> 
> I believe if we want to support live migration of the passthrough
> filesystem. The filesystem itself must be shared? (like NFS)
> 
> Assuming this is true. Can we store those mapping (e.g fuse inode -> host
> inode) in a known path/file in the passthrough filesystem itself and hide
> that file from the guest?

That's pretty dangerous; it assumes that the filesystem is only used
together with virtiofs; as a *shared* filesystem it's possible that it's
being used directly by normal NFS clients as well.
It's also very racy; trying to make sure those mappings reflect the
*current* meaning of inodes even while they're changing under your feet
is non-trivial.

> The destination can simply open this unkown file and do the lookup the
> mapping and reopen the file if necessary.
> 
> Then we don't need the Linux file handle.
> 
> 
> >   So they need to be transferred to the
> > destination (or stored on a shared file system as you suggested),
> > regardless of whether they are part of the VIRTIO spec's device state or
> > not.
> > 
> > Implementation-specific state can be considered outside the scope of the
> > VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
> > device state that save/load operate on. This does not solve the problem,
> > it just shifts the responsibility to the virtualization stack to migrate
> > this state.
> > 
> > The Linux file handles or other virtiofsd implementation-specific state
> > would be migrated separately (e.g. using DBus VMstate) so that by the
> > time the destination device does a VIRTIO load operation, it has the
> > necessary implementation-specific state ready.
> 
> 
> That may work but I want to get rid of the implementation specific stuffs
> like linux handles completely.

I'm not sure how much implementation-specific state you can get rid of;
but you should be able to compartmentalise it, and you should be able to
make it so that common things can be shared; i.e. if I have two
implementations of virtiofs, both running on Linux, then it would be
good if we could live migrate between them, and standardise the format.

So, I'd expect the core virtiofs data to be standardised globally, then
I'd expect how Linux implementations work to be standardised.

Dave

> 
> > 
> > I prefer to support in-band migration of implementation-specific state
> > because it's less complex to have a single device state instead of
> > splitting it.
> 
> 
> I wonder how to deal with migration compatibility in this case.
> 
> 
> > 
> > Is this the direction you were thinking in?
> 
> 
> Somehow.
> 
> Thanks
> 
> 
> > 
> > Stefan
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-03 12:22                                                           ` Dr. David Alan Gilbert
@ 2021-08-04  1:42                                                             ` Jason Wang
  2021-08-04  9:07                                                               ` Dr. David Alan Gilbert
  2021-08-04  9:20                                                               ` Stefan Hajnoczi
  2021-08-04  8:38                                                             ` Stefan Hajnoczi
  1 sibling, 2 replies; 115+ messages in thread
From: Jason Wang @ 2021-08-04  1:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


On 2021/8/3 8:22 PM, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasowang@redhat.com) wrote:
>> On 2021/8/3 6:37 PM, Stefan Hajnoczi wrote:
>>> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>>>> On 2021/7/26 11:07 PM, Stefan Hajnoczi wrote:
>>>>> I guess this is just a summary of what we've already discussed and not
>>>>> new information. I think an implementation today would use DBus VMState
>>>>> to transfer implementation-specific device state (an opaque blob).
>>>> Instead of trying to migrate those opaque stuffs which is kind of tricky, I
>>>> wonder if we can avoid them by recording the mapping in the shared
>>>> filesystem itself.
>>> The problem is that virtiofsd has no way of reopening the exact same
>>> files without Linux file handles.
>>
>> I believe if we want to support live migration of the passthrough
>> filesystem. The filesystem itself must be shared? (like NFS)
>>
>> Assuming this is true. Can we store those mapping (e.g fuse inode -> host
>> inode) in a known path/file in the passthrough filesystem itself and hide
>> that file from the guest?
> That's pretty dangerous; it assumes that the filesystem is only used
> together with virtiofs; as a *shared* filesystem it's possible that it's
> being used directly by normal NFS clients as well.
> It's also very racy; trying to make sure those mappings reflect the
> *current* meaning of inodes even while they're changing under your feet
> is non-trivial.


Right, it's just a thought on how to avoid migrating 
implementation-specific state.


>
>> The destination can simply open this unkown file and do the lookup the
>> mapping and reopen the file if necessary.
>>
>> Then we don't need the Linux file handle.
>>
>>
>>>    So they need to be transferred to the
>>> destination (or stored on a shared file system as you suggested),
>>> regardless of whether they are part of the VIRTIO spec's device state or
>>> not.
>>>
>>> Implementation-specific state can be considered outside the scope of the
>>> VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
>>> device state that save/load operate on. This does not solve the problem,
>>> it just shifts the responsibility to the virtualization stack to migrate
>>> this state.
>>>
>>> The Linux file handles or other virtiofsd implementation-specific state
>>> would be migrated separately (e.g. using DBus VMstate) so that by the
>>> time the destination device does a VIRTIO load operation, it has the
>>> necessary implementation-specific state ready.
>>
>> That may work but I want to get rid of the implementation specific stuffs
>> like linux handles completely.
> I'm not sure how much implementation specific you can get rid of; but
> you should be able to comparmentalise it, and you should be able to make
> it so that common things can be shared;


Yes, that's the way we need to go.


>   i.e. if I have two
> implementations of virtiofs, both running on Linux, then it might be
> good if we can live migrate between them, and standardise the format.


As I replied to the previous version, I'm not sure how hard that is, 
considering the file_handle mentioned by Stefan is not part of the uABI 
and depends on a specific kernel config to work.


>
> So, I'd expect the core virtiofs data to be standardised globally,


Yes, maybe start at the FUSE level.


>   then
> I'd expect how Linux implementations work to be standardised.


Does that mean we need to:

1) port virtiofsd to multiple platforms
2) only support live migration among virtiofsd implementations

?

Thanks


>
> Dave
>
>>> I prefer to support in-band migration of implementation-specific state
>>> because it's less complex to have a single device state instead of
>>> splitting it.
>>
>> I wonder how to deal with migration compatibility in this case.
>>
>>
>>> Is this the direction you were thinking in?
>>
>> Somehow.
>>
>> Thanks
>>
>>
>>> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-03 11:42                                                         ` Jason Wang
  2021-08-03 12:22                                                           ` Dr. David Alan Gilbert
@ 2021-08-04  8:36                                                           ` Stefan Hajnoczi
  2021-08-05  6:35                                                             ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-08-04  8:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz

[-- Attachment #1: Type: text/plain, Size: 1273 bytes --]

On Tue, Aug 03, 2021 at 07:42:31PM +0800, Jason Wang wrote:
> 
> On 2021/8/3 6:37 PM, Stefan Hajnoczi wrote:
> > On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> > > On 2021/7/26 11:07 PM, Stefan Hajnoczi wrote:
> > > > I guess this is just a summary of what we've already discussed and not
> > > > new information. I think an implementation today would use DBus VMState
> > > > to transfer implementation-specific device state (an opaque blob).
> > > 
> > > Instead of trying to migrate those opaque stuffs which is kind of tricky, I
> > > wonder if we can avoid them by recording the mapping in the shared
> > > filesystem itself.
> > The problem is that virtiofsd has no way of reopening the exact same
> > files without Linux file handles.
> 
> 
> I believe if we want to support live migration of the passthrough
> filesystem. The filesystem itself must be shared? (like NFS)

The virtiofs device is not tied to any particular file system backend.
The file system could be shared (available from both the source and
destination) or local. It might be a passthrough file system or
something else (an in-memory file system similar to Linux tmpfs, a
non-POSIX network storage like a REST object storage API, etc).

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-03 12:22                                                           ` Dr. David Alan Gilbert
  2021-08-04  1:42                                                             ` Jason Wang
@ 2021-08-04  8:38                                                             ` Stefan Hajnoczi
  1 sibling, 0 replies; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-08-04  8:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Jason Wang, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


On Tue, Aug 03, 2021 at 01:22:09PM +0100, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasowang@redhat.com) wrote:
> > 
> > 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
> > > On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> > > > 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
> > > > > I guess this is just a summary of what we've already discussed and not
> > > > > new information. I think an implementation today would use DBus VMState
> > > > > to transfer implementation-specific device state (an opaque blob).
> > > > 
> > > > Instead of trying to migrate those opaque stuffs which is kind of tricky, I
> > > > wonder if we can avoid them by recording the mapping in the shared
> > > > filesystem itself.
> > > The problem is that virtiofsd has no way of reopening the exact same
> > > files without Linux file handles.
> > 
> > 
> > I believe if we want to support live migration of the passthrough
> > filesystem. The filesystem itself must be shared? (like NFS)
> > 
> > Assuming this is true. Can we store those mapping (e.g fuse inode -> host
> > inode) in a known path/file in the passthrough filesystem itself and hide
> > that file from the guest?
> 
> That's pretty dangerous; it assumes that the filesystem is only used
> together with virtiofs; as a *shared* filesystem it's possible that it's
> being used directly by normal NFS clients as well.
> It's also very racy; trying to make sure those mappings reflect the
> *current* meaning of inodes even while they're changing under your feet
> is non-trivial.

Right, it's impossible to guarantee that you will reopen or even be able
to find the same inode for the reasons I've mentioned (deleting and
renaming files) plus more (inode number reuse).
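The failure mode described above can be sketched in a few lines of Python (a toy model, not virtiofsd code): saving a path plus inode number is not a stable identity, because a rename or delete invalidates the saved path before the destination gets a chance to reopen the file.

```python
import os
import tempfile

def remember(path):
    """What a naive save might record: the path plus its inode identity."""
    st = os.stat(path)
    return {"path": path, "ino": st.st_ino, "dev": st.st_dev}

def reopen(saved):
    """Try to reopen by the saved path; fail if the identity no longer matches."""
    try:
        st = os.stat(saved["path"])
    except FileNotFoundError:
        return None  # the path is gone: a delete or rename broke the mapping
    if (st.st_dev, st.st_ino) != (saved["dev"], saved["ino"]):
        return None  # the path now names a different file (inode reuse)
    return open(saved["path"], "rb")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("hello")
    saved = remember(path)
    os.rename(path, os.path.join(tmp, "renamed.txt"))  # guest renames the file
    assert reopen(saved) is None  # reopening by the saved path now fails
```

Linux file handles (`name_to_handle_at(2)`/`open_by_handle_at(2)`) avoid this by identifying the file independently of its path, which is why they come up as a hard requirement for correct reopening.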

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-04  1:42                                                             ` Jason Wang
@ 2021-08-04  9:07                                                               ` Dr. David Alan Gilbert
  2021-08-05  6:38                                                                 ` Jason Wang
  2021-08-04  9:20                                                               ` Stefan Hajnoczi
  1 sibling, 1 reply; 115+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-04  9:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz

* Jason Wang (jasowang@redhat.com) wrote:
> 
> 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
> > * Jason Wang (jasowang@redhat.com) wrote:
> > > 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
> > > > On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> > > > > 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
> > > > > > I guess this is just a summary of what we've already discussed and not
> > > > > > new information. I think an implementation today would use DBus VMState
> > > > > > to transfer implementation-specific device state (an opaque blob).
> > > > > Instead of trying to migrate those opaque stuffs which is kind of tricky, I
> > > > > wonder if we can avoid them by recording the mapping in the shared
> > > > > filesystem itself.
> > > > The problem is that virtiofsd has no way of reopening the exact same
> > > > files without Linux file handles.
> > > 
> > > I believe if we want to support live migration of the passthrough
> > > filesystem. The filesystem itself must be shared? (like NFS)
> > > 
> > > Assuming this is true. Can we store those mapping (e.g fuse inode -> host
> > > inode) in a known path/file in the passthrough filesystem itself and hide
> > > that file from the guest?
> > That's pretty dangerous; it assumes that the filesystem is only used
> > together with virtiofs; as a *shared* filesystem it's possible that it's
> > being used directly by normal NFS clients as well.
> > It's also very racy; trying to make sure those mappings reflect the
> > *current* meaning of inodes even while they're changing under your feet
> > is non-trivial.
> 
> 
> Right, it's just a thought to avoid migrating implementation specific
> stuffs.
> 
> 
> > 
> > > The destination can simply open this unknown file, look up the
> > > mapping and reopen the file if necessary.
> > > 
> > > Then we don't need the Linux file handle.
> > > 
> > > 
> > > >    So they need to be transferred to the
> > > > destination (or stored on a shared file system as you suggested),
> > > > regardless of whether they are part of the VIRTIO spec's device state or
> > > > not.
> > > > 
> > > > Implementation-specific state can be considered outside the scope of the
> > > > VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
> > > > device state that save/load operate on. This does not solve the problem,
> > > > it just shifts the responsibility to the virtualization stack to migrate
> > > > this state.
> > > > 
> > > > The Linux file handles or other virtiofsd implementation-specific state
> > > > would be migrated separately (e.g. using DBus VMstate) so that by the
> > > > time the destination device does a VIRTIO load operation, it has the
> > > > necessary implementation-specific state ready.
> > > 
> > > That may work but I want to get rid of the implementation specific stuffs
> > > like linux handles completely.
> > I'm not sure how much implementation specific you can get rid of; but
> > you should be able to compartmentalise it, and you should be able to make
> > it so that common things can be shared;
> 
> 
> Yes, that's the way we need to go.
> 
> 
> >   i.e. if I have two
> > implementations of virtiofs, both running on Linux, then it might be
> > good if we can live migrate between them, and standardise the format.
> 
> 
> As replied in the previous version, I'm not sure how hard it is, considering the
> file_handle mentioned by Stefan is not a part of the uABI and it depends on
> a specific kernel config to work.
> 
> 
> > 
> > So, I'd expect the core virtiofs data to be standardised globally,
> 
> 
> Yes, maybe start at the FUSE level.
> 
> 
> >   then
> > I'd expect how Linux implementations work to be standardised.
> 
> 
> Does it mean we need:
> 
> 1) port virtiofsd to multiple platforms
> 2) only support live migration among virtiofds
> 
> ?

Not necessarily; I mean that we have layers:
  a) Virtio
  b) Virtio-fs
  c)
    c1) virtio-fs backed by a Linux filesystem
    c2) virtio-fs backed by some object store
    c3) virtio-fs backed by something else

(a) is standardised
The migration data for (b) can be standardised
We can also standardise c1, c2 (not that we've made a c2)
and we could expect migration between different implementations all
that are backed by a Linux filesystem (if that file handle stuff is
portable); but we wouldn't expect a migration between c1 and c2 to work.
(and c2 might get split if there are different types of object store).

So, just because there are different types of backends, doesn't mean we
have to give up standardisation; we just have to acknowledge there's
a range of backends and standardise the bits we can.
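The layering above could be sketched as a container format (entirely hypothetical): the (b) section is standardised for every virtio-fs device, while the (c) section is tagged with its backend type so that a c1 to c2 load fails cleanly instead of misinterpreting opaque state.

```python
import json

# Hypothetical backend tags; not part of any spec.
BACKEND_LINUX_FS = "linux-passthrough"   # c1
BACKEND_OBJECT_STORE = "object-store"    # c2

def save_state(fuse_state, backend, backend_state):
    """Standardised (b) section plus a backend-tagged (c) section."""
    return json.dumps({
        "virtio-fs": fuse_state,          # (b): standardised for all devices
        "backend": backend,               # identifies which (c) format follows
        "backend-state": backend_state,   # (c1)/(c2)/...: per-backend format
    })

def load_state(blob, supported_backend):
    state = json.loads(blob)
    if state["backend"] != supported_backend:
        # a c1 -> c2 migration is not expected to work
        raise ValueError("incompatible backend: %s" % state["backend"])
    return state["virtio-fs"], state["backend-state"]

blob = save_state({"next_fuse_unique": 42}, BACKEND_LINUX_FS, {"handles": []})
fuse_state, _ = load_state(blob, BACKEND_LINUX_FS)
assert fuse_state["next_fuse_unique"] == 42
try:
    load_state(blob, BACKEND_OBJECT_STORE)
except ValueError:
    pass  # a destination with a different backend refuses the state up front
```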

Dave

> 
> 
> > 
> > Dave
> > 
> > > > I prefer to support in-band migration of implementation-specific state
> > > > because it's less complex to have a single device state instead of
> > > > splitting it.
> > > 
> > > I wonder how to deal with migration compatibility in this case.
> > > 
> > > 
> > > > Is this the direction you were thinking in?
> > > 
> > > Somehow.
> > > 
> > > Thanks
> > > 
> > > 
> > > > Stefan
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-04  1:42                                                             ` Jason Wang
  2021-08-04  9:07                                                               ` Dr. David Alan Gilbert
@ 2021-08-04  9:20                                                               ` Stefan Hajnoczi
  2021-08-05  6:45                                                                 ` Jason Wang
  1 sibling, 1 reply; 115+ messages in thread
From: Stefan Hajnoczi @ 2021-08-04  9:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Dr. David Alan Gilbert, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


On Wed, Aug 04, 2021 at 09:42:34AM +0800, Jason Wang wrote:
> 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
> > * Jason Wang (jasowang@redhat.com) wrote:
> > > 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
> > > > On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> > > > > 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
> > I'd expect how Linux implementations work to be standardised.
> 
> 
> Does it mean we need:
> 
> 1) port virtiofsd to multiple platforms

Correct migration requires a non-POSIX mechanism to reopen files (saving
inode numbers as you've suggested isn't enough). If that's unavailable
then it won't be possible to migrate safely.

> 2) only support live migration among virtiofds

We can standardize the device state representation for Linux passthrough
file systems and implement it in QEMU's virtiofsd and virtiofsd-rs.

However, it's technically possible for other virtiofsd implementations
to migrate too and they shouldn't be second-class citizens. QEMU's
virtiofsd isn't special and Linux passthrough file systems aren't
special.

Some device state representations will apply to one specific virtiofs
implementation, so the value of standardizing it beyond choosing a
unique identifier to prevent collisions is questionable. Does the VIRTIO
TC want to spend time reviewing implementation-specific device state
representations?

What I suggest is to allow in-band implementation-specific device state
with a unique identifier that prevents migration between incompatible
implementations. Standardize device state representations that are
actually worth standardizing (like the Linux passthrough file system
where there are multiple implementations): implementors benefit from
using the standard because it saves them time and ensures migration
compatibility.
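The unique-identifier scheme described above can be sketched as follows (the identifier value and framing are invented for illustration): the in-band opaque state is prefixed with the implementation's identifier, and load rejects a blob from an incompatible implementation before interpreting any of it.

```python
import struct

# Hypothetical: each implementation registers a unique identifier for its
# private state format; the value below is made up for illustration.
IMPL_ID = b"virtiofsd-rs/linux-passthrough\x00\x00"  # 32 bytes, zero-padded

def save(impl_state: bytes) -> bytes:
    """Prefix the opaque implementation state with the identifier."""
    assert len(IMPL_ID) == 32
    return IMPL_ID + struct.pack("<I", len(impl_state)) + impl_state

def load(blob: bytes) -> bytes:
    """Reject state from an incompatible implementation before parsing it."""
    ident, rest = blob[:32], blob[32:]
    if ident != IMPL_ID:
        raise ValueError("device state from an incompatible implementation")
    (length,) = struct.unpack("<I", rest[:4])
    return rest[4:4 + length]

blob = save(b"opaque blob")
assert load(blob) == b"opaque blob"
```

This is the minimal version of "a unique identifier that prevents collisions": it does not make the opaque state portable, it only makes incompatibility fail fast.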

Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-04  8:36                                                           ` Stefan Hajnoczi
@ 2021-08-05  6:35                                                             ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-08-05  6:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Michael S. Tsirkin, Eugenio Perez Martin, Dr. David Alan Gilbert,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


在 2021/8/4 下午4:36, Stefan Hajnoczi 写道:
> On Tue, Aug 03, 2021 at 07:42:31PM +0800, Jason Wang wrote:
>> 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
>>> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>>>> 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
>>>>> I guess this is just a summary of what we've already discussed and not
>>>>> new information. I think an implementation today would use DBus VMState
>>>>> to transfer implementation-specific device state (an opaque blob).
>>>> Instead of trying to migrate those opaque stuffs which is kind of tricky, I
>>>> wonder if we can avoid them by recording the mapping in the shared
>>>> filesystem itself.
>>> The problem is that virtiofsd has no way of reopening the exact same
>>> files without Linux file handles.
>>
>> I believe if we want to support live migration of the passthrough
>> filesystem. The filesystem itself must be shared? (like NFS)
> The virtiofs device is not tied to any particular file system backend.
> The file system could be shared (available from both the source and
> destination) or local. It might be a passthrough file system or
> something else (an in-memory file system similar to Linux tmpfs, a
> non-POSIX network storage like a REST object storage API, etc).


Yes, I meant that in the case of a passthrough file system, it must be shared
in order to be migrated.

Or can we live migrate the whole passthrough file system to the
destination as a block device?

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-04  9:07                                                               ` Dr. David Alan Gilbert
@ 2021-08-05  6:38                                                                 ` Jason Wang
  2021-08-05  8:19                                                                   ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-08-05  6:38 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


在 2021/8/4 下午5:07, Dr. David Alan Gilbert 写道:
> * Jason Wang (jasowang@redhat.com) wrote:
>> 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>> 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
>>>>> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>>>>>> 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
>>>>>>> I guess this is just a summary of what we've already discussed and not
>>>>>>> new information. I think an implementation today would use DBus VMState
>>>>>>> to transfer implementation-specific device state (an opaque blob).
>>>>>> Instead of trying to migrate those opaque stuffs which is kind of tricky, I
>>>>>> wonder if we can avoid them by recording the mapping in the shared
>>>>>> filesystem itself.
>>>>> The problem is that virtiofsd has no way of reopening the exact same
>>>>> files without Linux file handles.
>>>> I believe if we want to support live migration of the passthrough
>>>> filesystem. The filesystem itself must be shared? (like NFS)
>>>>
>>>> Assuming this is true. Can we store those mapping (e.g fuse inode -> host
>>>> inode) in a known path/file in the passthrough filesystem itself and hide
>>>> that file from the guest?
>>> That's pretty dangerous; it assumes that the filesystem is only used
>>> together with virtiofs; as a *shared* filesystem it's possible that it's
>>> being used directly by normal NFS clients as well.
>>> It's also very racy; trying to make sure those mappings reflect the
>>> *current* meaning of inodes even while they're changing under your feet
>>> is non-trivial.
>>
>> Right, it's just a thought to avoid migrating implementation specific
>> stuffs.
>>
>>
>>>> The destination can simply open this unknown file, look up the
>>>> mapping and reopen the file if necessary.
>>>>
>>>> Then we don't need the Linux file handle.
>>>>
>>>>
>>>>>     So they need to be transferred to the
>>>>> destination (or stored on a shared file system as you suggested),
>>>>> regardless of whether they are part of the VIRTIO spec's device state or
>>>>> not.
>>>>>
>>>>> Implementation-specific state can be considered outside the scope of the
>>>>> VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
>>>>> device state that save/load operate on. This does not solve the problem,
>>>>> it just shifts the responsibility to the virtualization stack to migrate
>>>>> this state.
>>>>>
>>>>> The Linux file handles or other virtiofsd implementation-specific state
>>>>> would be migrated separately (e.g. using DBus VMstate) so that by the
>>>>> time the destination device does a VIRTIO load operation, it has the
>>>>> necessary implementation-specific state ready.
>>>> That may work but I want to get rid of the implementation specific stuffs
>>>> like linux handles completely.
>>> I'm not sure how much implementation specific you can get rid of; but
>>> you should be able to compartmentalise it, and you should be able to make
>>> it so that common things can be shared;
>>
>> Yes, that's the way we need to go.
>>
>>
>>>    i.e. if I have two
>>> implementations of virtiofs, both running on Linux, then it might be
>>> good if we can live migrate between them, and standardise the format.
>>
>> As replied in the previous version, I'm not sure how hard it is, considering the
>> file_handle mentioned by Stefan is not a part of the uABI and it depends on
>> a specific kernel config to work.
>>
>>
>>> So, I'd expect the core virtiofs data to be standardised globally,
>>
>> Yes, maybe start at the FUSE level.
>>
>>
>>>    then
>>> I'd expect how Linux implementations work to be standardised.
>>
>> Does it mean we need:
>>
>> 1) port virtiofsd to multiple platforms
>> 2) only support live migration among virtiofds
>>
>> ?
> Not necessarily; I mean that we have layers:
>    a) Virtio
>    b) Virtio-fs
>    c)
>      c1) virtio-fs backed by a Linux filesystem
>      c2) virtio-fs backed by some object store
>      c3) virtio-fs backed by something else
>
> (a) is standardised
> The migration data for (b) can be standardised


That would be good.


> We can also standardise c1, c2 (not that we've made a c2)
> and we could expect migration between different implementations all
> that are backed by a Linux filesystem (if that file handle stuff is
> portable); but we wouldn't expect a migration between c1 and c2 to work.
> (and c2 might get split if there are different types of object store).


If I understand this correctly, this requires the management layer to 
know the details of the backend before trying to live migrate the guest. 
Or do we need different feature bits for the above three types of
virtio-fs device?


>
> So, just because there are different types of backends, doesn't mean we
> have to give up standardisation; we just have to acknowledge there's
> a range of backends and standardise the bits we can.


Right.

Thanks


>
> Dave
>
>>
>>> Dave
>>>
>>>>> I prefer to support in-band migration of implementation-specific state
>>>>> because it's less complex to have a single device state instead of
>>>>> splitting it.
>>>> I wonder how to deal with migration compatibility in this case.
>>>>
>>>>
>>>>> Is this the direction you were thinking in?
>>>> Somehow.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-04  9:20                                                               ` Stefan Hajnoczi
@ 2021-08-05  6:45                                                                 ` Jason Wang
  0 siblings, 0 replies; 115+ messages in thread
From: Jason Wang @ 2021-08-05  6:45 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Dr. David Alan Gilbert, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


在 2021/8/4 下午5:20, Stefan Hajnoczi 写道:
> On Wed, Aug 04, 2021 at 09:42:34AM +0800, Jason Wang wrote:
>> 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>> 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
>>>>> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>>>>>> 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
>>> I'd expect how Linux implementations work to be standardised.
>>
>> Does it mean we need:
>>
>> 1) port virtiofsd to multiple platforms
> Correct migration requires a non-POSIX mechanism to reopen files (saving
> inode numbers as you've suggested isn't enough). If that's unavailable
> then it won't be possible to migrate safely.


Ok.


>
>> 2) only support live migration among virtiofds
> We can standardize the device state representation for Linux passthrough
> file systems and implement it in QEMU's virtiofsd and virtiofsd-rs.
>
> However, it's technically possible for other virtiofsd implementations
> to migrate too and they shouldn't be second-class citizens. QEMU's
> virtiofsd isn't special and Linux passthrough file systems aren't
> special.


So migration compatibility is still a problem for those backends.


>
> Some device state representations will apply to one specific virtiofs
> implementation, so the value of standardizing it beyond choosing a
> unique identifier to prevent collisions is questionable.


As replied in another thread, could we categorize the different types of
backends with different feature bits? Then we can start thinking about how to
standardize the state of each.
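The feature-bit idea might look like this (the bit names are hypothetical; no such bits exist in the spec): each backend category advertises a distinct feature bit, and the management layer refuses a migration unless both sides expose the same backend bit.

```python
# Hypothetical feature bits, one per backend category; these are not
# defined anywhere in the virtio spec.
VIRTIO_FS_F_BACKEND_LINUX_FS = 1 << 0   # c1: Linux passthrough
VIRTIO_FS_F_BACKEND_OBJ_STORE = 1 << 1  # c2: object store
BACKEND_MASK = VIRTIO_FS_F_BACKEND_LINUX_FS | VIRTIO_FS_F_BACKEND_OBJ_STORE

def can_migrate(src_features, dst_features):
    """Management-layer check: destination must offer the same backend bit."""
    src_backend = src_features & BACKEND_MASK
    return bool(src_backend) and (dst_features & BACKEND_MASK) == src_backend

assert can_migrate(VIRTIO_FS_F_BACKEND_LINUX_FS,
                   VIRTIO_FS_F_BACKEND_LINUX_FS)
assert not can_migrate(VIRTIO_FS_F_BACKEND_LINUX_FS,
                       VIRTIO_FS_F_BACKEND_OBJ_STORE)
```

This would let incompatibility be detected before the migration starts, rather than after a load failure on the destination.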


>   Does the VIRTIO
> TC want to spend time reviewing implementation-specific device state
> representations?


If it's implementation specific rather than virtio specific, I guess not. But if
we use feature bits to identify the backend types, do we have a
chance to make it virtio specific instead of implementation specific?


>
> What I suggest is to allow in-band implementation-specific device state
> with a unique identifier that prevents migration between incompatible
> implementations.


Does this mean we can only know it's impossible to migrate after a 
migration failure?


>   Standardize device state representations that are
> actually worth standardizing (like the Linux passthrough file system
> where there are multiple implementations): implementors benefit from
> using the standard because it saves them time and ensures migration
> compatibility.


Yes.

Thanks


>
> Stefan



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-05  6:38                                                                 ` Jason Wang
@ 2021-08-05  8:19                                                                   ` Dr. David Alan Gilbert
  2021-08-06  6:15                                                                     ` Jason Wang
  0 siblings, 1 reply; 115+ messages in thread
From: Dr. David Alan Gilbert @ 2021-08-05  8:19 UTC (permalink / raw)
  To: Jason Wang
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz

* Jason Wang (jasowang@redhat.com) wrote:
> 
> 在 2021/8/4 下午5:07, Dr. David Alan Gilbert 写道:
> > * Jason Wang (jasowang@redhat.com) wrote:
> > > 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
> > > > * Jason Wang (jasowang@redhat.com) wrote:
> > > > > 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
> > > > > > On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
> > > > > > > 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
> > > > > > > > I guess this is just a summary of what we've already discussed and not
> > > > > > > > new information. I think an implementation today would use DBus VMState
> > > > > > > > to transfer implementation-specific device state (an opaque blob).
> > > > > > > Instead of trying to migrate those opaque stuffs which is kind of tricky, I
> > > > > > > wonder if we can avoid them by recording the mapping in the shared
> > > > > > > filesystem itself.
> > > > > > The problem is that virtiofsd has no way of reopening the exact same
> > > > > > files without Linux file handles.
> > > > > I believe if we want to support live migration of the passthrough
> > > > > filesystem. The filesystem itself must be shared? (like NFS)
> > > > > 
> > > > > Assuming this is true. Can we store those mapping (e.g fuse inode -> host
> > > > > inode) in a known path/file in the passthrough filesystem itself and hide
> > > > > that file from the guest?
> > > > That's pretty dangerous; it assumes that the filesystem is only used
> > > > together with virtiofs; as a *shared* filesystem it's possible that it's
> > > > being used directly by normal NFS clients as well.
> > > > It's also very racy; trying to make sure those mappings reflect the
> > > > *current* meaning of inodes even while they're changing under your feet
> > > > is non-trivial.
> > > 
> > > Right, it's just a thought to avoid migrating implementation specific
> > > stuffs.
> > > 
> > > 
> > > > > The destination can simply open this unknown file, look up the
> > > > > mapping and reopen the file if necessary.
> > > > > 
> > > > > Then we don't need the Linux file handle.
> > > > > 
> > > > > 
> > > > > >     So they need to be transferred to the
> > > > > > destination (or stored on a shared file system as you suggested),
> > > > > > regardless of whether they are part of the VIRTIO spec's device state or
> > > > > > not.
> > > > > > 
> > > > > > Implementation-specific state can be considered outside the scope of the
> > > > > > VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
> > > > > > device state that save/load operate on. This does not solve the problem,
> > > > > > it just shifts the responsibility to the virtualization stack to migrate
> > > > > > this state.
> > > > > > 
> > > > > > The Linux file handles or other virtiofsd implementation-specific state
> > > > > > would be migrated separately (e.g. using DBus VMstate) so that by the
> > > > > > time the destination device does a VIRTIO load operation, it has the
> > > > > > necessary implementation-specific state ready.
> > > > > That may work but I want to get rid of the implementation specific stuffs
> > > > > like linux handles completely.
> > > > I'm not sure how much implementation specific you can get rid of; but
> > > > you should be able to compartmentalise it, and you should be able to make
> > > > it so that common things can be shared;
> > > 
> > > Yes, that's the way we need to go.
> > > 
> > > 
> > > >    i.e. if I have two
> > > > implementations of virtiofs, both running on Linux, then it might be
> > > > good if we can live migrate between them, and standardise the format.
> > > 
> > > As replied in the previous version, I'm not sure how hard it is, considering the
> > > file_handle mentioned by Stefan is not a part of the uABI and it depends on
> > > a specific kernel config to work.
> > > 
> > > 
> > > > So, I'd expect the core virtiofs data to be standardised globally,
> > > 
> > > Yes, maybe start at the FUSE level.
> > > 
> > > 
> > > >    then
> > > > I'd expect how Linux implementations work to be standardised.
> > > 
> > > Does it mean we need:
> > > 
> > > 1) port virtiofsd to multiple platforms
> > > 2) only support live migration among virtiofds
> > > 
> > > ?
> > Not necessarily; I mean that we have layers:
> >    a) Virtio
> >    b) Virtio-fs
> >    c)
> >      c1) virtio-fs backed by a Linux filesystem
> >      c2) virtio-fs backed by some object store
> >      c3) virtio-fs backed by something else
> > 
> > (a) is standardised
> > The migration data for (b) can be standardised
> 
> 
> That would be good.
> 
> 
> > We can also standardise c1, c2 (not that we've made a c2)
> > and we could expect migration between different implementations all
> > that are backed by a Linux filesystem (if that file handle stuff is
> > portable); but we wouldn't expect a migration between c1 and c2 to work.
> > (and c2 might get split if there are different types of object store).
> 
> 
> If I understand this correctly, this requires the management layer to know
> the details of the backend before trying to live migrate the guest. Or do we
> need different feature bits for the above three types of the virtio-fs
> device?

Yep, something would need to know the details of the backend; but that's
true of most existing backends anyway; e.g. in virtio-net the management
layer understands the underlying network and what it has to setup on the
destination to ensure the network on both sides looks the same; it's got
different implications but it does still need to know it.

Dave

> 
> > 
> > So, just because there are different types of backends, doesn't mean we
> > have to give up standardisation; we just have to acknowledge there's
> > a range of backends and standardise the bits we can.
> 
> 
> Right.
> 
> Thanks
> 
> 
> > 
> > Dave
> > 
> > > 
> > > > Dave
> > > > 
> > > > > > I prefer to support in-band migration of implementation-specific state
> > > > > > because it's less complex to have a single device state instead of
> > > > > > splitting it.
> > > > > I wonder how to deal with migration compatibility in this case.
> > > > > 
> > > > > 
> > > > > > Is this the direction you were thinking in?
> > > > > Somehow.
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > 
> > > > > > Stefan
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-05  8:19                                                                   ` Dr. David Alan Gilbert
@ 2021-08-06  6:15                                                                     ` Jason Wang
  2021-08-08  9:31                                                                       ` Max Gurtovoy
  0 siblings, 1 reply; 115+ messages in thread
From: Jason Wang @ 2021-08-06  6:15 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Max Gurtovoy, Cornelia Huck,
	Oren Duer, Shahaf Shuler, Parav Pandit, Bodong Wang,
	Alexander Mikheev, Halil Pasic, mreitz


在 2021/8/5 下午4:19, Dr. David Alan Gilbert 写道:
> * Jason Wang (jasowang@redhat.com) wrote:
>> 在 2021/8/4 下午5:07, Dr. David Alan Gilbert 写道:
>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>> 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
>>>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>>>> 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
>>>>>>> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>>>>>>>> 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
>>>>>>>>> I guess this is just a summary of what we've already discussed and not
>>>>>>>>> new information. I think an implementation today would use DBus VMState
>>>>>>>>> to transfer implementation-specific device state (an opaque blob).
>>>>>>>> Instead of trying to migrate that opaque state, which is kind of
>>>>>>>> tricky, I wonder if we can avoid it by recording the mapping in the
>>>>>>>> shared filesystem itself.
>>>>>>> The problem is that virtiofsd has no way of reopening the exact same
>>>>>>> files without Linux file handles.
>>>>>> I believe if we want to support live migration of the passthrough
>>>>>> filesystem, the filesystem itself must be shared (like NFS)?
>>>>>>
>>>>>> Assuming this is true, can we store those mappings (e.g. fuse inode ->
>>>>>> host inode) in a known path/file in the passthrough filesystem itself
>>>>>> and hide that file from the guest?
>>>>> That's pretty dangerous; it assumes that the filesystem is only used
>>>>> together with virtiofs; as a *shared* filesystem it's possible that it's
>>>>> being used directly by normal NFS clients as well.
>>>>> It's also very racy; trying to make sure those mappings reflect the
>>>>> *current* meaning of inodes even while they're changing under your feet
>>>>> is non-trivial.
>>>> Right, it's just a thought to avoid migrating implementation-specific
>>>> stuff.
>>>>
>>>>
>>>>>> The destination can simply open this unknown file, look up the
>>>>>> mapping and reopen the file if necessary.
>>>>>>
>>>>>> Then we don't need the Linux file handle.
>>>>>>
>>>>>>
>>>>>>>      So they need to be transferred to the
>>>>>>> destination (or stored on a shared file system as you suggested),
>>>>>>> regardless of whether they are part of the VIRTIO spec's device state or
>>>>>>> not.
>>>>>>>
>>>>>>> Implementation-specific state can be considered outside the scope of the
>>>>>>> VIRTIO spec. In other words, we could exclude it from the VIRTIO-level
>>>>>>> device state that save/load operate on. This does not solve the problem,
>>>>>>> it just shifts the responsibility to the virtualization stack to migrate
>>>>>>> this state.
>>>>>>>
>>>>>>> The Linux file handles or other virtiofsd implementation-specific state
>>>>>>> would be migrated separately (e.g. using DBus VMstate) so that by the
>>>>>>> time the destination device does a VIRTIO load operation, it has the
>>>>>>> necessary implementation-specific state ready.
>>>>>> That may work but I want to get rid of the implementation-specific
>>>>>> stuff like Linux file handles completely.
>>>>> I'm not sure how much implementation specific you can get rid of; but
>>>>> you should be able to compartmentalise it, and you should be able to make
>>>>> it so that common things can be shared;
>>>> Yes, that's the way we need to go.
>>>>
>>>>
>>>>>     i.e. if I have two
>>>>> implementations of virtiofs, both running on Linux, then it might be
>>>>> good if we can live migrate between them, and standardise the format.
>>>> As replied in the previous version, I'm not sure how hard it is,
>>>> considering the file_handle mentioned by Stefan is not part of the uABI
>>>> and depends on a specific kernel config to work.
>>>>
>>>>
>>>>> So, I'd expect the core virtiofs data to be standardised globally,
>>>> Yes, maybe start at the FUSE level.
>>>>
>>>>
>>>>>     then
>>>>> I'd expect how Linux implementations work to be standardised.
>>>> Does it mean we need:
>>>>
>>>> 1) port virtiofsd to multiple platforms
>>>> 2) only support live migration among virtiofds
>>>>
>>>> ?
>>> Not necessarily; I mean that we have layers:
>>>     a) Virtio
>>>     b) Virtio-fs
>>>     c)
>>>       c1) virtio-fs backed by a Linux filesystem
>>>       c2) virtio-fs backed by some object store
>>>       c3) virtio-fs backed by something else
>>>
>>> (a) is standardised
>>> The migration data for (b) can be standardised
>>
>> That would be good.
>>
>>
>>> We can also standardise c1, c2 (not that we've made a c2)
>>> and we could expect migration between different implementations all
>>> that are backed by a Linux filesystem (if that file handle stuff is
>>> portable); but we wouldn't expect a migration between c1 and c2 to work.
>>> (and c2 might get split if there are different types of object store).
>>
>> If I understand this correctly, this requires the management layer to know
>> the details of the backend before trying to live migrate the guest. Or do we
>> need different feature bits for the above three types of the virtio-fs
>> device?
> Yep, something would need to know the details of the backend; but that's
> true of most existing backends anyway; e.g. in virtio-net the management
> layer understands the underlying network and what it has to setup on the
> destination to ensure the network on both sides looks the same; it's got
> different implications but it does still need to know it.


I think it's different. E.g. in the case of virtio-net, the setup on
the destination doesn't depend on the device state.

Technically, we can even do cross-backend live migration, e.g. from
qemu virtio-net to a vhost-user backend.

Thanks


>
> Dave
>
>>> So, just because there are different types of backends, doesn't mean we
>>> have to give up standardisation; we just have to acknowledge there's
>>> a range of backends and standardise the bits we can.
>>
>> Right.
>>
>> Thanks
>>
>>
>>> Dave
>>>
>>>>> Dave
>>>>>
>>>>>>> I prefer to support in-band migration of implementation-specific state
>>>>>>> because it's less complex to have a single device state instead of
>>>>>>> splitting it.
>>>>>> I wonder how to deal with migration compatibility in this case.
>>>>>>
>>>>>>
>>>>>>> Is this the direction you were thinking in?
>>>>>> Somehow.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>> Stefan


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  2021-08-06  6:15                                                                     ` Jason Wang
@ 2021-08-08  9:31                                                                       ` Max Gurtovoy
  0 siblings, 0 replies; 115+ messages in thread
From: Max Gurtovoy @ 2021-08-08  9:31 UTC (permalink / raw)
  To: Jason Wang, Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, Michael S. Tsirkin, Eugenio Perez Martin,
	virtio-comment, Virtio-Dev, Cornelia Huck, Oren Duer,
	Shahaf Shuler, Parav Pandit, Bodong Wang, Alexander Mikheev,
	Halil Pasic, mreitz


On 8/6/2021 9:15 AM, Jason Wang wrote:
>
> 在 2021/8/5 下午4:19, Dr. David Alan Gilbert 写道:
>> * Jason Wang (jasowang@redhat.com) wrote:
>>> 在 2021/8/4 下午5:07, Dr. David Alan Gilbert 写道:
>>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>>> 在 2021/8/3 下午8:22, Dr. David Alan Gilbert 写道:
>>>>>> * Jason Wang (jasowang@redhat.com) wrote:
>>>>>>> 在 2021/8/3 下午6:37, Stefan Hajnoczi 写道:
>>>>>>>> On Tue, Aug 03, 2021 at 02:33:20PM +0800, Jason Wang wrote:
>>>>>>>>> 在 2021/7/26 下午11:07, Stefan Hajnoczi 写道:
>>>>>>>>>> I guess this is just a summary of what we've already 
>>>>>>>>>> discussed and not
>>>>>>>>>> new information. I think an implementation today would use 
>>>>>>>>>> DBus VMState
>>>>>>>>>> to transfer implementation-specific device state (an opaque 
>>>>>>>>>> blob).
>>>>>>>>> Instead of trying to migrate that opaque state, which is kind
>>>>>>>>> of tricky, I wonder if we can avoid it by recording the mapping
>>>>>>>>> in the shared filesystem itself.
>>>>>>>> The problem is that virtiofsd has no way of reopening the exact 
>>>>>>>> same
>>>>>>>> files without Linux file handles.
>>>>>>> I believe if we want to support live migration of the passthrough
>>>>>>> filesystem, the filesystem itself must be shared (like NFS)?
>>>>>>>
>>>>>>> Assuming this is true, can we store those mappings (e.g. fuse
>>>>>>> inode -> host inode) in a known path/file in the passthrough
>>>>>>> filesystem itself and hide that file from the guest?
>>>>>> That's pretty dangerous; it assumes that the filesystem is only used
>>>>>> together with virtiofs; as a *shared* filesystem it's possible 
>>>>>> that it's
>>>>>> being used directly by normal NFS clients as well.
>>>>>> It's also very racy; trying to make sure those mappings reflect the
>>>>>> *current* meaning of inodes even while they're changing under 
>>>>>> your feet
>>>>>> is non-trivial.
>>>>> Right, it's just a thought to avoid migrating implementation-specific
>>>>> stuff.
>>>>>
>>>>>
>>>>>>> The destination can simply open this unknown file, look up the
>>>>>>> mapping and reopen the file if necessary.
>>>>>>>
>>>>>>> Then we don't need the Linux file handle.
>>>>>>>
>>>>>>>
>>>>>>>>      So they need to be transferred to the
>>>>>>>> destination (or stored on a shared file system as you suggested),
>>>>>>>> regardless of whether they are part of the VIRTIO spec's device 
>>>>>>>> state or
>>>>>>>> not.
>>>>>>>>
>>>>>>>> Implementation-specific state can be considered outside the 
>>>>>>>> scope of the
>>>>>>>> VIRTIO spec. In other words, we could exclude it from the 
>>>>>>>> VIRTIO-level
>>>>>>>> device state that save/load operate on. This does not solve the 
>>>>>>>> problem,
>>>>>>>> it just shifts the responsibility to the virtualization stack 
>>>>>>>> to migrate
>>>>>>>> this state.
>>>>>>>>
>>>>>>>> The Linux file handles or other virtiofsd 
>>>>>>>> implementation-specific state
>>>>>>>> would be migrated separately (e.g. using DBus VMstate) so that 
>>>>>>>> by the
>>>>>>>> time the destination device does a VIRTIO load operation, it 
>>>>>>>> has the
>>>>>>>> necessary implementation-specific state ready.
>>>>>>> That may work but I want to get rid of the implementation-specific
>>>>>>> stuff like Linux file handles completely.
>>>>>> I'm not sure how much implementation specific you can get rid of; 
>>>>>> but
>>>>>> you should be able to compartmentalise it, and you should be able
>>>>>> to make
>>>>>> it so that common things can be shared;
>>>>> Yes, that's the way we need to go.
>>>>>
>>>>>
>>>>>>     i.e. if I have two
>>>>>> implementations of virtiofs, both running on Linux, then it might be
>>>>>> good if we can live migrate between them, and standardise the 
>>>>>> format.
>>>>> As replied in the previous version, I'm not sure how hard it is,
>>>>> considering the file_handle mentioned by Stefan is not part of the
>>>>> uABI and depends on a specific kernel config to work.
>>>>>
>>>>>
>>>>>> So, I'd expect the core virtiofs data to be standardised globally,
>>>>> Yes, maybe start at the FUSE level.
>>>>>
>>>>>
>>>>>>     then
>>>>>> I'd expect how Linux implementations work to be standardised.
>>>>> Does it mean we need:
>>>>>
>>>>> 1) port virtiofsd to multiple platforms
>>>>> 2) only support live migration among virtiofds
>>>>>
>>>>> ?
>>>> Not necessarily; I mean that we have layers:
>>>>     a) Virtio
>>>>     b) Virtio-fs
>>>>     c)
>>>>       c1) virtio-fs backed by a Linux filesystem
>>>>       c2) virtio-fs backed by some object store
>>>>       c3) virtio-fs backed by something else
>>>>
>>>> (a) is standardised
>>>> The migration data for (b) can be standardised
>>>
>>> That would be good.
>>>
>>>
>>>> We can also standardise c1, c2 (not that we've made a c2)
>>>> and we could expect migration between different implementations all
>>>> that are backed by a Linux filesystem (if that file handle stuff is
>>>> portable); but we wouldn't expect a migration between c1 and c2 to 
>>>> work.
>>>> (and c2 might get split if there are different types of object store).
>>>
>>> If I understand this correctly, this requires the management layer 
>>> to know
>>> the details of the backend before trying to live migrate the guest. 
>>> Or do we
>>> need different feature bits for the above three types of the virtio-fs
>>> device?
>> Yep, something would need to know the details of the backend; but that's
>> true of most existing backends anyway; e.g. in virtio-net the management
>> layer understands the underlying network and what it has to setup on the
>> destination to ensure the network on both sides looks the same; it's got
>> different implications but it does still need to know it.
>
>
> I think it's different. E.g. in the case of virtio-net, the setup on
> the destination doesn't depend on the device state.
>
> Technically, we can even do cross-backend live migration, e.g. from
> qemu virtio-net to a vhost-user backend.

virtio-blk also needs the management layer to control backends.

Let's say the storage is backed by a remote NVMe-oF target, so you must
create the same connection on the destination as well.

This should be done by some sys-admin.


>
> Thanks
>
>
>>
>> Dave
>>
>>>> So, just because there are different types of backends, doesn't 
>>>> mean we
>>>> have to give up standardisation; we just have to acknowledge there's
>>>> a range of backends and standardise the bits we can.
>>>
>>> Right.
>>>
>>> Thanks
>>>
>>>
>>>> Dave
>>>>
>>>>>> Dave
>>>>>>
>>>>>>>> I prefer to support in-band migration of 
>>>>>>>> implementation-specific state
>>>>>>>> because it's less complex to have a single device state instead of
>>>>>>>> splitting it.
>>>>>>> I wonder how to deal with migration compatibility in this case.
>>>>>>>
>>>>>>>
>>>>>>>> Is this the direction you were thinking in?
>>>>>>> Somehow.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> Stefan
>

This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 115+ messages in thread

end of thread, other threads:[~2021-08-08  9:31 UTC | newest]

Thread overview: 115+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-06  4:33 [PATCH V2 0/2] Vitqueue State Synchronization Jason Wang
2021-07-06  4:33 ` [PATCH V2 1/2] virtio: introduce virtqueue state as basic facility Jason Wang
2021-07-06  9:32   ` Michael S. Tsirkin
2021-07-06 17:09     ` Eugenio Perez Martin
2021-07-06 19:08       ` Michael S. Tsirkin
2021-07-06 23:49         ` Max Gurtovoy
2021-07-07  2:50           ` Jason Wang
2021-07-07 12:03             ` Max Gurtovoy
2021-07-07 12:11               ` [virtio-comment] " Jason Wang
2021-07-07  2:42         ` Jason Wang
2021-07-07  4:36           ` Jason Wang
2021-07-07  2:41       ` Jason Wang
2021-07-06 12:27   ` [virtio-comment] " Cornelia Huck
2021-07-07  3:29     ` [virtio-dev] " Jason Wang
2021-07-06  4:33 ` [PATCH V2 2/2] virtio: introduce STOP status bit Jason Wang
2021-07-06  9:24   ` [virtio-comment] " Dr. David Alan Gilbert
2021-07-07  3:20     ` Jason Wang
2021-07-09 17:23       ` Eugenio Perez Martin
2021-07-10 20:36         ` Michael S. Tsirkin
2021-07-12  4:00           ` Jason Wang
2021-07-12  9:57             ` Stefan Hajnoczi
2021-07-13  3:27               ` Jason Wang
2021-07-13  8:19                 ` Cornelia Huck
2021-07-13  9:13                   ` Jason Wang
2021-07-13 11:31                     ` Cornelia Huck
2021-07-13 12:23                       ` Jason Wang
2021-07-13 12:28                         ` Cornelia Huck
2021-07-14  2:47                           ` Jason Wang
2021-07-14  6:20                             ` Cornelia Huck
2021-07-14  8:53                               ` Jason Wang
2021-07-14  9:24                                 ` [virtio-dev] " Cornelia Huck
2021-07-15  2:01                                   ` Jason Wang
2021-07-13 10:00                 ` Stefan Hajnoczi
2021-07-13 12:16                   ` Jason Wang
2021-07-14  9:53                     ` Stefan Hajnoczi
2021-07-14 10:29                       ` Jason Wang
2021-07-14 15:07                         ` Stefan Hajnoczi
2021-07-14 16:22                           ` Max Gurtovoy
2021-07-15  1:38                             ` Jason Wang
2021-07-15  9:26                               ` Stefan Hajnoczi
2021-07-16  1:48                                 ` Jason Wang
2021-07-19 12:08                                   ` Stefan Hajnoczi
2021-07-20  2:46                                     ` Jason Wang
2021-07-15 21:18                               ` Michael S. Tsirkin
2021-07-16  2:19                                 ` Jason Wang
2021-07-15  1:35                           ` Jason Wang
2021-07-15  9:16                             ` [virtio-dev] " Stefan Hajnoczi
2021-07-16  1:44                               ` Jason Wang
2021-07-19 12:18                                 ` [virtio-dev] " Stefan Hajnoczi
2021-07-20  2:50                                   ` Jason Wang
2021-07-20 10:31                                 ` Cornelia Huck
2021-07-21  2:59                                   ` Jason Wang
2021-07-15 10:01                             ` Stefan Hajnoczi
2021-07-16  2:03                               ` Jason Wang
2021-07-16  3:53                                 ` Jason Wang
2021-07-19 12:45                                   ` Stefan Hajnoczi
2021-07-20  3:04                                     ` Jason Wang
2021-07-20  8:50                                       ` Stefan Hajnoczi
2021-07-20 10:48                                         ` Cornelia Huck
2021-07-20 12:47                                           ` Stefan Hajnoczi
2021-07-21  2:29                                         ` Jason Wang
2021-07-21 10:20                                           ` Stefan Hajnoczi
2021-07-22  7:33                                             ` Jason Wang
2021-07-22 10:24                                               ` Stefan Hajnoczi
2021-07-22 13:08                                                 ` Jason Wang
2021-07-26 15:07                                                   ` Stefan Hajnoczi
2021-07-27  7:43                                                     ` Max Reitz
2021-08-03  6:33                                                     ` Jason Wang
2021-08-03 10:37                                                       ` Stefan Hajnoczi
2021-08-03 11:42                                                         ` Jason Wang
2021-08-03 12:22                                                           ` Dr. David Alan Gilbert
2021-08-04  1:42                                                             ` Jason Wang
2021-08-04  9:07                                                               ` Dr. David Alan Gilbert
2021-08-05  6:38                                                                 ` Jason Wang
2021-08-05  8:19                                                                   ` Dr. David Alan Gilbert
2021-08-06  6:15                                                                     ` Jason Wang
2021-08-08  9:31                                                                       ` Max Gurtovoy
2021-08-04  9:20                                                               ` Stefan Hajnoczi
2021-08-05  6:45                                                                 ` Jason Wang
2021-08-04  8:38                                                             ` Stefan Hajnoczi
2021-08-04  8:36                                                           ` Stefan Hajnoczi
2021-08-05  6:35                                                             ` Jason Wang
2021-07-19 12:43                                 ` Stefan Hajnoczi
2021-07-20  3:02                                   ` Jason Wang
2021-07-20 10:19                                     ` Stefan Hajnoczi
2021-07-21  2:52                                       ` Jason Wang
2021-07-21 10:42                                         ` Stefan Hajnoczi
2021-07-22  2:08                                           ` Jason Wang
2021-07-22 10:30                                             ` Stefan Hajnoczi
2021-07-20 12:27                                     ` Max Gurtovoy
2021-07-20 12:57                                       ` Stefan Hajnoczi
2021-07-20 13:09                                         ` Max Gurtovoy
2021-07-21  3:06                                           ` Jason Wang
2021-07-21 10:48                                           ` Stefan Hajnoczi
2021-07-21 11:37                                             ` Max Gurtovoy
2021-07-21  3:09                                       ` Jason Wang
2021-07-21 11:43                                         ` Max Gurtovoy
2021-07-22  2:01                                           ` Jason Wang
2021-07-12  3:53         ` Jason Wang
2021-07-06 12:50   ` [virtio-comment] " Cornelia Huck
2021-07-06 13:18     ` Jason Wang
2021-07-06 14:27       ` [virtio-dev] " Cornelia Huck
2021-07-07  0:05         ` Max Gurtovoy
2021-07-07  3:14           ` Jason Wang
2021-07-07  2:56         ` Jason Wang
2021-07-07 16:45           ` [virtio-comment] " Cornelia Huck
2021-07-08  4:06             ` Jason Wang
2021-07-09 17:35   ` Eugenio Perez Martin
2021-07-12  4:06     ` Jason Wang
2021-07-10 20:40   ` Michael S. Tsirkin
2021-07-12  4:04     ` Jason Wang
2021-07-12 10:12 ` [PATCH V2 0/2] Vitqueue State Synchronization Stefan Hajnoczi
2021-07-13  3:08   ` Jason Wang
2021-07-13 10:30     ` Stefan Hajnoczi
2021-07-13 11:56       ` Jason Wang
