* [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration
@ 2023-11-03 10:34 Zhu Lingshan
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state Zhu Lingshan
                   ` (6 more replies)
  0 siblings, 7 replies; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

This series introduces basic facilities to support
virtio live migration, including:

1) a new SUSPEND bit in the device status,
which is used to suspend the device so that the device state
and virtqueue states are stabilized.

2) virtqueue state and its accessors, to get and set last_avail_idx
and last_used_idx of virtqueues.

3) dirty page tracking

Changes from V1:
1) move the vq state definition from content.tex to split-ring.tex/packed-ring.tex (Michael)
2) "the device should ignore resetting vqs when suspended" ==> "the driver should not reset vqs when suspended" (Michael)
3) add the dirty page tracking facility

Zhu Lingshan (6):
  virtio: introduce virtqueue state
  virtio: introduce SUSPEND bit in device status
  virtio: dont reset vqs when SUSPEND
  virtio-pci: implement VIRTIO_F_QUEUE_STATE
  virtio: introduce dirty page tracking facility
  virtio-pci: implement dirty page tracking

 content.tex       | 66 ++++++++++++++++++++++++++++++++--
 packed-ring.tex   | 58 ++++++++++++++++++++++++++++++
 split-ring.tex    | 39 ++++++++++++++++++++
 transport-pci.tex | 90 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 251 insertions(+), 2 deletions(-)

-- 
2.35.3


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
@ 2023-11-03 10:34 ` Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
  2023-11-03 11:52   ` Michael S. Tsirkin
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status Zhu Lingshan
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

This patch adds a new virtqueue facility to save and restore virtqueue
state. The virtqueue state is split into two parts:

- The available state: the state that is used to read the next
  available buffer.
- The used state: the state that is used to mark a buffer as used.

This simplifies the transport-specific implementation (e.g. two
le16 fields could be used instead of a single le32). For split virtqueues, we
only need the available state since the used state is implemented in
the virtqueue itself (the used index). For packed virtqueues, we need
both the available state and the used state.

The typical use cases are live migration and debugging.
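
As an illustration of the packed-ring state format that this patch adds, here is a minimal C sketch that packs and unpacks the 16-bit value. It assumes the usual spec convention that bit-field members are listed from the least significant bit, so the index occupies bits 0..14 and the wrap counter bit 15. The struct and helper names are hypothetical and are not part of the patch.

```c
#include <stdint.h>

/* Hypothetical in-memory view of one packed virtqueue state field
 * (either Available State or Used State). */
struct vq_packed_state {
    uint16_t idx;          /* 15-bit position in the descriptor ring */
    uint8_t  wrap_counter; /* 0 or 1 */
};

/* Pack into the 16-bit layout: bits 0..14 = idx, bit 15 = wrap counter.
 * Assumption: bit-fields are listed least-significant first. */
static inline uint16_t vq_packed_state_encode(struct vq_packed_state s)
{
    return (uint16_t)((s.idx & 0x7fffu) |
                      ((uint16_t)(s.wrap_counter & 1u) << 15));
}

/* Unpack a CPU-order 16-bit value back into its two members. */
static inline struct vq_packed_state vq_packed_state_decode(uint16_t raw)
{
    struct vq_packed_state s;

    s.idx = raw & 0x7fffu;
    s.wrap_counter = (uint8_t)(raw >> 15);
    return s;
}
```

The helpers work on CPU-order values; converting to and from the little-endian wire format is left to the transport accessors.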

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 content.tex     |  7 ++++--
 packed-ring.tex | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
 split-ring.tex  | 39 +++++++++++++++++++++++++++++++++
 3 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/content.tex b/content.tex
index 0a62dce..76813b5 100644
--- a/content.tex
+++ b/content.tex
@@ -99,10 +99,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
 \begin{description}
 \item[0 to 23, and 50 to 127] Feature bits for the specific device type
 
-\item[24 to 41] Feature bits reserved for extensions to the queue and
+\item[24 to 42] Feature bits reserved for extensions to the queue and
   feature negotiation mechanisms
 
-\item[42 to 49, and 128 and above] Feature bits reserved for future extensions.
+\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
 \end{description}
 
 \begin{note}
@@ -872,6 +872,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
 	\ref{devicenormative:Basic Facilities of a Virtio Device / Feature Bits} for
 	handling features reserved for future use.
 
+  \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
+  to access its internal virtqueue state.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
diff --git a/packed-ring.tex b/packed-ring.tex
index 9eeb382..ad6aba0 100644
--- a/packed-ring.tex
+++ b/packed-ring.tex
@@ -729,3 +729,61 @@ \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities o
         process_buffer(d);
 }
 \end{lstlisting}
+
+\subsection{Virtqueue State}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Virtqueue State}
+
+When VIRTIO_F_QUEUE_STATE has been negotiated, the driver can set and
+get the device internal virtqueue state through the following
+fields. The implementation of the interfaces is transport specific.
+
+\subsubsection{\field{Available State} Field}
+
+The available state field is two bytes of virtqueue state that is used by
+the device to read the next available buffer. It is presented in the following format:
+
+\begin{lstlisting}
+le16 {
+  last_avail_idx : 15;
+  last_avail_wrap_counter : 1;
+};
+\end{lstlisting}
+
+The \field{last_avail_idx} field is the free-running location
+where the device reads the next descriptor from the virtqueue descriptor ring.
+
+The \field{last_avail_wrap_counter} field is the last driver ring wrap
+counter that was observed by the device.
+
+See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
+
+\subsubsection{\field{Used State} Field}
+
+The used state field is two bytes of virtqueue state that is used by
+the device when marking a buffer used. It is presented in the following format:
+
+\begin{lstlisting}
+le16 {
+  used_idx : 15;
+  used_wrap_counter : 1;
+};
+\end{lstlisting}
+
+The \field{used_idx} field is the free-running location where the device writes the next
+used descriptor to the descriptor ring.
+
+The \field{used_wrap_counter} field is the wrap counter that is used
+by the device.
+
+See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
+
+\devicenormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Packed Virtqueues/ Virtqueue State}
+
+The device SHOULD only accept setting Virtqueue State of any packed virtqueues when DRIVER_OK is not set in \field{device status}, or SUSPEND is set in \field{device status}.
+Otherwise the device MUST ignore any writes to Virtqueue State of any packed virtqueues.
+
+When SUSPEND is set, the device MUST record the Virtqueue State of every enabled packed virtqueue
+in \field{Available State} field and \field{Used State} field respectively,
+and correspondingly restore the Virtqueue State of every enabled packed virtqueue
+from \field{Available State} field and \field{Used State} field when DRIVER_OK is set.
+
+The device SHOULD reset \field{Available State} field and \field{Used State} field upon a device reset.
diff --git a/split-ring.tex b/split-ring.tex
index de94038..a78b44d 100644
--- a/split-ring.tex
+++ b/split-ring.tex
@@ -734,3 +734,42 @@ \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities o
 }
 \end{lstlisting}
 \end{note}
+
+\subsection{Virtqueue State}\label{sec:Basic Facilities of a Virtio Device / Splited Virtqueues / Virtqueue State}
+
+When VIRTIO_F_QUEUE_STATE has been negotiated, the driver can set and
+get the device internal virtqueue state through the following
+fields. The implementation of the interfaces is transport specific.
+
+\subsubsection{\field{Available State} Field}
+
+The available state field is two bytes of virtqueue state that is used by
+the device to read the next available buffer. It is presented in the following format:
+
+\begin{lstlisting}
+le16 last_avail_idx;
+\end{lstlisting}
+
+The \field{last_avail_idx} field is the free-running available ring
+index where the device will read the next available head of a
+descriptor chain.
+
+See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring}.
+
+\drivernormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Splited Virtqueues/ Virtqueue State}
+
+The driver SHOULD NOT access \field{Used State} of any split virtqueues; it SHOULD use the
+used index in the used ring.
+
+\devicenormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Splited Virtqueues/ Virtqueue State}
+
+The device SHOULD only accept setting Virtqueue State of any split virtqueues
+when DRIVER_OK is not set in \field{device status} or SUSPEND is set in \field{device status}.
+Otherwise the device MUST ignore any writes to Virtqueue State of any split virtqueues.
+
+When SUSPEND is set, the device MUST record the Available State of every enabled split virtqueue
+in \field{Available State} field,
+and correspondingly restore the Available State of every enabled split virtqueue
+from \field{Available State} field when DRIVER_OK is set.
+
+The device SHOULD reset \field{Available State} field upon a device reset.
-- 
2.35.3
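
For the split-ring case, the free-running last_avail_idx described in this patch can be related to ring slots with 16-bit wrapping arithmetic. The following is a minimal sketch only; the helper names are hypothetical, and it assumes a nonzero queue size.

```c
#include <stdint.h>

/* Slot in the available ring that the device reads next.
 * last_avail_idx is free-running, so it is reduced modulo the queue size.
 * Assumption: queue_size is nonzero. */
static inline uint16_t split_next_avail_slot(uint16_t last_avail_idx,
                                             uint16_t queue_size)
{
    return (uint16_t)(last_avail_idx % queue_size);
}

/* Number of buffers the driver has made available that the device has
 * not read yet. Unsigned 16-bit subtraction handles index wrap-around. */
static inline uint16_t split_pending_avail(uint16_t avail_idx,
                                           uint16_t last_avail_idx)
{
    return (uint16_t)(avail_idx - last_avail_idx);
}
```

Because both indices are free-running, restoring last_avail_idx on the destination is enough for the device to pick up exactly where the source stopped.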



* [virtio-comment] [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state Zhu Lingshan
@ 2023-11-03 10:34 ` Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
  2023-11-06  9:43   ` [virtio-comment] " Michael S. Tsirkin
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND Zhu Lingshan
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

This patch introduces a new status bit in the device status: SUSPEND.

This SUSPEND bit can be used by the driver to suspend a device,
in order to stabilize the device states and virtqueue states.

Its main use case is live migration.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 content.tex | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/content.tex b/content.tex
index 76813b5..bcc9d4b 100644
--- a/content.tex
+++ b/content.tex
@@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 
 \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
   an error from which it can't recover.
+
+\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that the
+  device has been suspended by the driver.
+
 \end{description}
 
 The \field{device status} field starts out as 0, and is reinitialized to 0 by
@@ -73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 recover by issuing a reset.
 \end{note}
 
+The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
+
+When setting SUSPEND, the driver MUST re-read \field{device status} to ensure the SUSPEND bit is set.
+
 \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
 
 The device MUST NOT consume buffers or send any used buffer
@@ -82,6 +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
 that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
 MUST send a device configuration change notification to the driver.
 
+The device MUST ignore SUSPEND if FEATURES_OK is not set.
+
+The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
+
+The device SHOULD allow settings to \field{device status} even when SUSPEND is set.
+
+If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD clear SUSPEND
+and resume operation upon DRIVER_OK.
+
+If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND,
+the device SHOULD perform the following actions before presenting SUSPEND bit in the \field{device status}:
+
+\begin{itemize}
+\item Stop consuming buffers of any virtqueues and mark all finished descriptors as used.
+\item Wait until all descriptors that are being processed finish, and mark them as used.
+\item Flush all used buffers and send used buffer notifications to the driver.
+\item Record Virtqueue State of each enabled virtqueue, see section \ref{sec:Virtqueues / Virtqueue State}
+\item Pause its operation except \field{device status} and preserve configurations in its Device Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
+\end{itemize}
+
 \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
 
 Each virtio device offers all the features it understands.  During
@@ -99,10 +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
 \begin{description}
 \item[0 to 23, and 50 to 127] Feature bits for the specific device type
 
-\item[24 to 42] Feature bits reserved for extensions to the queue and
+\item[24 to 43] Feature bits reserved for extensions to the queue and
   feature negotiation mechanisms
 
-\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
+\item[44 to 49, and 128 and above] Feature bits reserved for future extensions.
 \end{description}
 
 \begin{note}
@@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
   \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
   to access its internal virtqueue state.
 
+  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
+   SUSPEND the device.
+   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
-- 
2.35.3
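
The driver-side rule in this patch (set SUSPEND, then re-read the device status until the bit is presented) can be sketched roughly as below. This is an illustration under stated assumptions, not a reference implementation: the status_ops accessors, the polling budget, and the toy device model are all hypothetical, and the SUSPEND value (16) is the one proposed by this patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_SUSPEND     16u /* value proposed by this patch */

/* Hypothetical transport accessors for the device status field. */
struct status_ops {
    uint8_t (*read)(void *dev);
    void (*write)(void *dev, uint8_t status);
};

/* Returns true once the device presents SUSPEND. Returns false if
 * FEATURES_OK is not set or the bit does not appear within max_polls. */
static inline bool virtio_suspend(const struct status_ops *ops, void *dev,
                                  unsigned max_polls)
{
    uint8_t status = ops->read(dev);

    if (!(status & VIRTIO_STATUS_FEATURES_OK))
        return false; /* the driver SHOULD NOT set SUSPEND yet */

    ops->write(dev, (uint8_t)(status | VIRTIO_STATUS_SUSPEND));

    for (unsigned i = 0; i < max_polls; i++) {
        if (ops->read(dev) & VIRTIO_STATUS_SUSPEND)
            return true;
    }
    return false;
}

/* Toy device model: presents SUSPEND only after `delay` further reads,
 * mimicking a device that first drains its in-flight buffers. */
struct toy_dev {
    uint8_t status;
    bool suspend_pending;
    unsigned delay;
};

static inline uint8_t toy_read(void *p)
{
    struct toy_dev *d = p;

    if (d->suspend_pending) {
        if (d->delay == 0) {
            d->status |= VIRTIO_STATUS_SUSPEND;
            d->suspend_pending = false;
        } else {
            d->delay--;
        }
    }
    return d->status;
}

static inline void toy_write(void *p, uint8_t status)
{
    struct toy_dev *d = p;

    if (status & VIRTIO_STATUS_SUSPEND) {
        d->suspend_pending = true;
        status = (uint8_t)(status & ~VIRTIO_STATUS_SUSPEND);
    }
    d->status = status;
}
```

A real driver would likely bound the wait with a timeout rather than a poll count; the count is used here only to keep the sketch self-contained.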



* [virtio-comment] [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state Zhu Lingshan
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status Zhu Lingshan
@ 2023-11-03 10:34 ` Zhu Lingshan
  2023-11-06  9:49   ` [virtio-comment] " Michael S. Tsirkin
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE Zhu Lingshan
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

When SUSPEND is set, device states and virtqueue states
should be stabilized, therefore the driver should not
reset vqs when SUSPEND is set in device status.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
---
 content.tex | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/content.tex b/content.tex
index bcc9d4b..060b5c2 100644
--- a/content.tex
+++ b/content.tex
@@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic Facilities of a Virtio Device /
 The device MUST reset any state of a virtqueue to the default state,
 including the available state and the used state.
 
+If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in \field{device status},
+the driver SHOULD NOT reset any virtqueues.
+
 \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
 
 After the driver tells the device to reset a queue, the driver MUST verify that
-- 
2.35.3



* [virtio-comment] [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
                   ` (2 preceding siblings ...)
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND Zhu Lingshan
@ 2023-11-03 10:34 ` Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
  2023-11-08 17:56   ` Michael S. Tsirkin
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 5/6] virtio: introduce dirty page tracking facility Zhu Lingshan
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

This patch adds two new le16 fields to the common configuration structure
to support VIRTIO_F_QUEUE_STATE in the PCI transport layer.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
---
 transport-pci.tex | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/transport-pci.tex b/transport-pci.tex
index a5c6719..3161519 100644
--- a/transport-pci.tex
+++ b/transport-pci.tex
@@ -325,6 +325,10 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
         /* About the administration virtqueue. */
         le16 admin_queue_index;         /* read-only for driver */
         le16 admin_queue_num;         /* read-only for driver */
+
+	/* Virtqueue state */
+        le16 queue_avail_state;         /* read-write */
+        le16 queue_used_state;          /* read-write */
 };
 \end{lstlisting}
 
@@ -428,6 +432,17 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
 	The value 0 indicates no supported administration virtqueues.
 	This field is valid only if VIRTIO_F_ADMIN_VQ has been
 	negotiated.
+
+\item[\field{queue_avail_state}]
+        This field is valid only if VIRTIO_F_QUEUE_STATE has been
+        negotiated. The driver sets and gets the available state of
+        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
+
+\item[\field{queue_used_state}]
+        This field is valid only if VIRTIO_F_QUEUE_STATE has been
+        negotiated. The driver sets and gets the used state of the
+        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
+
 \end{description}
 
 \devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}
@@ -488,6 +503,9 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
 present either a value of 0 or a power of 2 in
 \field{queue_size}.
 
+If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
+any accesses to \field{queue_avail_state} and \field{queue_used_state}.
+
 If VIRTIO_F_ADMIN_VQ has been negotiated, the value
 \field{admin_queue_index} MUST be equal to, or bigger than
 \field{num_queues}; also, \field{admin_queue_num} MUST be
-- 
2.35.3
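
To make the access conditions concrete: combining this patch with patch 1/6, a write to queue_avail_state or queue_used_state is only expected to take effect when VIRTIO_F_QUEUE_STATE is negotiated and either DRIVER_OK is clear or SUSPEND is set. A small sketch of that check follows; the helper names are hypothetical, and the bit values for the feature (42) and SUSPEND (16) are the ones proposed in this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_QUEUE_STATE     42  /* proposed in patch 1/6 */
#define VIRTIO_STATUS_DRIVER_OK  4u
#define VIRTIO_STATUS_SUSPEND    16u /* proposed in patch 2/6 */

/* True if feature bit `bit` is set in a 64-bit feature word. */
static inline bool virtio_has_feature(uint64_t features, unsigned bit)
{
    return bit < 64 && (features & (UINT64_C(1) << bit)) != 0;
}

/* True if a write to the queue state fields is expected to be accepted:
 * the feature is negotiated and the device is either not yet running
 * (DRIVER_OK clear) or suspended (SUSPEND set). */
static inline bool queue_state_write_allowed(uint64_t negotiated,
                                             uint8_t status)
{
    if (!virtio_has_feature(negotiated, VIRTIO_F_QUEUE_STATE))
        return false;
    return !(status & VIRTIO_STATUS_DRIVER_OK) ||
           (status & VIRTIO_STATUS_SUSPEND);
}
```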



* [virtio-comment] [PATCH V2 5/6] virtio: introduce dirty page tracking facility
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
                   ` (3 preceding siblings ...)
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE Zhu Lingshan
@ 2023-11-03 10:34 ` Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 6/6] virtio-pci: implement dirty page tracking Zhu Lingshan
  2023-11-07  8:01 ` [virtio-comment] Re: [PATCH V2 0/6] introduce basic facilities for virito live migration Michael S. Tsirkin
  6 siblings, 1 reply; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

This commit introduces a new virtio facility to track
device dirty pages. A typical use case is live migration.

The implementation of this facility is transport specific.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 content.tex | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/content.tex b/content.tex
index 060b5c2..eb9274f 100644
--- a/content.tex
+++ b/content.tex
@@ -127,10 +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
 \begin{description}
 \item[0 to 23, and 50 to 127] Feature bits for the specific device type
 
-\item[24 to 43] Feature bits reserved for extensions to the queue and
+\item[24 to 44] Feature bits reserved for extensions to the queue and
   feature negotiation mechanisms
 
-\item[44 to 49, and 128 and above] Feature bits reserved for future extensions.
+\item[45 to 49, and 128 and above] Feature bits reserved for future extensions.
 \end{description}
 
 \begin{note}
@@ -535,6 +535,27 @@ \section{Exporting Objects}\label{sec:Basic Facilities of a Virtio Device / Expo
 
 \input{admin.tex}
 
+\section{Memory Dirty Pages Tracker}\label{sec:Basic Facilities of a Virtio Device / Memory Dirty Pages Tracker}
+
+A "dirty page" refers to a page in memory that has been modified by the device
+but has not yet been acknowledged or processed by the CPU.
+
+This Memory Dirty Pages Tracker is a device facility that can record and report
+dirty pages caused by the device in device address space.
+
+The device offers a feature bit VIRTIO_F_MEM_TRACK if capable of tracking dirty pages.
+The implementation of this dirty page tracking facility is transport specific.
+
+A typical use case of this facility is to track dirty pages during the live migration process.
+
+\drivernormative{\subsection}{Memory Dirty Pages Tracker}{Facilities of a Virtio Device / Memory Dirty Pages Tracker}
+
+The driver MUST fetch and clear dirty page information atomically.
+
+\devicenormative{\subsection}{Memory Dirty Pages Tracker}{Facilities of a Virtio Device / Memory Dirty Pages Tracker}
+
+The device MUST report the dirty page information atomically.
+
 \chapter{General Initialization And Device Operation}\label{sec:General Initialization And Device Operation}
 
 We start with an overview of device initialization, then expand on the
@@ -910,6 +931,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
    SUSPEND the device.
    See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
 
+  \item[VIRTIO_F_MEM_TRACK(44)] This feature indicates that the device can track
+  memory dirty pages caused by itself.
+
 \end{description}
 
 \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
-- 
2.35.3
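
One way a driver could meet the "fetch and clear atomically" requirement is an atomic exchange per bitmap word, so that a bit set between the read and the clear is not lost. The sketch below uses C11 atomics and hypothetical helper names. Whether CPU atomics are sufficient with respect to concurrent device writes depends on the platform and transport, which this patch leaves open, so treat this as an assumption rather than a guarantee.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Fetch one 64-bit word of the dirty bitmap and clear it in one step. */
static inline uint64_t dirty_fetch_and_clear(_Atomic uint64_t *word)
{
    return atomic_exchange(word, 0);
}

/* Walk the bitmap word by word, clearing it and counting dirty pages. */
static inline size_t dirty_harvest(_Atomic uint64_t *bitmap, size_t nwords)
{
    size_t dirty = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t w = dirty_fetch_and_clear(&bitmap[i]);

        while (w) {      /* count set bits in the fetched word */
            w &= w - 1;  /* clear the lowest set bit */
            dirty++;
        }
    }
    return dirty;
}
```

A migration loop would typically call something like dirty_harvest() once per pre-copy round and resend the pages whose bits were set.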



* [virtio-comment] [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
                   ` (4 preceding siblings ...)
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 5/6] virtio: introduce dirty page tracking facility Zhu Lingshan
@ 2023-11-03 10:34 ` Zhu Lingshan
  2023-11-03 10:46   ` [virtio-comment] " Michael S. Tsirkin
  2023-11-03 10:50   ` Michael S. Tsirkin
  2023-11-07  8:01 ` [virtio-comment] Re: [PATCH V2 0/6] introduce basic facilities for virito live migration Michael S. Tsirkin
  6 siblings, 2 replies; 186+ messages in thread
From: Zhu Lingshan @ 2023-11-03 10:34 UTC (permalink / raw)
  To: jasowang, mst, eperezma, cohuck, stefanha
  Cc: virtio-comment, parav, Zhu Lingshan

This commit implements the dirty page tracking facility in
the PCI transport layer.

Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 transport-pci.tex | 72 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/transport-pci.tex b/transport-pci.tex
index 3161519..16209f4 100644
--- a/transport-pci.tex
+++ b/transport-pci.tex
@@ -188,6 +188,8 @@ \subsection{Virtio Structure PCI Capabilities}\label{sec:Virtio Transport Option
 #define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
 /* Vendor-specific data */
 #define VIRTIO_PCI_CAP_VENDOR_CFG        9
+/* Memory Dirty Pages Tracker*/
+#define VIRTIO_PCI_CAP_MEMORY_TRACK_CFG  10
 \end{lstlisting}
 
         Any other value is reserved for future use.
@@ -1230,3 +1232,73 @@ \subsubsection{Driver Handling Interrupts}\label{sec:Virtio Transport Options /
         re-examine the configuration space to see what changed.
     \end{itemize}
 \end{itemize}
+
+\subsection{Memory Dirty Pages Tracker Capability}\label{sec:Virtio
+Transport Options / Virtio Over PCI Bus / PCI Device Layout /
+Memory Dirty Pages Tracker Capability}
+
+The Memory Dirty Pages Tracker facility is found at \field{bar} and \field{offset} in VIRTIO_PCI_CAP_MEMORY_TRACK_CFG capability.
+Its layout is shown below:
+
+\begin{lstlisting}
+struct virtio_pci_dity_page_track {
+        u8 enable;               /* Read-Write */
+        u8 gra_power;            /* Read-Write */
+        u8 reserved[2];
+        le32 {
+            pasid: 20;           /* Read-Write */
+            reserved: 12;
+        };
+        le64 bitmap_addr;        /* Read-Write */
+        le64 bitmap_length;      /* Read-Write */
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{enable}]
+	The driver writes 1 to enable dirty pages tracking and sets 0 to disable.
+\item[\field{gra_power}]
+	The driver uses this to set the dirty pages tracking granularity.
+	Each bit in the bitmap covers page_size = 2\^{}(12 + gra_power) bytes,
+	so when gra_power == 0, 4K bytes page is default.
+\item[\field{pasid}]
+	Optionally, the driver uses this to assign a pasid to this capability.
+\item[\field{bitmap_addr}]
+	The driver uses this to set the address of the bitmap which records the dirty pages
+	caused by the device.
+	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
+	represents page 0 at address 0, bit 1 represents page 1, and so on in a linear manner.
+	When \field{enable} is set to 1 and the device writes to a memory page,
+	the device MUST set the corresponding bit to 1 to indicate that the page is dirty.
+\item[\field{bitmap_length}]
+	The driver uses this to set the length in bytes of the bitmap.
+\end{description}
+
+\devicenormative{\subsubsection}{Memory Dirty Pages Tracker Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory Dirty Pages Tracker Capability}
+
+The device MUST NOT set any bits beyond bitmap_length when reporting dirty pages.
+
+To prevent a read-modify-write procedure, if a memory page is dirty,
+optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1.
+
+The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
+
+The device MUST ignore any writes to \field{pasid} if PASID Extended Capability is absent or
+the PASID functionality is disabled in PASID Extended Capability.
+
+The bitmap which starts at \field{bitmap_addr} SHOULD not be considered
+as dirty when the device write to it.
+
+On a reset, the device MUST reset \field{pasid} and \field{enable}, and stop
+tracking dirty pages.
+
+\drivernormative{\subsubsection}{Memory Dirty Pages Tracker Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory Dirty Pages Tracker Capability}
+
+The driver is responsible for allocating the bitmap for tracking device dirty pages.
+
+Upon retrieving a cluster of bits from the bitmap, the driver MUST clear each of them by setting it to 0.
+
+The driver MUST configure \field{pasid} if PASID is enabled in PASID Extended Capability.
+
+The driver SHOULD NOT access \field{pasid} if PASID Extended Capability is absent or
+the PASID functionality is disabled in PASID Extended Capability.
-- 
2.35.3


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply related	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 6/6] virtio-pci: implement dirty page tracking Zhu Lingshan
@ 2023-11-03 10:46   ` Michael S. Tsirkin
  2023-11-03 14:21     ` Zhu, Lingshan
  2023-11-03 10:50   ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-03 10:46 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> +\begin{lstlisting}
> +struct virtio_pci_dity_page_track {
> +        u8 enable;               /* Read-Write */
> +        u8 gra_power;            /* Read-Write */
> +        u8 reserved[2];
> +        le32 {
> +            pasid: 20;           /* Read-Write */
> +            reserved: 12;
> +        };
> +        le64 bitmap_addr;        /* Read-Write */
> +        le64 bitmap_length;      /* Read-Write */
> +};
> +\end{lstlisting}

Okay, so it's a simple mailbox in config space.  Which by itself is
probably a very reasonable idea - more or less what I suggested.
However, using such a generic facility just for the dirty bitmap seems
too limited.  Please make it accept arbitrary commands. Reusing admin
command structure with a special "device itself" group sounds like one
way to do it.

-- 
MST


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 6/6] virtio-pci: implement dirty page tracking Zhu Lingshan
  2023-11-03 10:46   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-03 10:50   ` Michael S. Tsirkin
  2023-11-03 11:35     ` [virtio-comment] " Parav Pandit
  2023-11-03 14:32     ` [virtio-comment] " Zhu, Lingshan
  1 sibling, 2 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-03 10:50 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> +\item[\field{bitmap_addr}]
> +	The driver use this to set the address of the bitmap which records the dirty pages
> +	caused by the device.
> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
> +	reprsents page 0 at address 0, bit 1 represents page 1, and so on in a linear manner.
> +	When \field{enable} is set to 1 and the device writes to a memory page,
> +	the device MUST set the corresponding bit to 1 which indicating the page is dirty.
> +\item[\field{bitmap_length}]
> +	The driver use this to set the length in bytes of the bitmap.
> +\end{description}
> +
> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory Dirty Pages Tracker Capability}
> +
> +The device MUST NOT set any bits beyond bitmap_length when reporting dirty pages.
> +
> +To prevent a read-modify-write procedure, if a memory page is dirty,
> +optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1.
> +
> +The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
> +
> +The device must ignore any writes to \field{pasid} if PASID Extended Capability is absent or
> +the PASID functionality is disabled in PASID Extended Capability


I have to say this is going to work very badly when the number of dirty
pages is small: you will end up scanning and re-scanning the whole bitmap.
And the resolution is apparently 8 pages? You have just multiplied
the migration bandwidth by a factor of 8.
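
A minimal sketch of the arithmetic behind the quoted text and the
comment above, not part of the patch. The helper names (dpt_*) are
hypothetical, and the sketch assumes that bit n of the bitmap lives in
byte n / 8 at bit position n % 8 (least significant bit first), which
the patch does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define DPT_BASE_SHIFT 12u      /* gra_power == 0 means 4096-byte pages */

/* Shift that turns an address into a page number for this granularity. */
static inline unsigned dpt_shift(uint8_t gra_power)
{
        assert(gra_power <= 51);        /* keep the shift below 64 bits */
        return DPT_BASE_SHIFT + gra_power;
}

/* Bytes covered by one bitmap bit: 2^(12 + gra_power). */
static inline uint64_t dpt_page_size(uint8_t gra_power)
{
        return UINT64_C(1) << dpt_shift(gra_power);
}

/* Bitmap length in bytes needed to cover addresses [0, mem_bytes). */
static inline uint64_t dpt_bitmap_bytes(uint64_t mem_bytes, uint8_t gra_power)
{
        uint64_t mask = dpt_page_size(gra_power) - 1;
        uint64_t pages = (mem_bytes >> dpt_shift(gra_power)) +
                         ((mem_bytes & mask) != 0);

        return pages / 8 + (pages % 8 != 0);
}

/* Byte of the bitmap that holds the bit for the page containing addr. */
static inline uint64_t dpt_byte_index(uint64_t addr, uint8_t gra_power)
{
        return (addr >> dpt_shift(gra_power)) / 8;
}

/* Bit position inside that byte (assumed LSB-first). */
static inline unsigned dpt_bit_in_byte(uint64_t addr, uint8_t gra_power)
{
        return (unsigned)((addr >> dpt_shift(gra_power)) % 8);
}

/*
 * Memory flagged as dirty when the device sets a whole bitmap byte for a
 * single dirty page: eight pages instead of one.
 */
static inline uint64_t dpt_bytes_flagged_per_byte_write(uint8_t gra_power)
{
        return 8 * dpt_page_size(gra_power);
}
```

With gra_power == 0, one dirty 4K page reported through a whole-byte
write flags 32K of memory for retransmission, which is the factor of 8
mentioned above. Covering 1 GiB of guest memory at that granularity
takes a 32 KiB bitmap that has to be scanned on every pass.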

-- 
MST


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 5/6] virtio: introduce dirty page tracking facility
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 5/6] virtio: introduce dirty page tracking facility Zhu Lingshan
@ 2023-11-03 11:35   ` Parav Pandit
  2023-11-03 14:11     ` [virtio-comment] " Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 11:35 UTC (permalink / raw)
  To: Zhu Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment

Hi Jason,

> From: Zhu Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 3, 2023 4:05 PM
> This commit introduce a new virtio facility to track device dirty pages, a typical
> use case is live migration.
> 
> The implementation of this facility is transport specific.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>


> Signed-off-by: Jason Wang <jasowang@redhat.com>
In my dirty page tracking (aka write recording) series, you kept insisting until Thursday that it is optional and that the platform will do it.
Why do you propose this facility now?
Can you please explain, given that the commit log says the typical use case is "live migration"? :)

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 10:50   ` Michael S. Tsirkin
@ 2023-11-03 11:35     ` Parav Pandit
  2023-11-03 15:02       ` [virtio-comment] " Zhu, Lingshan
  2023-11-05 16:20       ` Michael S. Tsirkin
  2023-11-03 14:32     ` [virtio-comment] " Zhu, Lingshan
  1 sibling, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 11:35 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zhu Lingshan
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, November 3, 2023 4:20 PM
> 
> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> > +\item[\field{bitmap_addr}]
> > +	The driver use this to set the address of the bitmap which records the
> dirty pages
> > +	caused by the device.
> > +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
> > +	reprsents page 0 at address 0, bit 1 represents page 1, and so on in a
> linear manner.
> > +	When \field{enable} is set to 1 and the device writes to a memory page,
> > +	the device MUST set the corresponding bit to 1 which indicating the
> page is dirty.
> > +\item[\field{bitmap_length}]
> > +	The driver use this to set the length in bytes of the bitmap.
> > +\end{description}
> > +
> > +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
> > +Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory
> > +Dirty Pages Tracker Capability}
> > +
> > +The device MUST NOT set any bits beyond bitmap_length when reporting
> dirty pages.
> > +
> > +To prevent a read-modify-write procedure, if a memory page is dirty,
It is not about preventing it; a racy RMW is simply not possible. 😊
Hence, as a workaround, you propose marking all the pages covered by that byte as dirty. Too bad.
This just does not work.

Secondly, the bitmap array of the function is sized for the full guest memory, while there are many sparse regions now and there will be in the future too.
This is the second problem.

This is exactly why I asked you to review the page write recording series of admin commands and comment.
And you never commented with sheer ignorance.

So clearly the admin commands of [1], with their start/stop method for a specific range and without the bandwidth explosion, stand better.

If, in the future, you also do [1] on the member device using its AQ, it will work for the non-passthrough case.
If you build non-passthrough live migration using [1], it will also work.
So I don’t see any point in this series anymore.

[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html

> > +optionally the device is permitted to set the entire byte, which encompasses
> the relevant bit, to 1.
> > +
> > +The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
> > +
> > +The device must ignore any writes to \field{pasid} if PASID Extended
> > +Capability is absent or the PASID functionality is disabled in PASID
> > +Extended Capability
> 
> 
> I have to say this is going to work very badly when the number of dirty pages is
> small: you will end up scanning and re-scanning all of bitmap.
> And the resolution is apparently 8 pages? You have just multiplied the migration
> bandwidth by a factor of 8.

Yeah.
And the device does not even know whether the driver has read the previously reported pages. It is all a guessing game for driver and device.
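
For illustration only, a sketch of what the driver-side "retrieve, then
clear" step from the patch could look like. Everything here is an
assumption rather than part of the proposal: the names
(harvest_dirty_bitmap, struct page_log, log_page), a little-endian host
so that bit b of 64-bit word w is page w * 64 + b, and DMA writes that
are coherent with CPU atomic operations. It also shows that nothing in
the flow tells the device which bits the driver has already consumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: called once for every dirty page number. */
typedef void (*dirty_page_fn)(uint64_t page, void *opaque);

/*
 * One pass over the bitmap. Each 64-bit word is read and zeroed with a
 * single atomic exchange, so a bit set after the exchange is not lost;
 * it is simply picked up by the next pass. Returns the number of dirty
 * pages reported in this pass.
 */
static size_t harvest_dirty_bitmap(_Atomic uint64_t *bitmap, size_t nwords,
                                   dirty_page_fn fn, void *opaque)
{
        size_t found = 0;

        assert(bitmap != NULL || nwords == 0);

        for (size_t w = 0; w < nwords; w++) {
                uint64_t bits = atomic_exchange(&bitmap[w], 0);

                /* Walk the set bits from least to most significant. */
                for (unsigned b = 0; bits != 0; b++, bits >>= 1) {
                        if (bits & 1) {
                                fn((uint64_t)w * 64 + b, opaque);
                                found++;
                        }
                }
        }
        return found;
}

/* Small example sink that remembers the first few reported pages. */
struct page_log {
        uint64_t pages[8];
        size_t count;
};

static void log_page(uint64_t page, void *opaque)
{
        struct page_log *log = opaque;

        if (log->count < 8)
                log->pages[log->count] = page;
        log->count++;
}
```

Note that even when nothing is dirty the loop still touches every word
of the bitmap, which is the re-scanning cost raised earlier in the
thread.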

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE Zhu Lingshan
@ 2023-11-03 11:35   ` Parav Pandit
  2023-11-03 14:57     ` [virtio-comment] " Zhu, Lingshan
  2023-11-08 17:56   ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 11:35 UTC (permalink / raw)
  To: Zhu Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 3, 2023 4:05 PM
> 
> This patch adds two new le16 fields to common configuration structure to
> support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>  transport-pci.tex | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/transport-pci.tex b/transport-pci.tex index a5c6719..3161519 100644
> --- a/transport-pci.tex
> +++ b/transport-pci.tex
> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure
> layout}\label{sec:Virtio Transport
>          /* About the administration virtqueue. */
>          le16 admin_queue_index;         /* read-only for driver */
>          le16 admin_queue_num;         /* read-only for driver */
> +
> +	/* Virtqueue state */
> +        le16 queue_avail_state;         /* read-write */
> +        le16 queue_used_state;          /* read-write */
This tiny interface does not work effectively for 128 virtio-net queues accessed through register reads and writes.
There are also in-flight, out-of-order descriptors for block devices.
Hence toy registers like this do not work.

Series [1] is comprehensive and covers this even if you consider the non-passthrough device migration model,
where you can suspend individual queues using a new admin command and get them in the device context state.

[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html

>  };
>  \end{lstlisting}
> 
> @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure
> layout}\label{sec:Virtio Transport
>  	The value 0 indicates no supported administration virtqueues.
>  	This field is valid only if VIRTIO_F_ADMIN_VQ has been
>  	negotiated.
> +
> +\item[\field{queue_avail_state}]
> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> +        negotiated. The driver sets and gets the available state of
> +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> +
> +\item[\field{queue_used_state}]
> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> +        negotiated. The driver sets and gets the used state of the
> +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> +
>  \end{description}
> 
>  \devicenormative{\paragraph}{Common configuration structure layout}{Virtio
> Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common
> configuration structure layout} @@ -488,6 +503,9 @@ \subsubsection{Common
> configuration structure layout}\label{sec:Virtio Transport  present either a value
> of 0 or a power of 2 in  \field{queue_size}.
> 
> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
> +any accesses to \field{queue_avail_state} and \field{queue_used_state}.
> +
>  If VIRTIO_F_ADMIN_VQ has been negotiated, the value
> \field{admin_queue_index} MUST be equal to, or bigger than
> \field{num_queues}; also, \field{admin_queue_num} MUST be
> --
> 2.35.3


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status Zhu Lingshan
@ 2023-11-03 11:35   ` Parav Pandit
  2023-11-03 14:55     ` [virtio-comment] " Zhu, Lingshan
  2023-11-06  9:43   ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 11:35 UTC (permalink / raw)
  To: Zhu Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 3, 2023 4:05 PM
> 
> This patch introduces a new status bit in the device status: SUSPEND.
> 
> This SUSPEND bit can be used by the driver to suspend a device, in order to
> stabilize the device states and virtqueue states.
> 
> Its main use case is live migration.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

You constantly complained that whatever was proposed using the admin commands method in [1] must work for both passthrough and non-passthrough.

And halfway through the discussion, after learning all the limitations of in-band, you propose a solution that only works for non-passthrough mode.

You asked someone to have a comprehensive proposal, and when it comes to following that yourself, you just don’t.
And you have the most shallow commit message, which does not even mention it.

Please be consistent in design approach.
And if you don’t want to be, stop asking others.

This is not the way TC collaboration works.
I probably shouldn’t even expect this from you.

[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html

> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  content.tex | 36 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 76813b5..bcc9d4b 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic
> Facilities of a Virtio Dev
> 
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
> +
> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that
> +the
> +  device has been suspended by the driver.
> +
>  \end{description}
> 
>  The \field{device status} field starts out as 0, and is reinitialized to 0 by @@ -
> 73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities
> of a Virtio Dev  recover by issuing a reset.
>  \end{note}
> 
> +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
> +
> +When setting SUSPEND, the driver MUST re-read \field{device status} to ensure
> the SUSPEND bit is set.
> +
>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio
> Device / Device Status Field}
> 
>  The device MUST NOT consume buffers or send any used buffer @@ -82,6
> +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a
> Virtio Dev  that a reset is needed.  If DRIVER_OK is set, after it sets
> DEVICE_NEEDS_RESET, the device  MUST send a device configuration change
> notification to the driver.
> 
> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
> +
> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
> +
> +The device SHOULD allow settings to \field{device status} even when SUSPEND
> is set.
> +
> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD
> +clear SUSPEND and resumes operation upon DRIVER_OK.
> +
> +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND, the
> +device SHOULD perform the following actions before presenting SUSPEND bit
> in the \field{device status}:
> +
> +\begin{itemize}
> +\item Stop consuming buffers of any virtqueues and mark all finished
> descritors as used.
> +\item Wait until all descriptors that being processed to finish and mark them
> as used.
> +\item Flush all used buffer and send used buffer notifications to the driver.
> +\item Record Virtqueue State of each enabled virtqueue, see section
> +\ref{sec:Virtqueues / Virtqueue State} \item Pause its operation except
> +\field{device status} and preserve configurations in its Device
> +Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device /
> +Device Configuration Space} \end{itemize}
> +
>  \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
> 
>  Each virtio device offers all the features it understands.  During @@ -99,10
> +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device /
> Feature B  \begin{description}
>  \item[0 to 23, and 50 to 127] Feature bits for the specific device type
> 
> -\item[24 to 42] Feature bits reserved for extensions to the queue and
> +\item[24 to 43] Feature bits reserved for extensions to the queue and
>    feature negotiation mechanisms
> 
> -\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
> +\item[44 to 49, and 128 and above] Feature bits reserved for future
> extensions.
>  \end{description}
> 
>  \begin{note}
> @@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved
> Feature Bits}
>    \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device
> allows the driver
>    to access its internal virtqueue state.
> 
> +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
> +   SUSPEND the device.
> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> +
>  \end{description}
> 
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> --
> 2.35.3


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state Zhu Lingshan
@ 2023-11-03 11:35   ` Parav Pandit
  2023-11-03 14:39     ` [virtio-comment] " Zhu, Lingshan
  2023-11-03 11:52   ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 11:35 UTC (permalink / raw)
  To: Zhu Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment

> From: Zhu Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 3, 2023 4:05 PM
> 
> This patch adds new virtqueue facility to save and restore virtqueue state. The
> virtqueue state is split into two parts:
> 
> - The available state: The state that is used for read the next
>   available buffer.
> - The used state: The state that is used for make buffer used.
> 
> This will simply the transport specific method implementation. E.g two
> le16 could be used instead of a single le32). For split virtqueue, we only need
> the available state since the used state is implemented in the virtqueue itself
> (the used index). 

Sorry, this does not work.
Refer to my latest series at [2], which covers used ring elements too.
The commit log there covers the reasoning.

[2] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-03 11:52   ` Michael S. Tsirkin
  2023-11-03 14:49     ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-03 11:52 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:32PM +0800, Zhu Lingshan wrote:
> This patch adds new virtqueue facility to save and restore virtqueue
> state. The virtqueue state is split into two parts:
> 
> - The available state: The state that is used for read the next
>   available buffer.
> - The used state: The state that is used for make buffer used.
> 
> This will simply the transport specific method implementation. E.g two
> le16 could be used instead of a single le32). For split virtqueue, we
> only need the available state since the used state is implemented in
> the virtqueue itself (the used index). For packed virtqueue, we need
> both the available state and the used state.
> 
> The typical use cases are live migration and debugging.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  content.tex     |  7 ++++--
>  packed-ring.tex | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>  split-ring.tex  | 39 +++++++++++++++++++++++++++++++++
>  3 files changed, 102 insertions(+), 2 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 0a62dce..76813b5 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -99,10 +99,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
>  \begin{description}
>  \item[0 to 23, and 50 to 127] Feature bits for the specific device type
>  
> -\item[24 to 41] Feature bits reserved for extensions to the queue and
> +\item[24 to 42] Feature bits reserved for extensions to the queue and
>    feature negotiation mechanisms
>  
> -\item[42 to 49, and 128 and above] Feature bits reserved for future extensions.
> +\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
>  \end{description}
>  
>  \begin{note}
> @@ -872,6 +872,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>  	\ref{devicenormative:Basic Facilities of a Virtio Device / Feature Bits} for
>  	handling features reserved for future use.
>  
> +  \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
> +  to access its internal virtqueue state.
> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> diff --git a/packed-ring.tex b/packed-ring.tex
> index 9eeb382..ad6aba0 100644
> --- a/packed-ring.tex
> +++ b/packed-ring.tex
> @@ -729,3 +729,61 @@ \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities o
>          process_buffer(d);
>  }
>  \end{lstlisting}
> +
> +\subsection{Virtqueue State}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Virtqueue State}
> +
> +When VIRTIO_F_QUEUE_STATE has been negotiated, the driver can set and
> +get the device internal virtqueue state through the following
> +fields. The implementation of the interfaces is transport specific.
> +
> +\subsubsection{\field{Available State} Field}
> +
> +The available state field is two bytes of virtqueue state that is used by
> +the device to read the next available buffer. It is presented in the followwing format:
> +
> +\begin{lstlisting}
> +le16 {
> +  last_avail_idx : 15;
> +  last_avail_wrap_counter : 1;
> +};
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field is the free-running location
> +where the device read the next descriptor from the virtqueue descriptor ring.
> +
> +The \field{last_avail_wrap_counter} field is the last driver ring wrap
> +counter that was observed by the device.
> +
> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> +
> +\subsubsection{\field{Used State} Field}
> +
> +The used state field is two bytes of virtqueue state that is used by
> +the device when marking a buffer used. It is presented in the followwing format:
> +
> +\begin{lstlisting}
> +le16 {
> +  used_idx : 15;
> +  used_wrap_counter : 1;
> +};
> +\end{lstlisting}
> +
> +The \field{used_idx} field is the free-running location where the device write the next
> +used descriptor to the descriptor ring.
> +
> +The \field{used_wrap_counter} field is the wrap counter that is used
> +by the device.
> +
> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
> +
> +\devicenormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Packed Virtqueues/ Virtqueue State}
> +
> +The device SHOULD only accept setting Virtqueue State of any packed virtqueues when DRIVER_OK is not set in \field{device status}, or SUSPEND is set in \field{device status}.

Except SUSPEND is undefined and is explained in patch 2 :(
Please do not split patches like this, you are just
splitting them at random boundaries.


> +Otherwise the device MUST ignore any writes to Virtqueue State of any packed virtqueues.
> +
> +When SUSPEND is set, the device MUST record the Virtqueue State of every enabled packed virtqueue
> +in \field{Available State} field and \field{Used State} field respectively,
> +and correspondingly restore the Virtqueue State of every enabled packed virtqueue
> +from \field{Available State} field and \field{Used State} field when DRIVER_OK is set.
> +
> +The device SHOULD reset \field{Available State} field and \field{Used State} field upon a device reset.
> diff --git a/split-ring.tex b/split-ring.tex
> index de94038..a78b44d 100644
> --- a/split-ring.tex
> +++ b/split-ring.tex
> @@ -734,3 +734,42 @@ \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities o
>  }
>  \end{lstlisting}
>  \end{note}
> +
> +\subsection{Virtqueue State}\label{sec:Basic Facilities of a Virtio Device / Splited Virtqueues / Virtqueue State}
> +
> +When VIRTIO_F_QUEUE_STATE has been negotiated, the driver can set and
> +get the device internal virtqueue state through the following
> +fields. The implementation of the interfaces is transport specific.
> +
> +\subsubsection{\field{Available State} Field}
> +
> +The available state field is two bytes of virtqueue state that is used by
> +the device to read the next available buffer. It is presented in the followwing format:
> +
> +\begin{lstlisting}
> +le16 last_avail_idx;
> +\end{lstlisting}
> +
> +The \field{last_avail_idx} field is the free-running available ring
> +index where the device will read the next available head of a
> +descriptor chain.
> +
> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring}.
> +
> +\drivernormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Splited Virtqueues/ Virtqueue State}
> +
> +The driver SHOULD NOT access \field{Used State} of any splited virtqueues, it SHOULD use the
> +used index in the used ring.
> +
> +\devicenormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Splited Virtqueues/ Virtqueue State}
> +
> +The device SHOULD only accept setting Virtqueue State of any splited virtqueues
> +when DRIVER_OK is not set in \field{device status} or SUSPEND is set in \field{device status}.
> +Otherwise the device MUST ignore any writes to Virtqueue State of any splited virtqueues.

all these requests to ignore writes are to what end? just prohibit
driver from doing this.

> +
> +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
> +in \field{Available State} field,
> +and correspondingly restore the Available State of every enabled splited virtqueue
> +from \field{Available State} field when DRIVER_OK is set.
> +
> +The device SHOULD reset \field{Available State} field upon a device reset.

At this point I have no idea
- how can a state of a virtqueue at a random time be represented
  by a 16 bit integer
- if it's not at a random time then why do you even need an integer -
  synchronize queue to memory and then all state is in memory



> -- 
> 2.35.3
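
As a sketch of the packed-ring encoding in the quoted patch, the two
le16 state fields can be packed and unpacked as shown below. The names
(struct vq_packed_state, vq_state_pack, vq_state_unpack) are
hypothetical, and the layout assumes the first listed bitfield occupies
the least significant bits, so the index is bits 0 to 14 and the wrap
counter is bit 15. Values are in CPU byte order; conversion to
little-endian for the register is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded form of either Available State or Used State (packed ring). */
struct vq_packed_state {
        uint16_t idx;           /* 15-bit index into the descriptor ring */
        bool wrap_counter;      /* ring wrap counter */
};

/* Encode index and wrap counter into the 16-bit state value. */
static inline uint16_t vq_state_pack(struct vq_packed_state s)
{
        assert(s.idx <= 0x7fffu);       /* index must fit in 15 bits */
        return (uint16_t)(s.idx | (s.wrap_counter ? 0x8000u : 0u));
}

/* Decode the 16-bit state value back into index and wrap counter. */
static inline struct vq_packed_state vq_state_unpack(uint16_t raw)
{
        struct vq_packed_state s = {
                (uint16_t)(raw & 0x7fffu),
                (raw & 0x8000u) != 0,
        };

        return s;
}
```

This only captures the two ring positions. It says nothing about
descriptors that are in flight or completed out of order, which is the
gap the review comments above point at.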


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 5/6] virtio: introduce dirty page tracking facility
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-03 14:11     ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:11 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/3/2023 7:35 PM, Parav Pandit wrote:
> Hi Jason,
>
>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 3, 2023 4:05 PM
>> This commit introduce a new virtio facility to track device dirty pages, a typical
>> use case is live migration.
>>
>> The implementation of this facility is transport specific.
>>
>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> In my series of dirty page tracking (aka write recording), you kept insisting until Thu, that it is optional, and platform will do it.
> Why do you propose this facility now?
> Can you please explain as commit log says typical use case is "live migration"? :)
You misunderstood us.

If you remember, it was you who challenged our config space proposal (v1) as
incomplete without a dirty page tracking facility.

This is to answer your challenge, as an optional backup.

We still believe dirty page tracking is better done by platform
facilities, like VT-d.




* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 10:46   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-03 14:21     ` Zhu, Lingshan
  2023-11-06  9:16       ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/3/2023 6:46 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>> +\begin{lstlisting}
>> +struct virtio_pci_dity_page_track {
>> +        u8 enable;               /* Read-Write */
>> +        u8 gra_power;            /* Read-Write */
>> +        u8 reserved[2];
>> +        le32 {
>> +            pasid: 20;           /* Read-Write */
>> +            reserved: 12;
>> +        };
>> +        le64 bitmap_addr;        /* Read-Write */
>> +        le64 bitmap_length;      /* Read-Write */
>> +};
>> +\end{lstlisting}
> Okay, so it's a simple mailbox in config space.  Which by itself is
> probably a very reasonable idea - more or less what I suggested.
> However, using such a generic facility just for the dirty bitmap seems
> too limited.  Please make it accept arbitrary commands. Reusing admin
> command structure with a special "device itself" group sounds like one
> way to do it.
Processing admin commands in a capability may be too complex and overkill.
We would need to handle variable-length commands, asynchronously returned
results, and so on.

This struct seems easy and simple. And shall we use platform facilities
like VT-d to track dirty pages?
>
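
For reference, the quoted capability structure can be written out as a plain C
layout. This is only a sketch under stated assumptions: the names
dirty_track_cap, PASID_MASK, cap_get_pasid and cap_set_pasid are hypothetical,
the host is assumed to be little-endian, and the 20-bit pasid is assumed to
occupy the low bits of the le32 word (the usual virtio bitfield convention).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the quoted struct; little-endian host assumed. */
struct dirty_track_cap {
        uint8_t  enable;         /* Read-Write */
        uint8_t  gra_power;      /* Read-Write */
        uint8_t  reserved[2];
        uint32_t pasid_word;     /* bits 0..19 pasid, bits 20..31 reserved (assumed order) */
        uint64_t bitmap_addr;    /* Read-Write */
        uint64_t bitmap_length;  /* Read-Write, in bytes */
};

#define PASID_MASK 0xFFFFFu      /* low 20 bits */

/* The fields are naturally aligned, so no padding is expected. */
static_assert(offsetof(struct dirty_track_cap, bitmap_addr) == 8, "unexpected padding");
static_assert(sizeof(struct dirty_track_cap) == 24, "unexpected size");

/* Extract the 20-bit PASID from the combined word. */
static inline uint32_t cap_get_pasid(const struct dirty_track_cap *c)
{
        return c->pasid_word & PASID_MASK;
}

/* Update the PASID while leaving the reserved bits untouched. */
static inline void cap_set_pasid(struct dirty_track_cap *c, uint32_t pasid)
{
        c->pasid_word = (c->pasid_word & ~PASID_MASK) | (pasid & PASID_MASK);
}
```

Under these assumptions the structure is 24 bytes with no implicit padding.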



* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 10:50   ` Michael S. Tsirkin
  2023-11-03 11:35     ` [virtio-comment] " Parav Pandit
@ 2023-11-03 14:32     ` Zhu, Lingshan
  2023-11-05 16:16       ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav




On 11/3/2023 6:50 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>> +\item[\field{bitmap_addr}]
>> +	The driver use this to set the address of the bitmap which records the dirty pages
>> +	caused by the device.
>> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so on in a linear manner.
>> +	When \field{enable} is set to 1 and the device writes to a memory page,
>> +	the device MUST set the corresponding bit to 1 which indicating the page is dirty.
>> +\item[\field{bitmap_length}]
>> +	The driver use this to set the length in bytes of the bitmap.
>> +\end{description}
>> +
>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory Dirty Pages Tracker Capability}
>> +
>> +The device MUST NOT set any bits beyond bitmap_length when reporting dirty pages.
>> +
>> +To prevent a read-modify-write procedure, if a memory page is dirty,
>> +optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1.
>> +
>> +The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
>> +
>> +The device must ignore any writes to \field{pasid} if PASID Extended Capability is absent or
>> +the PASID functionality is disabled in PASID Extended Capability
>
> I have to say this is going to work very badly when the number of dirty
> pages is small: you will end up scanning and re-scanning all of bitmap.
The driver needs to scan anyway. Intel products have worked with a similar
bitmap-based dirty page tracking solution for years.

Otherwise the device would have to report PFNs, which is not very practical.
> And the resolution is apparently 8 pages? You have just multiplied
> the migration bandwidth by a factor of 8.
No, as described in the comments, the tracking granularity is controlled
by \field{gra_power}: one bit represents a page with page_size = 2^(12 +
gra_power). This can also be used to reduce the size of the bitmap.

"To prevent a read-modify-write procedure, if a memory page is dirty,
optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1."

This is optional. DMA is very likely to write a neighboring page as well, and the
device transmits a whole byte anyway when a bit is dirty.

How about we use the platform dirty page tracking facility, and then implement this in virtio, as Jason suggested?

>
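
The granularity arithmetic described in the reply above can be sketched in C.
This is an illustration, not part of the patch: the helper names are
hypothetical, the bitmap is assumed to start at address 0 as the quoted text
says, and bit 0 is assumed to be the least significant bit of byte 0.

```c
#include <assert.h>
#include <stdint.h>

/* One bit covers a page of 2^(12 + gra_power) bytes. */
static inline uint64_t track_page_size(uint8_t gra_power)
{
        assert(gra_power < 52);          /* keep the shift within 64 bits */
        return UINT64_C(1) << (12 + gra_power);
}

/* Index of the bitmap bit that covers a given address. */
static inline uint64_t dirty_bit_index(uint64_t addr, uint8_t gra_power)
{
        assert(gra_power < 52);
        return addr >> (12 + gra_power);
}

/* Bytes of bitmap needed to cover mem_bytes of memory starting at address 0. */
static inline uint64_t bitmap_bytes(uint64_t mem_bytes, uint8_t gra_power)
{
        uint64_t page = track_page_size(gra_power);
        uint64_t pages = mem_bytes / page + (mem_bytes % page != 0);

        return pages / 8 + (pages % 8 != 0);
}

/* Mark one address dirty by setting only its own bit (LSB-first order assumed). */
static inline void mark_dirty(uint8_t *bitmap, uint64_t addr, uint8_t gra_power)
{
        uint64_t bit = dirty_bit_index(addr, gra_power);

        bitmap[bit / 8] |= (uint8_t)(1u << (bit % 8));
}
```

With these definitions, 4 GiB of memory needs a 128 KiB bitmap at gra_power = 0
(4 KiB pages) and a 256-byte bitmap at gra_power = 9 (2 MiB pages).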


* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-03 14:39     ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:39 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment




On 11/3/2023 7:35 PM, Parav Pandit wrote:
>> From: Zhu Lingshan<lingshan.zhu@intel.com>
>> Sent: Friday, November 3, 2023 4:05 PM
>>
>> This patch adds new virtqueue facility to save and restore virtqueue state. The
>> virtqueue state is split into two parts:
>>
>> - The available state: The state that is used for read the next
>>    available buffer.
>> - The used state: The state that is used for make buffer used.
>>
>> This will simply the transport specific method implementation. E.g two
>> le16 could be used instead of a single le32). For split virtqueue, we only need
>> the available state since the used state is implemented in the virtqueue itself
>> (the used index).
> Sorry, this does not work.
> Refer to my latest series at [2] that covers used ring elements too.
> Commit change log covered the reasoning.
>
> [2]https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
This patch is meant to migrate the device's internal virtqueue state, not
in-guest or in-descriptor state, right?

Can you name any missing fields?



* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-03 11:52   ` Michael S. Tsirkin
@ 2023-11-03 14:49     ` Zhu, Lingshan
  2023-11-06  9:35       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav




On 11/3/2023 7:52 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:32PM +0800, Zhu Lingshan wrote:
>> This patch adds new virtqueue facility to save and restore virtqueue
>> state. The virtqueue state is split into two parts:
>>
>> - The available state: The state that is used for read the next
>>    available buffer.
>> - The used state: The state that is used for make buffer used.
>>
>> This will simply the transport specific method implementation. E.g two
>> le16 could be used instead of a single le32). For split virtqueue, we
>> only need the available state since the used state is implemented in
>> the virtqueue itself (the used index). For packed virtqueue, we need
>> both the available state and the used state.
>>
>> The typical use cases are live migration and debugging.
>>
>> Signed-off-by: Zhu Lingshan<lingshan.zhu@intel.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>> ---
>>   content.tex     |  7 ++++--
>>   packed-ring.tex | 58 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   split-ring.tex  | 39 +++++++++++++++++++++++++++++++++
>>   3 files changed, 102 insertions(+), 2 deletions(-)
>>
>> diff --git a/content.tex b/content.tex
>> index 0a62dce..76813b5 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -99,10 +99,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
>>   \begin{description}
>>   \item[0 to 23, and 50 to 127] Feature bits for the specific device type
>>   
>> -\item[24 to 41] Feature bits reserved for extensions to the queue and
>> +\item[24 to 42] Feature bits reserved for extensions to the queue and
>>     feature negotiation mechanisms
>>   
>> -\item[42 to 49, and 128 and above] Feature bits reserved for future extensions.
>> +\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
>>   \end{description}
>>   
>>   \begin{note}
>> @@ -872,6 +872,9 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>   	\ref{devicenormative:Basic Facilities of a Virtio Device / Feature Bits} for
>>   	handling features reserved for future use.
>>   
>> +  \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
>> +  to access its internal virtqueue state.
>> +
>>   \end{description}
>>   
>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>> diff --git a/packed-ring.tex b/packed-ring.tex
>> index 9eeb382..ad6aba0 100644
>> --- a/packed-ring.tex
>> +++ b/packed-ring.tex
>> @@ -729,3 +729,61 @@ \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities o
>>           process_buffer(d);
>>   }
>>   \end{lstlisting}
>> +
>> +\subsection{Virtqueue State}\label{sec:Basic Facilities of a Virtio Device / Packed Virtqueues / Virtqueue State}
>> +
>> +When VIRTIO_F_QUEUE_STATE has been negotiated, the driver can set and
>> +get the device internal virtqueue state through the following
>> +fields. The implementation of the interfaces is transport specific.
>> +
>> +\subsubsection{\field{Available State} Field}
>> +
>> +The available state field is two bytes of virtqueue state that is used by
>> +the device to read the next available buffer. It is presented in the followwing format:
>> +
>> +\begin{lstlisting}
>> +le16 {
>> +  last_avail_idx : 15;
>> +  last_avail_wrap_counter : 1;
>> +};
>> +\end{lstlisting}
>> +
>> +The \field{last_avail_idx} field is the free-running location
>> +where the device read the next descriptor from the virtqueue descriptor ring.
>> +
>> +The \field{last_avail_wrap_counter} field is the last driver ring wrap
>> +counter that was observed by the device.
>> +
>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>> +
>> +\subsubsection{\field{Used State} Field}
>> +
>> +The used state field is two bytes of virtqueue state that is used by
>> +the device when marking a buffer used. It is presented in the followwing format:
>> +
>> +\begin{lstlisting}
>> +le16 {
>> +  used_idx : 15;
>> +  used_wrap_counter : 1;
>> +};
>> +\end{lstlisting}
>> +
>> +The \field{used_idx} field is the free-running location where the device write the next
>> +used descriptor to the descriptor ring.
>> +
>> +The \field{used_wrap_counter} field is the wrap counter that is used
>> +by the device.
>> +
>> +See also \ref{sec:Packed Virtqueues / Driver and Device Ring Wrap Counters}.
>> +
>> +\devicenormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Packed Virtqueues/ Virtqueue State}
>> +
>> +The device SHOULD only accept setting Virtqueue State of any packed virtqueues when DRIVER_OK is not set in \field{device status}, or SUSPEND is set in \field{device status}.
> Except SUSPEND is undefined and is explained in patch 2 :(
> Please do not split patches like this, you are just
> splitting them at random boundaries.
will do.
>
>
>> +Otherwise the device MUST ignore any writes to Virtqueue State of any packed virtqueues.
>> +
>> +When SUSPEND is set, the device MUST record the Virtqueue State of every enabled packed virtqueue
>> +in \field{Available State} field and \field{Used State} field respectively,
>> +and correspondingly restore the Virtqueue State of every enabled packed virtqueue
>> +from \field{Available State} field and \field{Used State} field when DRIVER_OK is set.
>> +
>> +The device SHOULD reset \field{Available State} field and \field{Used State} field upon a device reset.
>> diff --git a/split-ring.tex b/split-ring.tex
>> index de94038..a78b44d 100644
>> --- a/split-ring.tex
>> +++ b/split-ring.tex
>> @@ -734,3 +734,42 @@ \subsection{Receiving Used Buffers From The Device}\label{sec:Basic Facilities o
>>   }
>>   \end{lstlisting}
>>   \end{note}
>> +
>> +\subsection{Virtqueue State}\label{sec:Basic Facilities of a Virtio Device / Splited Virtqueues / Virtqueue State}
>> +
>> +When VIRTIO_F_QUEUE_STATE has been negotiated, the driver can set and
>> +get the device internal virtqueue state through the following
>> +fields. The implementation of the interfaces is transport specific.
>> +
>> +\subsubsection{\field{Available State} Field}
>> +
>> +The available state field is two bytes of virtqueue state that is used by
>> +the device to read the next available buffer. It is presented in the followwing format:
>> +
>> +\begin{lstlisting}
>> +le16 last_avail_idx;
>> +\end{lstlisting}
>> +
>> +The \field{last_avail_idx} field is the free-running available ring
>> +index where the device will read the next available head of a
>> +descriptor chain.
>> +
>> +See also \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / The Virtqueue Available Ring}.
>> +
>> +\drivernormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Splited Virtqueues/ Virtqueue State}
>> +
>> +The driver SHOULD NOT access \field{Used State} of any splited virtqueues, it SHOULD use the
>> +used index in the used ring.
>> +
>> +\devicenormative{\subsubsection}{Virtqueue State}{Basic Facilities of a Virtio Device / Splited Virtqueues/ Virtqueue State}
>> +
>> +The device SHOULD only accept setting Virtqueue State of any splited virtqueues
>> +when DRIVER_OK is not set in \field{device status} or SUSPEND is set in \field{device status}.
>> +Otherwise the device MUST ignore any writes to Virtqueue State of any splited virtqueues.
> all these requests to ignore writes are to what end? just prohibit
> driver from doing this.
OK
>
>> +
>> +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
>> +in \field{Available State} field,
>> +and correspondingly restore the Available State of every enabled splited virtqueue
>> +from \field{Available State} field when DRIVER_OK is set.
>> +
>> +The device SHOULD reset \field{Available State} field upon a device reset.
> At this point I have no idea
> - how can a state of a virtqueue at a random time be represented
>    by a 16 bit integer
I am not sure what you mean by a random time. This line requires the device to
reset its available state; for example, it is "le16 queue_avail_state" in the
virtio-pci common cfg. Resetting it ensures the device will not restore
a stale value from the previous run.
> - if it's not at a random time then why do you even need an integer -
>    synchronize queue to memory and then all state is in memory
I am not sure what you mean by synchronizing the queue, but, for example,
"le16 queue_avail_state" for the PCI transport exists in a capability.
>
>
>
>> -- 
>> 2.35.3
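
The two-byte state layouts in the quoted patch can be sketched as C helpers.
This is an illustration only: the names are hypothetical, and the 15-bit index
is assumed to sit in bits 0..14 with the wrap counter in bit 15, following the
usual virtio convention that the first bitfield occupies the low bits.

```c
#include <assert.h>
#include <stdint.h>

#define VQ_STATE_IDX_MASK  0x7FFFu   /* bits 0..14: index (assumed low bits) */
#define VQ_STATE_WRAP_BIT  0x8000u   /* bit 15: wrap counter (assumed) */

/* Pack a packed-virtqueue index and wrap counter into one 16-bit state value. */
static inline uint16_t vq_state_pack(uint16_t idx, unsigned wrap)
{
        assert(idx <= VQ_STATE_IDX_MASK);
        return (uint16_t)(idx | (wrap ? VQ_STATE_WRAP_BIT : 0u));
}

/* Extract the 15-bit index. */
static inline uint16_t vq_state_idx(uint16_t state)
{
        return (uint16_t)(state & VQ_STATE_IDX_MASK);
}

/* Extract the wrap counter as 0 or 1. */
static inline unsigned vq_state_wrap(uint16_t state)
{
        return (state & VQ_STATE_WRAP_BIT) != 0;
}

/*
 * Split virtqueue: last_avail_idx is free-running, so the available ring
 * slot it refers to is the index modulo the queue size.
 */
static inline uint16_t split_ring_slot(uint16_t last_avail_idx, uint16_t queue_size)
{
        assert(queue_size != 0);
        return (uint16_t)(last_avail_idx % queue_size);
}
```

For a split virtqueue only the single le16 last_avail_idx is needed, which is
why the patch defines no Used State for that format.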


* [virtio-comment] Re: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-03 14:55     ` Zhu, Lingshan
  2023-11-03 15:54       ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:55 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/3/2023 7:35 PM, Parav Pandit wrote:
>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 3, 2023 4:05 PM
>>
>> This patch introduces a new status bit in the device status: SUSPEND.
>>
>> This SUSPEND bit can be used by the driver to suspend a device, in order to
>> stabilize the device states and virtqueue states.
>>
>> Its main use case is live migration.
>>
>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> You constantly complained that whatever was proposed using admin commands method in [1] must work for passthrough and non-passthrough.
>
> And halfway in the discussion you propose a method after learning all the limitations of in-band, you propose a solution only works for non-passthrough mode.
>
> You asked someone to have comprehensive proposal and when it comes to you following it, you just don’t.
not sure what you are talking about.
> And have most shallow commit message to not even mention it.
>
> Please be consistent in design approach.
> And if you don’t want to be, stop asking others.
This SUSPEND/RESUME has not changed since the RFC series, so how can it
be inconsistent?
>
> This is not the way TC collaboration works.
> I probably shouldn’t even expect this from you.
>
> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
Please don't be so emotional and please be professional.

Why can this solution not work for pass-through? Do you know that device
ownership is transferred to the hypervisor when the guest is suspended
during live migration?
>
>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> ---
>>   content.tex | 36 ++++++++++++++++++++++++++++++++++--
>>   1 file changed, 34 insertions(+), 2 deletions(-)
>>
>> diff --git a/content.tex b/content.tex
>> index 76813b5..bcc9d4b 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic
>> Facilities of a Virtio Dev
>>
>>   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>>     an error from which it can't recover.
>> +
>> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that
>> +the
>> +  device has been suspended by the driver.
>> +
>>   \end{description}
>>
>>   The \field{device status} field starts out as 0, and is reinitialized to 0 by @@ -
>> 73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities
>> of a Virtio Dev  recover by issuing a reset.
>>   \end{note}
>>
>> +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
>> +
>> +When setting SUSPEND, the driver MUST re-read \field{device status} to ensure
>> the SUSPEND bit is set.
>> +
>>   \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio
>> Device / Device Status Field}
>>
>>   The device MUST NOT consume buffers or send any used buffer @@ -82,6
>> +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a
>> Virtio Dev  that a reset is needed.  If DRIVER_OK is set, after it sets
>> DEVICE_NEEDS_RESET, the device  MUST send a device configuration change
>> notification to the driver.
>>
>> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
>> +
>> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
>> +
>> +The device SHOULD allow settings to \field{device status} even when SUSPEND
>> is set.
>> +
>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD
>> +clear SUSPEND and resumes operation upon DRIVER_OK.
>> +
>> +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND, the
>> +device SHOULD perform the following actions before presenting SUSPEND bit
>> in the \field{device status}:
>> +
>> +\begin{itemize}
>> +\item Stop consuming buffers of any virtqueues and mark all finished
>> descritors as used.
>> +\item Wait until all descriptors that being processed to finish and mark them
>> as used.
>> +\item Flush all used buffer and send used buffer notifications to the driver.
>> +\item Record Virtqueue State of each enabled virtqueue, see section
>> +\ref{sec:Virtqueues / Virtqueue State} \item Pause its operation except
>> +\field{device status} and preserve configurations in its Device
>> +Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device /
>> +Device Configuration Space} \end{itemize}
>> +
>>   \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
>>
>>   Each virtio device offers all the features it understands.  During @@ -99,10
>> +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device /
>> Feature B  \begin{description}
>>   \item[0 to 23, and 50 to 127] Feature bits for the specific device type
>>
>> -\item[24 to 42] Feature bits reserved for extensions to the queue and
>> +\item[24 to 43] Feature bits reserved for extensions to the queue and
>>     feature negotiation mechanisms
>>
>> -\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
>> +\item[44 to 49, and 128 and above] Feature bits reserved for future
>> extensions.
>>   \end{description}
>>
>>   \begin{note}
>> @@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved
>> Feature Bits}
>>     \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device
>> allows the driver
>>     to access its internal virtqueue state.
>>
>> +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
>> +   SUSPEND the device.
>> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
>> +
>>   \end{description}
>>
>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>> --
>> 2.35.3



* [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-03 14:57     ` Zhu, Lingshan
  2023-11-03 15:50       ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 14:57 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/3/2023 7:35 PM, Parav Pandit wrote:
>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 3, 2023 4:05 PM
>>
>> This patch adds two new le16 fields to common configuration structure to
>> support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>
>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>> ---
>>   transport-pci.tex | 18 ++++++++++++++++++
>>   1 file changed, 18 insertions(+)
>>
>> diff --git a/transport-pci.tex b/transport-pci.tex index a5c6719..3161519 100644
>> --- a/transport-pci.tex
>> +++ b/transport-pci.tex
>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure
>> layout}\label{sec:Virtio Transport
>>           /* About the administration virtqueue. */
>>           le16 admin_queue_index;         /* read-only for driver */
>>           le16 admin_queue_num;         /* read-only for driver */
>> +
>> +	/* Virtqueue state */
>> +        le16 queue_avail_state;         /* read-write */
>> +        le16 queue_used_state;          /* read-write */
> This tiny interface for 128 virtio net queues through register read writes, does not work effectively.
> There are inflight out of order descriptors for block also.
> Hence toy registers like this do not work.
Do you know there is a queue_select? Why does this not work? Do you know
how other queue-related fields work, such as how to set a queue size and
enable the queue?
>
> Series [1] is comprehensive that covers it even if you consider non-passtrhough device migration model.
> Where you can suspend individual queues using new admin command and get them in the device context state.
>
> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
I suggest you read QEMU migration code. If you don't want to, not my fault.
>
>>   };
>>   \end{lstlisting}
>>
>> @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure
>> layout}\label{sec:Virtio Transport
>>   	The value 0 indicates no supported administration virtqueues.
>>   	This field is valid only if VIRTIO_F_ADMIN_VQ has been
>>   	negotiated.
>> +
>> +\item[\field{queue_avail_state}]
>> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
>> +        negotiated. The driver sets and gets the available state of
>> +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
>> +
>> +\item[\field{queue_used_state}]
>> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
>> +        negotiated. The driver sets and gets the used state of the
>> +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
>> +
>>   \end{description}
>>
>>   \devicenormative{\paragraph}{Common configuration structure layout}{Virtio
>> Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common
>> configuration structure layout} @@ -488,6 +503,9 @@ \subsubsection{Common
>> configuration structure layout}\label{sec:Virtio Transport  present either a value
>> of 0 or a power of 2 in  \field{queue_size}.
>>
>> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
>> +any accesses to \field{queue_avail_state} and \field{queue_used_state}.
>> +
>>   If VIRTIO_F_ADMIN_VQ has been negotiated, the value
>> \field{admin_queue_index} MUST be equal to, or bigger than
>> \field{num_queues}; also, \field{admin_queue_num} MUST be
>> --
>> 2.35.3



* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 11:35     ` [virtio-comment] " Parav Pandit
@ 2023-11-03 15:02       ` Zhu, Lingshan
  2023-11-03 15:47         ` [virtio-comment] " Parav Pandit
  2023-11-05 16:20       ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-03 15:02 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/3/2023 7:35 PM, Parav Pandit wrote:
>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Friday, November 3, 2023 4:20 PM
>>
>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>> +\item[\field{bitmap_addr}]
>>> +	The driver use this to set the address of the bitmap which records the
>> dirty pages
>>> +	caused by the device.
>>> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
>>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so on in a
>> linear manner.
>>> +	When \field{enable} is set to 1 and the device writes to a memory page,
>>> +	the device MUST set the corresponding bit to 1 which indicating the
>> page is dirty.
>>> +\item[\field{bitmap_length}]
>>> +	The driver use this to set the length in bytes of the bitmap.
>>> +\end{description}
>>> +
>>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
>>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory
>>> +Dirty Pages Tracker Capability}
>>> +
>>> +The device MUST NOT set any bits beyond bitmap_length when reporting
>> dirty pages.
>>> +
>>> +To prevent a read-modify-write procedure, if a memory page is dirty,
> It is not to prevent; it is just not possible to do racy RMW. 😊
If you understand what an atomic routine is, you will not call it racy.
> Hence to work around you propose to mark all pages dirty. Too bad.
> This just does not work.
why? and this is optional.
>
> Secondly the bitmap array is function is for full guest memory size, while there is lot of sparce region now and also in future.
> This is the second problem.
Did you see gra_power and its comments?
>
> This is exactly why I asked you to review the page write recording series of admin commands and comment.
> And you never commented with sheer ignorance.
>
> So clearly the start stop method for specific range and without bandwidth explosion, admin commands of [1] stands better.
>
> If you do [1] on the member device also using its AQ in future, it will work for non-passthrough case.
> If you build non-passthrough live migration using [1], also it will work.
> So I don’t see any point of this series anymore.
As Jason pointed out, there are many problems in your proposal,
you should answer there. I don't need to repeat his words and duplicate 
the discussions.
>
> [1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
You still need to explain why this does not work for pass-through. And I
recall this is a pointless topic, as MST once wanted to mute
another "pass-through" thread.
>
>>> +optionally the device is permitted to set the entire byte, which encompasses
>> the relevant bit, to 1.
>>> +
>>> +The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
>>> +
>>> +The device must ignore any writes to \field{pasid} if PASID Extended
>>> +Capability is absent or the PASID functionality is disabled in PASID
>>> +Extended Capability
>>
>> I have to say this is going to work very badly when the number of dirty pages is
>> small: you will end up scanning and re-scanning all of bitmap.
>> And the resolution is apparently 8 pages? You have just multiplied the migration
>> bandwidth by a factor of 8.
> Yeah.
> And device does not even know previously reported pages are read by driver or not. All guess work game for driver and device.
see my reply to him


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 15:02       ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-03 15:47         ` Parav Pandit
  2023-11-05 16:12           ` [virtio-comment] " Michael S. Tsirkin
  2023-11-06  3:52           ` Zhu, Lingshan
  0 siblings, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 15:47 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 3, 2023 8:33 PM
> 
> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >> From: Michael S. Tsirkin <mst@redhat.com>
> >> Sent: Friday, November 3, 2023 4:20 PM
> >>
> >> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> >>> +\item[\field{bitmap_addr}]
> >>> +	The driver use this to set the address of the bitmap which records
> >>> +the
> >> dirty pages
> >>> +	caused by the device.
> >>> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
> >>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so on
> >>> +in a
> >> linear manner.
> >>> +	When \field{enable} is set to 1 and the device writes to a memory page,
> >>> +	the device MUST set the corresponding bit to 1 which indicating
> >>> +the
> >> page is dirty.
> >>> +\item[\field{bitmap_length}]
> >>> +	The driver use this to set the length in bytes of the bitmap.
> >>> +\end{description}
> >>> +
> >>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
> >>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory
> >>> +Dirty Pages Tracker Capability}
> >>> +
> >>> +The device MUST NOT set any bits beyond bitmap_length when
> >>> +reporting
> >> dirty pages.
> >>> +
> >>> +To prevent a read-modify-write procedure, if a memory page is
> >>> +dirty,
> > It is not to prevent; it is just not possible to do racy RMW. 😊
> if you understand what is a atomic routine, you will not call it racy.
> > Hence to work around you propose to mark all pages dirty. Too bad.
> > This just does not work.
> why? and this is optional.
Because the device cannot atomically set individual bits in a byte that the CPU is also reading and clearing.
1. The device reads the byte, which has bits 0 and 4 set.
2. The CPU atomically clears these bits.
3. The device writes back bits 0 and 4 together with the new bits 2 and 3.
4. The CPU now transfers pages 0 and 4 again.

Optional thing also needs to work. :)
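
The four steps above can be replayed as a small, single-threaded simulation. This is only a sketch under stated assumptions: the function names are hypothetical, the interleaving is forced by hand rather than produced by real hardware, and the least-significant bit of the byte is assumed to stand for page 0.

```python
def set_bits(value):
    """Return the bit positions that are set in one bitmap byte (LSB = bit 0)."""
    return [bit for bit in range(8) if (value >> bit) & 1]


def simulate_lost_clear():
    """Replay the device/CPU interleaving on one shared bitmap byte.

    Hypothetical model: the device updates the byte with a plain
    read-modify-write, while the CPU harvests it with fetch-and-clear.
    """
    bitmap = bytearray([0b00010001])      # pages 0 and 4 are already dirty

    stale = bitmap[0]                     # 1. device reads the byte
    harvested = bitmap[0]                 # 2. CPU fetches the byte ...
    bitmap[0] = 0                         #    ... and clears it
    bitmap[0] = stale | 0b00001100        # 3. device writes stale value + bits 2, 3

    return harvested, bitmap[0]


harvested, final = simulate_lost_clear()
# 4. pages present in both the harvested byte and the final byte get sent twice
resent = set_bits(harvested & final)
print("harvested:", set_bits(harvested))
print("still marked dirty:", set_bits(final))
print("transferred again:", resent)
```

In this model the CPU's clear of bits 0 and 4 is undone by the device's write-back, so those two pages are reported a second time even though the device never wrote to them again.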

> >
> > Secondly the bitmap array is function is for full guest memory size, while
> there is lot of sparce region now and also in future.
> > This is the second problem.
> did you see gra_power and its comments?
gra_power specifies the page size,
not the multiple sparse ranges of guest memory.
The device ends up tracking areas of no interest as well.

> >
> > This is exactly why I asked you to review the page write recording series of
> admin commands and comment.
> > And you never commented with sheer ignorance.
> >
> > So clearly the start stop method for specific range and without bandwidth
> explosion, admin commands of [1] stands better.
> >
> > If you do [1] on the member device also using its AQ in future, it will work for
> non-passthrough case.
> > If you build non-passthrough live migration using [1], also it will work.
> > So I don’t see any point of this series anymore.
> As Jason pointed out, there are many problems in your proposal, you should
> answer there. I don't need to repeat his words and duplicate the discussions.
Many are already addressed in v3.

> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> you still need to explain why this does not work for pass-through. 
It does not work for the following reasons.
1. None of the fields placed on the member device are under the direct control of the hypervisor.
The guest directly controls the device, including the device status, and when it resets the device, everything stored in the device is lost.

2. A PCI FLR clears all the registers you exposed here.

3. Endlessly expanding the config registers for dirty tracking is not scalable; they are not init-time registers, so they do not follow the Appendix B guidelines.

4. Bitmap-based dirty tracking is not atomic between the CPU and the device.
Hence, it is racy.

5. All of the device-type-specific device context needed by a passthrough-based hypervisor is missing.
All of those can be used for non-passthrough as well.
[1] https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html

> And I
> remember this is a point-less topic as MST ever wants to mute another "pass-
> through" thread.
No, he did not say that.
He meant not to endlessly debate which one is better.
He clearly said to try to see if you can make the multiple-hypervisor model work.
And your series shows a clear ignorance of his guidance.


> >
> >>> +optionally the device is permitted to set the entire byte, which
> >>> +encompasses
> >> the relevant bit, to 1.
> >>> +
> >>> +The device MAY increase \field{gra_power} to reduce
> \field{bitmap_length}.
> >>> +
> >>> +The device must ignore any writes to \field{pasid} if PASID
> >>> +Extended Capability is absent or the PASID functionality is
> >>> +disabled in PASID Extended Capability
> >>
> >> I have to say this is going to work very badly when the number of
> >> dirty pages is
> >> small: you will end up scanning and re-scanning all of bitmap.
> >> And the resolution is apparently 8 pages? You have just multiplied
> >> the migration bandwidth by a factor of 8.
> > Yeah.
> > And device does not even know previously reported pages are read by driver
> or not. All guess work game for driver and device.
> see my reply to him
Please see above reply.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-03 14:57     ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-03 15:50       ` Parav Pandit
  2023-11-06  3:31         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 15:50 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Zhu, Lingshan
> Sent: Friday, November 3, 2023 8:27 PM
> 
> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, November 3, 2023 4:05 PM
> >>
> >> This patch adds two new le16 fields to common configuration structure
> >> to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> >>
> >> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >> ---
> >>   transport-pci.tex | 18 ++++++++++++++++++
> >>   1 file changed, 18 insertions(+)
> >>
> >> diff --git a/transport-pci.tex b/transport-pci.tex index
> >> a5c6719..3161519 100644
> >> --- a/transport-pci.tex
> >> +++ b/transport-pci.tex
> >> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport
> >>           /* About the administration virtqueue. */
> >>           le16 admin_queue_index;         /* read-only for driver */
> >>           le16 admin_queue_num;         /* read-only for driver */
> >> +
> >> +	/* Virtqueue state */
> >> +        le16 queue_avail_state;         /* read-write */
> >> +        le16 queue_used_state;          /* read-write */
> > This tiny interface for 128 virtio net queues through register read writes, does
> not work effectively.
> > There are inflight out of order descriptors for block also.
> > Hence toy registers like this do not work.
> Do you know there is a queue_select? Why this does not work? Do you know
> how other queue related fields work?
:)
Yes. If you notice, a critical spec bug fix related to queue_reset was made when it was introduced, so that live migration can _actually_ work.

When queue_select is done for 128 queues serially, it takes a lot of time to read this state through the slow register interface, and the same applies to inflight descriptors and more.
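
To make the serial access pattern concrete, here is a rough sketch that counts register accesses. It is a toy model, not driver code: `FakeCommonCfg` and `save_all_queue_state` are hypothetical names, and only the register names `queue_select`, `queue_avail_state` and `queue_used_state` come from the patch under discussion. The per-access latency is deliberately left out because it depends on the platform.

```python
class FakeCommonCfg:
    """Toy stand-in for the common configuration structure (hypothetical)."""

    def __init__(self, states):
        self._states = states        # list of (avail_state, used_state) per queue
        self._selected = 0
        self.accesses = 0            # number of register reads/writes issued

    def write_queue_select(self, index):
        self.accesses += 1
        self._selected = index

    def read_queue_avail_state(self):
        self.accesses += 1
        return self._states[self._selected][0]

    def read_queue_used_state(self):
        self.accesses += 1
        return self._states[self._selected][1]


def save_all_queue_state(cfg, num_queues):
    """Read every queue's state one queue at a time through queue_select."""
    saved = []
    for queue in range(num_queues):
        cfg.write_queue_select(queue)
        saved.append((cfg.read_queue_avail_state(), cfg.read_queue_used_state()))
    return saved


cfg = FakeCommonCfg([(q * 2, q * 2) for q in range(128)])
state = save_all_queue_state(cfg, 128)
print(len(state), "queues saved with", cfg.accesses, "register accesses")
```

In this model each queue costs one write and two reads, so the total grows linearly with the queue count, and that is before any inflight descriptor state is considered.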

> Like how to set a queue size and enable it?
Those are meant to be used before the DRIVER_OK stage, as they are init-time registers.
They are not meant to be abused further.

> >
> > Series [1] is comprehensive that covers it even if you consider non-
> passtrhough device migration model.
> > Where you can suspend individual queues using new admin command and get
> them in the device context state.
> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
> I suggest you read QEMU migration code. If you don't want to, not my fault.
> >
> >>   };
> >>   \end{lstlisting}
> >>
> >> @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport
> >>   	The value 0 indicates no supported administration virtqueues.
> >>   	This field is valid only if VIRTIO_F_ADMIN_VQ has been
> >>   	negotiated.
> >> +
> >> +\item[\field{queue_avail_state}]
> >> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> >> +        negotiated. The driver sets and gets the available state of
> >> +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> >> +
> >> +\item[\field{queue_used_state}]
> >> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> >> +        negotiated. The driver sets and gets the used state of the
> >> +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> >> +
> >>   \end{description}
> >>
> >>   \devicenormative{\paragraph}{Common configuration structure
> >> layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device
> >> Layout / Common configuration structure layout} @@ -488,6 +503,9 @@
> >> \subsubsection{Common configuration structure
> >> layout}\label{sec:Virtio Transport  present either a value of 0 or a power of 2
> in  \field{queue_size}.
> >>
> >> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST
> >> +ignore any accesses to \field{queue_avail_state} and
> \field{queue_used_state}.
> >> +
> >>   If VIRTIO_F_ADMIN_VQ has been negotiated, the value
> >> \field{admin_queue_index} MUST be equal to, or bigger than
> >> \field{num_queues}; also, \field{admin_queue_num} MUST be
> >> --
> >> 2.35.3
> 
> 


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-03 14:55     ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-03 15:54       ` Parav Pandit
  2023-11-06  3:29         ` [virtio-comment] " Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-03 15:54 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 3, 2023 8:25 PM
> 
> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, November 3, 2023 4:05 PM
> >>
> >> This patch introduces a new status bit in the device status: SUSPEND.
> >>
> >> This SUSPEND bit can be used by the driver to suspend a device, in
> >> order to stabilize the device states and virtqueue states.
> >>
> >> Its main use case is live migration.
> >>
> >> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >> Signed-off-by: Jason Wang <jasowang@redhat.com>
> > You constantly complained that whatever was proposed using admin
> commands method in [1] must work for passthrough and non-passthrough.
> >
> > And halfway in the discussion you propose a method after learning all the
> limitations of in-band, you propose a solution only works for non-passthrough
> mode.
> >
> > You asked someone to have comprehensive proposal and when it comes to
> you following it, you just don’t.
> not sure what you are talking about.
> > And have most shallow commit message to not even mention it.
> >
> > Please be consistent in design approach.
> > And if you don’t want to be, stop asking others.
> this SUSPEND/RESUME doesn't change since the RFC series, how can it not be
> inconsistent???
> >
> > This is not the way TC collaboration works.
> > I probably shouldn’t even expect this from you.

Your proposal does not cover both use cases, passthrough and non-passthrough.
Yet you kept demanding that of others.
This is just wrong.

I am aware that both models have technical pros and cons.

> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
> Please don't be so emotional and please be professional.
> 
> Why this solution can not work for pass-through? Do you know the device
> ownership will be transferred to the hypervisor when guest suspended in live
> migration?
I explained 5 reasons why it does not work in my previous reply.

As the term "live migration" indicates, the hypervisor needs to access the device while it is "live" (not just afterwards).
Hence, passthrough mode must be able to capture the device state and the dirty page database while the device is live
(and after the source is suspended).

> >
> >> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> ---
> >>   content.tex | 36 ++++++++++++++++++++++++++++++++++--
> >>   1 file changed, 34 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/content.tex b/content.tex index 76813b5..bcc9d4b 100644
> >> --- a/content.tex
> >> +++ b/content.tex
> >> @@ -49,6 +49,10 @@ \section{\field{Device Status}
> >> Field}\label{sec:Basic Facilities of a Virtio Dev
> >>
> >>   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has
> experienced
> >>     an error from which it can't recover.
> >> +
> >> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates
> >> +that the
> >> +  device has been suspended by the driver.
> >> +
> >>   \end{description}
> >>
> >>   The \field{device status} field starts out as 0, and is
> >> reinitialized to 0 by @@ -
> >> 73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic
> >> Facilities of a Virtio Dev  recover by issuing a reset.
> >>   \end{note}
> >>
> >> +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
> >> +
> >> +When setting SUSPEND, the driver MUST re-read \field{device status}
> >> +to ensure
> >> the SUSPEND bit is set.
> >> +
> >>   \devicenormative{\subsection}{Device Status Field}{Basic Facilities
> >> of a Virtio Device / Device Status Field}
> >>
> >>   The device MUST NOT consume buffers or send any used buffer @@
> >> -82,6
> >> +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic
> >> +Facilities of a
> >> Virtio Dev  that a reset is needed.  If DRIVER_OK is set, after it
> >> sets DEVICE_NEEDS_RESET, the device  MUST send a device configuration
> >> change notification to the driver.
> >>
> >> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
> >> +
> >> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
> >> +
> >> +The device SHOULD allow settings to \field{device status} even when
> >> +SUSPEND
> >> is set.
> >> +
> >> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device
> >> +SHOULD clear SUSPEND and resumes operation upon DRIVER_OK.
> >> +
> >> +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND, the
> >> +device SHOULD perform the following actions before presenting
> >> +SUSPEND bit
> >> in the \field{device status}:
> >> +
> >> +\begin{itemize}
> >> +\item Stop consuming buffers of any virtqueues and mark all finished
> >> descritors as used.
> >> +\item Wait until all descriptors that being processed to finish and
> >> +mark them
> >> as used.
> >> +\item Flush all used buffer and send used buffer notifications to the driver.
> >> +\item Record Virtqueue State of each enabled virtqueue, see section
> >> +\ref{sec:Virtqueues / Virtqueue State} \item Pause its operation
> >> +except \field{device status} and preserve configurations in its
> >> +Device Configuration Space, see \ref{sec:Basic Facilities of a
> >> +Virtio Device / Device Configuration Space} \end{itemize}
> >> +
> >>   \section{Feature Bits}\label{sec:Basic Facilities of a Virtio
> >> Device / Feature Bits}
> >>
> >>   Each virtio device offers all the features it understands.  During
> >> @@ -99,10
> >> +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a
> >> +Virtio Device /
> >> Feature B  \begin{description}
> >>   \item[0 to 23, and 50 to 127] Feature bits for the specific device
> >> type
> >>
> >> -\item[24 to 42] Feature bits reserved for extensions to the queue
> >> and
> >> +\item[24 to 43] Feature bits reserved for extensions to the queue
> >> +and
> >>     feature negotiation mechanisms
> >>
> >> -\item[43 to 49, and 128 and above] Feature bits reserved for future
> extensions.
> >> +\item[44 to 49, and 128 and above] Feature bits reserved for future
> >> extensions.
> >>   \end{description}
> >>
> >>   \begin{note}
> >> @@ -875,6 +903,10 @@ \chapter{Reserved Feature
> >> Bits}\label{sec:Reserved Feature Bits}
> >>     \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the
> >> device allows the driver
> >>     to access its internal virtqueue state.
> >>
> >> +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
> >> +   SUSPEND the device.
> >> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> >> +
> >>   \end{description}
> >>
> >>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature
> >> Bits}
> >> --
> >> 2.35.3


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 15:47         ` [virtio-comment] " Parav Pandit
@ 2023-11-05 16:12           ` Michael S. Tsirkin
  2023-11-06  3:58             ` Zhu, Lingshan
  2023-11-06  4:03             ` [virtio-comment] " Parav Pandit
  2023-11-06  3:52           ` Zhu, Lingshan
  1 sibling, 2 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-05 16:12 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
> > > [1]
> > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> > you still need to explain why this does not work for pass-through. 
> It does not work for following reasons.
> 1. Because all the fields that put on the member device are not in direct control of the hypervisor.
> The device is directly controlled by the guest including the device status and when it resets the device all the things stored in the device are lost.

I think the idea is that when this gateway is in the device then
device reset has to trap. At a high level, ok. But then what?
Is a full scan of all memory required until device reset is complete?
Drivers currently tend to busy poll the reset register,
if this takes very long we might start seeing soft lockup
messages. What is the idea then? Maybe for this we need a separate
weaker reset that does not touch this capability?

>
> 2. the PCI FLR is clearing all the registers you exposed here.

Same problem, though FLR at least is expected to take a long time.


> 3. Endless expansion of config registers of dirty tracking is not scalable, as they are not init time registers not following the Appendix B guidelines.

> 4. bitmap based dirty tracking is not atomic between cpu and device.
> Hence, it is racy.

Well, PCIe atomics exist. Not sure whether it's reasonable to rely on
them. Any data on how common implementations are?

> 5. All the device context needed for passthrough based hypervisor for a device type specific is missing.
> All of those can be used for non-passthrough as well.
> [1] https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
> 
> > And I
> > remember this is a point-less topic as MST ever wants to mute another "pass-
> > through" thread.
> No. he did not say that.
> He meant to not endlessly debate which one is better.
> He clearly said, try to see if you can make multiple hypervisor model work.
> And your series shows a clear ignorance of his guidance.

I think you mean "ignoring" :)


> 
> > >
> > >>> +optionally the device is permitted to set the entire byte, which
> > >>> +encompasses
> > >> the relevant bit, to 1.
> > >>> +
> > >>> +The device MAY increase \field{gra_power} to reduce
> > \field{bitmap_length}.
> > >>> +
> > >>> +The device must ignore any writes to \field{pasid} if PASID
> > >>> +Extended Capability is absent or the PASID functionality is
> > >>> +disabled in PASID Extended Capability
> > >>
> > >> I have to say this is going to work very badly when the number of
> > >> dirty pages is
> > >> small: you will end up scanning and re-scanning all of bitmap.
> > >> And the resolution is apparently 8 pages? You have just multiplied
> > >> the migration bandwidth by a factor of 8.
> > > Yeah.
> > > And device does not even know previously reported pages are read by driver
> > or not. All guess work game for driver and device.
> > see my reply to him
> Please see above reply.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 14:32     ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-05 16:16       ` Michael S. Tsirkin
  2023-11-06  4:06         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-05 16:16 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 10:32:59PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/3/2023 6:50 PM, Michael S. Tsirkin wrote:
> 
>     On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> 
>         +\item[\field{bitmap_addr}]
>         +       The driver use this to set the address of the bitmap which records the dirty pages
>         +       caused by the device.
>         +       Each bit in the bitmap represents one memory page, bit 0 in the bitmap
>         +       reprsents page 0 at address 0, bit 1 represents page 1, and so on in a linear manner.
>         +       When \field{enable} is set to 1 and the device writes to a memory page,
>         +       the device MUST set the corresponding bit to 1 which indicating the page is dirty.
>         +\item[\field{bitmap_length}]
>         +       The driver use this to set the length in bytes of the bitmap.
>         +\end{description}
>         +
>         +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory Dirty Pages Tracker Capability}
>         +
>         +The device MUST NOT set any bits beyond bitmap_length when reporting dirty pages.
>         +
>         +To prevent a read-modify-write procedure, if a memory page is dirty,
>         +optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1.
>         +
>         +The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
>         +
>         +The device must ignore any writes to \field{pasid} if PASID Extended Capability is absent or
>         +the PASID functionality is disabled in PASID Extended Capability
> 
> 
>     I have to say this is going to work very badly when the number of dirty
>     pages is small: you will end up scanning and re-scanning all of bitmap.
> 
> The driver needs to scan anyway,

Not with e.g. Parav's proposal - device reports individual pages
changed. This is analogous to PML.
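
A rough sketch of the difference in harvesting cost, under assumptions that are not part of either proposal: the bitmap is a flat `bytearray` where the least-significant bit of byte 0 is page 0, and the PML-style report is modelled as a plain list of page numbers. Both helper names are hypothetical.

```python
def scan_bitmap(bitmap):
    """Walk the whole bitmap; return (dirty page numbers, bytes examined)."""
    dirty = []
    for byte_index, value in enumerate(bitmap):
        if value == 0:
            continue                      # still had to look at this byte
        for bit in range(8):
            if value & (1 << bit):
                dirty.append(byte_index * 8 + bit)
    return dirty, len(bitmap)


def drain_log(log):
    """Consume a list of reported page numbers; return (pages, entries examined)."""
    return sorted(set(log)), len(log)


# 4096-byte bitmap = 32768 tracked pages, of which only three are dirty.
bitmap = bytearray(4096)
for page in (5, 1000, 32767):
    bitmap[page // 8] |= 1 << (page % 8)

pages_from_bitmap, bytes_examined = scan_bitmap(bitmap)
pages_from_log, entries_examined = drain_log([1000, 5, 32767])
print(pages_from_bitmap, "found after examining", bytes_examined, "bytes")
print(pages_from_log, "found after examining", entries_examined, "entries")
```

Both paths find the same pages, but the bitmap cost scales with the amount of tracked memory while the log cost scales with the number of pages actually written.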

> Intel production work with similar bitmap
> based dirty page tracking solution for years.

and then VMs became bigger and PML was introduced.

> Otherwise the device should report PFN which is not very practical.

Why not?

>     And the resolution is apparently 8 pages? You have just multiplied
>     the migration bandwidth by a factor of 8.
> 
> No, as described in the comments, the tracking granularity is controlled by
> \field{gra_power}, one bit represents a page with page_size = 2^(12 +
> gra_power). This can also be used to reduce the size of the bitmap.

.. at the cost of increasing migration bandwidth.
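
The trade-off can be put into numbers with a small sketch. The formula page_size = 2^(12 + gra_power) is taken from the quoted text; everything else is an assumption made for illustration: the bitmap is taken to cover guest memory linearly from address 0, sizes are rounded up, and the helper names are hypothetical.

```python
BASE_PAGE_SHIFT = 12  # 4 KiB base page, from page_size = 2^(12 + gra_power)


def tracked_page_size(gra_power):
    """Bytes covered by one bitmap bit."""
    return 1 << (BASE_PAGE_SHIFT + gra_power)


def bitmap_length_bytes(mem_bytes, gra_power):
    """Bitmap size needed to cover mem_bytes of guest memory, rounded up."""
    pages = -(-mem_bytes // tracked_page_size(gra_power))   # ceiling division
    return -(-pages // 8)


def amplification(gra_power, write_bytes=4096):
    """Bytes resent per dirty bit, relative to a single small device write."""
    return tracked_page_size(gra_power) // write_bytes


one_tib = 1 << 40
for gra_power in (0, 9):
    print(
        "gra_power", gra_power,
        "bitmap bytes", bitmap_length_bytes(one_tib, gra_power),
        "amplification", amplification(gra_power),
    )
```

For a 1 TiB guest in this model, moving from gra_power 0 to 9 shrinks the bitmap from 32 MiB to 64 KiB, while a single 4 KiB device write now causes 2 MiB to be resent.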

> "To prevent a read-modify-write procedure, if a memory page is dirty,
> optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1."
> 
> This is optional and DMA is very likely to write a neighbor page, and the device transmit a whole byte anyway
> when a bit is dirty.
> 
> How about we use platform dirty page tracking facility then implement this in virtio, as Jason suggested?
> 

Without something like PML it likely won't scale either.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 11:35     ` [virtio-comment] " Parav Pandit
  2023-11-03 15:02       ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-05 16:20       ` Michael S. Tsirkin
  2023-11-06  3:51         ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-05 16:20 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 03, 2023 at 11:35:03AM +0000, Parav Pandit wrote:
> And you never commented with sheer ignorance.

I think you mean ignoring here. ignorance is something else entirely.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-03 15:54       ` [virtio-comment] " Parav Pandit
@ 2023-11-06  3:29         ` Zhu, Lingshan
  2023-11-06  4:07           ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  3:29 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/3/2023 11:54 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 3, 2023 8:25 PM
>>
>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>
>>>> This patch introduces a new status bit in the device status: SUSPEND.
>>>>
>>>> This SUSPEND bit can be used by the driver to suspend a device, in
>>>> order to stabilize the device states and virtqueue states.
>>>>
>>>> Its main use case is live migration.
>>>>
>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>> You constantly complained that whatever was proposed using admin
>> commands method in [1] must work for passthrough and non-passthrough.
>>> And halfway in the discussion you propose a method after learning all the
>> limitations of in-band, you propose a solution only works for non-passthrough
>> mode.
>>> You asked someone to have comprehensive proposal and when it comes to
>> you following it, you just don’t.
>> not sure what you are talking about.
>>> And have most shallow commit message to not even mention it.
>>>
>>> Please be consistent in design approach.
>>> And if you don’t want to be, stop asking others.
>> this SUSPEND/RESUME doesn't change since the RFC series, how can it not be
>> inconsistent???
>>> This is not the way TC collaboration works.
>>> I probably shouldn’t even expect this from you.
> Your proposal does not cover both the use cases of passthrough and non-passthrough.
> Yet you kept demanding them for others.
> This is just wrong.
>
> I am aware that both models have technical pros and cons.
Why doesn't this work? The device status byte has been working for many
years. And do you know that when the guest freezes, the hypervisor owns
the device?
>
>>> [1]
>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
>> Please don't be so emotional and please be professional.
>>
>> Why this solution can not work for pass-through? Do you know the device
>> ownership will be transferred to the hypervisor when guest suspended in live
>> migration?
> I explained 5 reasons why it does not work in previous reply.
>
> As the word indicates "live migration", the hypervisor needs to access the device when it is "live" (not just after).
> Hence, passthrough mode must be able to capture the state of the device and dirty pages database when its live.
> (and after the source is suspended).
No, the hypervisor should only collect dirty pages while the device is alive.
As you can see, the dirty page tracking facility has a PASID for isolation.
But still, the question is whether we had better use platform dirty page
tracking.

Then suspend the device after the guest freezes, to stabilize the device
status, and then read the status.

How can you say this does not work?
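
The suspend step described above can be sketched as follows. This is only
an illustration against a mock device, not spec text or driver code: the
struct mock_device type and the mock_* and driver_suspend() helpers are
hypothetical names, and the mock simply delays presenting SUSPEND for a
configurable number of status reads. SUSPEND (16) is the value proposed in
patch 2/6; FEATURES_OK (8) and DRIVER_OK (4) are the existing status bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u
#define VIRTIO_STATUS_SUSPEND     16u  /* proposed in patch 2/6 */

/* Hypothetical mock of a device; a real driver uses transport accessors. */
struct mock_device {
    uint8_t status;
    bool suspend_negotiated;        /* VIRTIO_F_SUSPEND negotiated? */
    unsigned polls_until_suspended; /* status reads before SUSPEND shows up */
    bool suspend_pending;
};

static void mock_write_status(struct mock_device *d, uint8_t v)
{
    bool wants_suspend = (v & VIRTIO_STATUS_SUSPEND) != 0;
    bool allowed = d->suspend_negotiated &&
                   (d->status & VIRTIO_STATUS_FEATURES_OK) != 0;
    uint8_t presented = d->status & VIRTIO_STATUS_SUSPEND;

    /* SUSPEND is presented only once the mock has "quiesced". */
    d->status = (uint8_t)((v & ~VIRTIO_STATUS_SUSPEND) | presented);
    if (wants_suspend && allowed)
        d->suspend_pending = true;  /* otherwise the request is ignored */
}

static uint8_t mock_read_status(struct mock_device *d)
{
    if (d->suspend_pending) {
        if (d->polls_until_suspended > 0) {
            d->polls_until_suspended--;
        } else {
            d->status |= VIRTIO_STATUS_SUSPEND;
            d->suspend_pending = false;
        }
    }
    return d->status;
}

/* Set SUSPEND, then re-read the status until the bit is presented. */
static bool driver_suspend(struct mock_device *d, unsigned max_polls)
{
    uint8_t status = mock_read_status(d);

    if (!(status & VIRTIO_STATUS_FEATURES_OK))
        return false;               /* driver should not set SUSPEND yet */

    mock_write_status(d, (uint8_t)(status | VIRTIO_STATUS_SUSPEND));
    for (unsigned i = 0; i < max_polls; i++) {
        if (mock_read_status(d) & VIRTIO_STATUS_SUSPEND)
            return true;
    }
    return false;
}
```

Only after driver_suspend() returns true would the hypervisor go on to
read the virtqueue state; how long the re-read loop may take on real
hardware is not something this sketch says anything about.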
>
>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>> ---
>>>>    content.tex | 36 ++++++++++++++++++++++++++++++++++--
>>>>    1 file changed, 34 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/content.tex b/content.tex index 76813b5..bcc9d4b 100644
>>>> --- a/content.tex
>>>> +++ b/content.tex
>>>> @@ -49,6 +49,10 @@ \section{\field{Device Status}
>>>> Field}\label{sec:Basic Facilities of a Virtio Dev
>>>>
>>>>    \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has
>> experienced
>>>>      an error from which it can't recover.
>>>> +
>>>> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates
>>>> +that the
>>>> +  device has been suspended by the driver.
>>>> +
>>>>    \end{description}
>>>>
>>>>    The \field{device status} field starts out as 0, and is
>>>> reinitialized to 0 by @@ -
>>>> 73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic
>>>> Facilities of a Virtio Dev  recover by issuing a reset.
>>>>    \end{note}
>>>>
>>>> +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
>>>> +
>>>> +When setting SUSPEND, the driver MUST re-read \field{device status}
>>>> +to ensure
>>>> the SUSPEND bit is set.
>>>> +
>>>>    \devicenormative{\subsection}{Device Status Field}{Basic Facilities
>>>> of a Virtio Device / Device Status Field}
>>>>
>>>>    The device MUST NOT consume buffers or send any used buffer @@
>>>> -82,6
>>>> +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic
>>>> +Facilities of a
>>>> Virtio Dev  that a reset is needed.  If DRIVER_OK is set, after it
>>>> sets DEVICE_NEEDS_RESET, the device  MUST send a device configuration
>>>> change notification to the driver.
>>>>
>>>> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
>>>> +
>>>> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
>>>> +
>>>> +The device SHOULD allow settings to \field{device status} even when
>>>> +SUSPEND
>>>> is set.
>>>> +
>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device
>>>> +SHOULD clear SUSPEND and resumes operation upon DRIVER_OK.
>>>> +
>>>> +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND, the
>>>> +device SHOULD perform the following actions before presenting
>>>> +SUSPEND bit
>>>> in the \field{device status}:
>>>> +
>>>> +\begin{itemize}
>>>> +\item Stop consuming buffers of any virtqueues and mark all finished
>>>> descritors as used.
>>>> +\item Wait until all descriptors that being processed to finish and
>>>> +mark them
>>>> as used.
>>>> +\item Flush all used buffer and send used buffer notifications to the driver.
>>>> +\item Record Virtqueue State of each enabled virtqueue, see section
>>>> +\ref{sec:Virtqueues / Virtqueue State} \item Pause its operation
>>>> +except \field{device status} and preserve configurations in its
>>>> +Device Configuration Space, see \ref{sec:Basic Facilities of a
>>>> +Virtio Device / Device Configuration Space} \end{itemize}
>>>> +
>>>>    \section{Feature Bits}\label{sec:Basic Facilities of a Virtio
>>>> Device / Feature Bits}
>>>>
>>>>    Each virtio device offers all the features it understands.  During
>>>> @@ -99,10
>>>> +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a
>>>> +Virtio Device /
>>>> Feature B  \begin{description}
>>>>    \item[0 to 23, and 50 to 127] Feature bits for the specific device
>>>> type
>>>>
>>>> -\item[24 to 42] Feature bits reserved for extensions to the queue
>>>> and
>>>> +\item[24 to 43] Feature bits reserved for extensions to the queue
>>>> +and
>>>>      feature negotiation mechanisms
>>>>
>>>> -\item[43 to 49, and 128 and above] Feature bits reserved for future
>> extensions.
>>>> +\item[44 to 49, and 128 and above] Feature bits reserved for future
>>>> extensions.
>>>>    \end{description}
>>>>
>>>>    \begin{note}
>>>> @@ -875,6 +903,10 @@ \chapter{Reserved Feature
>>>> Bits}\label{sec:Reserved Feature Bits}
>>>>      \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the
>>>> device allows the driver
>>>>      to access its internal virtqueue state.
>>>>
>>>> +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
>>>> +   SUSPEND the device.
>>>> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
>>>> +
>>>>    \end{description}
>>>>
>>>>    \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature
>>>> Bits}
>>>> --
>>>> 2.35.3


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-03 15:50       ` Parav Pandit
@ 2023-11-06  3:31         ` Zhu, Lingshan
  2023-11-06  4:12           ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  3:31 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/3/2023 11:50 PM, Parav Pandit wrote:
>> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
>> open.org> On Behalf Of Zhu, Lingshan
>> Sent: Friday, November 3, 2023 8:27 PM
>>
>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>
>>>> This patch adds two new le16 fields to common configuration structure
>>>> to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>
>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> ---
>>>>    transport-pci.tex | 18 ++++++++++++++++++
>>>>    1 file changed, 18 insertions(+)
>>>>
>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>> a5c6719..3161519 100644
>>>> --- a/transport-pci.tex
>>>> +++ b/transport-pci.tex
>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure
>>>> layout}\label{sec:Virtio Transport
>>>>            /* About the administration virtqueue. */
>>>>            le16 admin_queue_index;         /* read-only for driver */
>>>>            le16 admin_queue_num;         /* read-only for driver */
>>>> +
>>>> +	/* Virtqueue state */
>>>> +        le16 queue_avail_state;         /* read-write */
>>>> +        le16 queue_used_state;          /* read-write */
>>> This tiny interface for 128 virtio net queues through register read writes, does
>> not work effectively.
>>> There are inflight out of order descriptors for block also.
>>> Hence toy registers like this do not work.
>> Do you know there is a queue_select? Why this does not work? Do you know
>> how other queue related fields work?
> :)
> Yes. If you notice queue_reset related critical spec bug fix was done when it was introduced so that live migration can _actually_ work.
>
> When queue_select is done for 128 queues serially, it take a lot of time to read those slow register interface for this + inflight descriptors + more.
Interesting. Virtio has worked in this pattern for many years, right?
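
For reference, the access pattern under discussion can be sketched like
this. It is a mock, not the PCI transport: struct mock_common_cfg and the
cfg_* helpers are hypothetical, and the accesses counter exists only to
make the per-queue register cost visible. The field names
queue_avail_state and queue_used_state are the ones from patch 4/6.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_MAX_QUEUES 128

/* Hypothetical mock of the queue-related part of the common config. */
struct mock_common_cfg {
    uint16_t queue_select;
    uint16_t avail_state[MOCK_MAX_QUEUES]; /* backs queue_avail_state */
    uint16_t used_state[MOCK_MAX_QUEUES];  /* backs queue_used_state */
    unsigned long accesses;                /* register accesses so far */
};

struct vq_state {
    uint16_t avail;
    uint16_t used;
};

static void cfg_select(struct mock_common_cfg *c, uint16_t q)
{
    c->queue_select = q;
    c->accesses++;
}

static uint16_t cfg_read_avail(struct mock_common_cfg *c)
{
    c->accesses++;
    return c->avail_state[c->queue_select];
}

static uint16_t cfg_read_used(struct mock_common_cfg *c)
{
    c->accesses++;
    return c->used_state[c->queue_select];
}

static void cfg_write_avail(struct mock_common_cfg *c, uint16_t v)
{
    c->accesses++;
    c->avail_state[c->queue_select] = v;
}

static void cfg_write_used(struct mock_common_cfg *c, uint16_t v)
{
    c->accesses++;
    c->used_state[c->queue_select] = v;
}

/* Source side: select each queue in turn and read both state fields. */
static int save_vq_states(struct mock_common_cfg *c, struct vq_state *out,
                          uint16_t nqueues)
{
    if (nqueues > MOCK_MAX_QUEUES)
        return -1;
    for (uint16_t q = 0; q < nqueues; q++) {
        cfg_select(c, q);
        out[q].avail = cfg_read_avail(c);
        out[q].used = cfg_read_used(c);
    }
    return 0;
}

/* Destination side: select each queue and write the saved state back. */
static int restore_vq_states(struct mock_common_cfg *c,
                             const struct vq_state *in, uint16_t nqueues)
{
    if (nqueues > MOCK_MAX_QUEUES)
        return -1;
    for (uint16_t q = 0; q < nqueues; q++) {
        cfg_select(c, q);
        cfg_write_avail(c, in[q].avail);
        cfg_write_used(c, in[q].used);
    }
    return 0;
}
```

In this mock, saving 128 queues costs three register accesses per queue,
384 in total. Whether that serial cost matters in practice is exactly the
point being argued; the sketch takes no side on it.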
>
>> Like how to set a queue size and enable it?
> Those are meant to be used before DRIVER_OK stage as they are init time registers.
> Not to keep abusing them..
Don't you need to set queue_size on the destination side?
>
>>> Series [1] is comprehensive that covers it even if you consider non-
>> passtrhough device migration model.
>>> Where you can suspend individual queues using new admin command and get
>> them in the device context state.
>>> [1]
>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
>> I suggest you read QEMU migration code. If you don't want to, not my fault.
>>>>    };
>>>>    \end{lstlisting}
>>>>
>>>> @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure
>>>> layout}\label{sec:Virtio Transport
>>>>    	The value 0 indicates no supported administration virtqueues.
>>>>    	This field is valid only if VIRTIO_F_ADMIN_VQ has been
>>>>    	negotiated.
>>>> +
>>>> +\item[\field{queue_avail_state}]
>>>> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
>>>> +        negotiated. The driver sets and gets the available state of
>>>> +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
>>>> +
>>>> +\item[\field{queue_used_state}]
>>>> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
>>>> +        negotiated. The driver sets and gets the used state of the
>>>> +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
>>>> +
>>>>    \end{description}
>>>>
>>>>    \devicenormative{\paragraph}{Common configuration structure
>>>> layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device
>>>> Layout / Common configuration structure layout} @@ -488,6 +503,9 @@
>>>> \subsubsection{Common configuration structure
>>>> layout}\label{sec:Virtio Transport  present either a value of 0 or a power of 2
>> in  \field{queue_size}.
>>>> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST
>>>> +ignore any accesses to \field{queue_avail_state} and
>> \field{queue_used_state}.
>>>> +
>>>>    If VIRTIO_F_ADMIN_VQ has been negotiated, the value
>>>> \field{admin_queue_index} MUST be equal to, or bigger than
>>>> \field{num_queues}; also, \field{admin_queue_num} MUST be
>>>> --
>>>> 2.35.3
>>


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-05 16:20       ` Michael S. Tsirkin
@ 2023-11-06  3:51         ` Parav Pandit
  0 siblings, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-06  3:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Sunday, November 5, 2023 9:50 PM
> 
> On Fri, Nov 03, 2023 at 11:35:03AM +0000, Parav Pandit wrote:
> > And you never commented with sheer ignorance.
> 
> I think you mean ignoring here. ignorance is something else entirely.
I am sorry. You are right. I meant ignoring.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 15:47         ` [virtio-comment] " Parav Pandit
  2023-11-05 16:12           ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-06  3:52           ` Zhu, Lingshan
  2023-11-06  4:34             ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  3:52 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/3/2023 11:47 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 3, 2023 8:33 PM
>>
>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>> Sent: Friday, November 3, 2023 4:20 PM
>>>>
>>>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>>>> +\item[\field{bitmap_addr}]
>>>>> +	The driver use this to set the address of the bitmap which records
>>>>> +the
>>>> dirty pages
>>>>> +	caused by the device.
>>>>> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
>>>>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so on
>>>>> +in a
>>>> linear manner.
>>>>> +	When \field{enable} is set to 1 and the device writes to a memory page,
>>>>> +	the device MUST set the corresponding bit to 1 which indicating
>>>>> +the
>>>> page is dirty.
>>>>> +\item[\field{bitmap_length}]
>>>>> +	The driver use this to set the length in bytes of the bitmap.
>>>>> +\end{description}
>>>>> +
>>>>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
>>>>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory
>>>>> +Dirty Pages Tracker Capability}
>>>>> +
>>>>> +The device MUST NOT set any bits beyond bitmap_length when
>>>>> +reporting
>>>> dirty pages.
>>>>> +
>>>>> +To prevent a read-modify-write procedure, if a memory page is
>>>>> +dirty,
>>> It is not to prevent; it is just not possible to do racy RMW. 😊
>> if you understand what is a atomic routine, you will not call it racy.
>>> Hence to work around you propose to mark all pages dirty. Too bad.
>>> This just does not work.
>> why? and this is optional.
> Because device cannot set individual bits in atomic way for same byte read by the cpu.
> 1. device read the byte that had bit 0 and 4 set.
> 2. cpu atomically clear these bits.
> 3. device wrote bits 0, 4, and new bits 2 and 3.
> 4. cpu now transferred page 0 and 4 again.
>
> Optional thing also needs to work. :)
Do you know that neither the CPU nor the device actually reads individual
bits? They read bytes.
Do you know the RC is connected to the memory controller?
Do you know there are locked transactions and atomic operations in PCI?
Do you know there are atomic read/write/clear, and even read-and-clear,
operations in the CPU ISA?
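
To make the disputed interleaving concrete, here is a single-threaded
replay of the four steps quoted above, next to the same sequence with an
atomic OR on the device side. It is a sketch using C11 atomics on the CPU;
it says nothing about which operations a PCI device can actually issue.
The function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Replay with a plain read followed by a plain write on the device side.
 * Returns the bitmap byte the CPU would find on its next scan; *cpu_seen
 * receives what the CPU collected with its read-and-clear.
 */
static uint8_t interleave_plain_rmw(uint8_t initial, uint8_t new_bits,
                                    uint8_t *cpu_seen)
{
    _Atomic uint8_t byte;
    uint8_t device_copy;

    atomic_init(&byte, initial);
    device_copy = atomic_load(&byte);          /* 1. device reads the byte */
    *cpu_seen = atomic_exchange(&byte, 0);     /* 2. CPU reads and clears */
    atomic_store(&byte, (uint8_t)(device_copy | new_bits)); /* 3. stale write */
    return atomic_load(&byte);                 /* 4. next CPU scan */
}

/* Same sequence, but the device update is a single atomic OR. */
static uint8_t interleave_atomic_or(uint8_t initial, uint8_t new_bits,
                                    uint8_t *cpu_seen)
{
    _Atomic uint8_t byte;

    atomic_init(&byte, initial);
    *cpu_seen = atomic_exchange(&byte, 0);     /* CPU reads and clears */
    atomic_fetch_or(&byte, new_bits);          /* device ORs in new bits */
    return atomic_load(&byte);
}
```

With bits 0 and 4 initially set (0x11) and new bits 2 and 3 (0x0c), the
plain variant leaves 0x1d, so pages 0 and 4 are reported a second time;
that costs extra transfer but does not lose a dirty page. The atomic
variant leaves only 0x0c.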
>
>>> Secondly the bitmap array is function is for full guest memory size, while
>> there is lot of sparce region now and also in future.
>>> This is the second problem.
>> did you see gra_power and its comments?
> gra_power says the page size.
> Not the sparce multiple ranges of the guest memory.
> Device endup tracking uninterested area as well.
Increasing gra_power reduces the bitmap size, right?
That is totally up to the hypervisor, right?
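
As a rough worked example of that trade-off, assume one bitmap bit covers
2^gra_power bytes and the bitmap spans the address range linearly from
address 0. Both are my reading of the quoted text, not definitions taken
from the patch, and bitmap_length_bytes() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of bitmap needed to cover mem_bytes of address space when each
 * bit tracks one unit of 2^gra_power bytes (assumed meaning of gra_power).
 */
static uint64_t bitmap_length_bytes(uint64_t mem_bytes, unsigned gra_power)
{
    uint64_t gran, units;

    if (gra_power >= 64)
        return 0;                   /* shift would be undefined */

    gran = UINT64_C(1) << gra_power;
    units = mem_bytes / gran + (mem_bytes % gran != 0); /* round up */
    return units / 8 + (units % 8 != 0);                /* bits to bytes */
}
```

Under these assumptions a 64 GiB span needs 2 MiB of bitmap at 4 KiB
granularity (gra_power = 12) and 4 KiB of bitmap at 2 MiB granularity
(gra_power = 21). The length depends on the span, not on how much of it is
populated, so a coarser granularity shrinks the bitmap but does not skip
sparse regions.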
>
>>> This is exactly why I asked you to review the page write recording series of
>> admin commands and comment.
>>> And you never commented with sheer ignorance.
>>>
>>> So clearly the start stop method for specific range and without bandwidth
>> explosion, admin commands of [1] stands better.
>>> If you do [1] on the member device also using its AQ in future, it will work for
>> non-passthrough case.
>>> If you build non-passthrough live migration using [1], also it will work.
>>> So I don’t see any point of this series anymore.
>> As Jason pointed out, there are many problems in your proposal, you should
>> answer there. I don't need to repeat his words and duplicate the discussions.
> Many are already addressed in v3.
Interesting. Does your V3 support nested virtualization?
>
>>> [1]
>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>> you still need to explain why this does not work for pass-through.
> It does not work for following reasons.
> 1. Because all the fields that put on the member device are not in direct control of the hypervisor.
> The device is directly controlled by the guest including the device status and when it resets the device all the things stored in the device are lost.
Have you seen the PASID field? And if the device is reset, it has to
forget everything, as expected, right?
>
> 2. the PCI FLR is clearing all the registers you exposed here.
see above
>
> 3. Endless expansion of config registers of dirty tracking is not scalable, as they are not init time registers not following the Appendix B guidelines.
Endless expansion? It is a complete set of dirty page tracking fields, right?
Have you seen that this capability only carries the controls? The device
writes the bitmap by DMA, not through registers.

Again, if you want to fix Appendix B, OK.
>
> 4. bitmap based dirty tracking is not atomic between cpu and device.
> Hence, it is racy.
see above, the first reply.
>
> 5. All the device context needed for passthrough based hypervisor for a device type specific is missing.
> All of those can be used for non-passthrough as well.
> [1] https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
If you want to discuss this again: I don't want to waste time, so I will
only ask whether you want to define a virtio-fs device context.
>
>> And I
>> remember this is a point-less topic as MST ever wants to mute another "pass-
>> through" thread.
> No. he did not say that.
> He meant to not endlessly debate which one is better.
> He clearly said, try to see if you can make multiple hypervisor model work.
> And your series shows a clear ignorance of his guidance.

Let me quote MST's reply here:
"I feel this discussion will keep meandering because the terminology is
vague. There's no single thing that is called "passthrough" -
vendors just build what is expedient with current hardware and
software. Nvidia has a bunch of people working on vfio so they
call that passthrough, Red Hat has people working on VDPA and
they call that passthrough, etc.


Before I mute this discussion for good, does anyone here have any
feeling progress is made? What kind of progress? "

So please don't discuss pass-through anymore.
It seems you are the only one who needs to develop the knowledge.
>
>
>>>>> +optionally the device is permitted to set the entire byte, which
>>>>> +encompasses
>>>> the relevant bit, to 1.
>>>>> +
>>>>> +The device MAY increase \field{gra_power} to reduce
>> \field{bitmap_length}.
>>>>> +
>>>>> +The device must ignore any writes to \field{pasid} if PASID
>>>>> +Extended Capability is absent or the PASID functionality is
>>>>> +disabled in PASID Extended Capability
>>>> I have to say this is going to work very badly when the number of
>>>> dirty pages is
>>>> small: you will end up scanning and re-scanning all of bitmap.
>>>> And the resolution is apparently 8 pages? You have just multiplied
>>>> the migration bandwidth by a factor of 8.
>>> Yeah.
>>> And device does not even know previously reported pages are read by driver
>> or not. All guess work game for driver and device.
>> see my reply to him
> Please see above reply.
see above
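
The "factor of 8" concern quoted above can be illustrated with a small
sketch that compares setting one bit per dirty page against setting the
whole byte, as the optional behaviour in the patch permits. The helper
names are hypothetical and the bitmap is an ordinary byte array, not
device memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mark exactly one page dirty: bit (page % 8) of byte (page / 8). */
static void mark_bit(uint8_t *bitmap, size_t page)
{
    bitmap[page / 8] |= (uint8_t)(1u << (page % 8));
}

/* Optional behaviour: set the entire byte that contains the page's bit. */
static void mark_whole_byte(uint8_t *bitmap, size_t page)
{
    bitmap[page / 8] = 0xff;
}

/* Number of pages the hypervisor would treat as dirty. */
static size_t count_reported(const uint8_t *bitmap, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        for (unsigned b = 0; b < 8; b++)
            n += (bitmap[i] >> b) & 1u;
    }
    return n;
}
```

For a single write to page 10, the per-bit variant reports 1 page and the
whole-byte variant reports 8. That is the worst case; when neighbouring
pages are dirty anyway the overhead is smaller.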


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-05 16:12           ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-06  3:58             ` Zhu, Lingshan
  2023-11-06 10:33               ` Michael S. Tsirkin
  2023-11-06  4:03             ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  3:58 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/6/2023 12:12 AM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
>>>> [1]
>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>> you still need to explain why this does not work for pass-through.
>> It does not work for following reasons.
>> 1. Because all the fields that put on the member device are not in direct control of the hypervisor.
>> The device is directly controlled by the guest including the device status and when it resets the device all the things stored in the device are lost.
> I think the idea is that when this gateway is in the device then
> device reset has to trap. At a high level, ok. But then what?
No. When the device is reset, it is expected to forget everything and
re-initialize.
> Is a full scan of all memory required until device reset is complete?
Who scans the memory? The device tracks its own dirty pages.
> Drivers currently tend to busy poll the reset register,
> if this takes very long we might start seeing soft lockup
> messages. What is the idea then? Maybe for this we need a separate
> weaker reset that does not touch this capability?
When the device is reset, how can we expect the live migration process to
keep running?

For example, if the device writes something by DMA and is then reset before
sending an interrupt, the DMA-ed pages should be lost, as expected, right?
>
>> 2. the PCI FLR is clearing all the registers you exposed here.
> Same problem, though FLR at least is expected to take a long time.
With FLR the whole device is reset, and this is PCI, not virtio.
As Jason pointed out, do you want to audit every PCI functionality?
>
>
>> 3. Endless expansion of config registers of dirty tracking is not scalable, as they are not init time registers not following the Appendix B guidelines.
>> 4. bitmap based dirty tracking is not atomic between cpu and device.
>> Hence, it is racy.
> Well pcie atomics exist. Not sure whether it's reasonable to rely on
> them. Any data on who common implementations are?
>
>> 5. All the device context needed for passthrough based hypervisor for a device type specific is missing.
>> All of those can be used for non-passthrough as well.
>> [1] https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
>>
>>> And I
>>> remember this is a point-less topic as MST ever wants to mute another "pass-
>>> through" thread.
>> No. he did not say that.
>> He meant to not endlessly debate which one is better.
>> He clearly said, try to see if you can make multiple hypervisor model work.
>> And your series shows a clear ignorance of his guidance.
> I think you mean "ignoring" :)
>
>
>>>>>> +optionally the device is permitted to set the entire byte, which
>>>>>> +encompasses
>>>>> the relevant bit, to 1.
>>>>>> +
>>>>>> +The device MAY increase \field{gra_power} to reduce
>>> \field{bitmap_length}.
>>>>>> +
>>>>>> +The device must ignore any writes to \field{pasid} if PASID
>>>>>> +Extended Capability is absent or the PASID functionality is
>>>>>> +disabled in PASID Extended Capability
>>>>> I have to say this is going to work very badly when the number of
>>>>> dirty pages is
>>>>> small: you will end up scanning and re-scanning all of bitmap.
>>>>> And the resolution is apparently 8 pages? You have just multiplied
>>>>> the migration bandwidth by a factor of 8.
>>>> Yeah.
>>>> And device does not even know previously reported pages are read by driver
>>> or not. All guess work game for driver and device.
>>> see my reply to him
>> Please see above reply.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-05 16:12           ` [virtio-comment] " Michael S. Tsirkin
  2023-11-06  3:58             ` Zhu, Lingshan
@ 2023-11-06  4:03             ` Parav Pandit
  2023-11-07 11:13               ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06  4:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment



> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Sunday, November 5, 2023 9:42 PM
> 
> On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
> > > > [1]
> > > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> > > you still need to explain why this does not work for pass-through.
> > It does not work for following reasons.
> > 1. Because all the fields that put on the member device are not in direct
> control of the hypervisor.
> > The device is directly controlled by the guest including the device status and
> when it resets the device all the things stored in the device are lost.
> 
> I think the idea is that when this gateway is in the device then device reset has
> to trap. At a high level, ok. But then what?
> Is a full scan of all memory required until device reset is complete?
> Drivers currently tend to busy poll the reset register, if this takes very long we
> might start seeing soft lockup messages. What is the idea then? Maybe for this
> we need a separate weaker reset that does not touch this capability?
>
You meant the gateway is not in the device, right?

I likely didn't understand. I don't see a relation to timing.

When the device reset is not trapped by the hypervisor, most things do not work; it requires trapping other things too, like the cvq, device registers and more.
It may be fine for those use cases, but it does not fulfill the requirements of passthrough mode in hardware.
  
> >
> > 2. the PCI FLR is clearing all the registers you exposed here.
> 
> Same problem, though FLR at least is expected to take a long time.
> 
> 
> > 3. Endless expansion of config registers of dirty tracking is not scalable, as they
> are not init time registers not following the Appendix B guidelines.
> 
> > 4. bitmap based dirty tracking is not atomic between cpu and device.
> > Hence, it is racy.
> 
> Well pcie atomics exist. Not sure whether it's reasonable to rely on them. Any
> data on who common implementations are?
>
PCIe atomics are (a) FetchAdd, (b) Swap, and (c) CAS.
There is no atomic OR.
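
For context, an OR can in principle be built from CAS with a retry loop.
The sketch below shows the technique with C11 atomics on the CPU;
fetch_or_via_cas() is a hypothetical helper, and nothing here claims that
a given device or root complex supports issuing CAS, or that retrying over
the bus would be acceptable for dirty tracking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Emulate fetch-or using only compare-and-swap.
 * Returns the value the word held just before the OR took effect.
 */
static uint32_t fetch_or_via_cas(_Atomic uint32_t *word, uint32_t bits)
{
    uint32_t old = atomic_load(word);

    /* On failure, 'old' is refreshed with the current value; retry. */
    while (!atomic_compare_exchange_weak(word, &old, old | bits)) {
        /* loop until no other writer intervened between load and CAS */
    }
    return old;
}
```

Each retry is another round trip, so under contention the cost is
unbounded in theory, which is part of why the choice of primitive is
being debated.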
 
> > 5. All the device context needed for passthrough based hypervisor for a
> device type specific is missing.
> > All of those can be used for non-passthrough as well.
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
> >
> > > And I
> > > remember this is a point-less topic as MST ever wants to mute
> > > another "pass- through" thread.
> > No. he did not say that.
> > He meant to not endlessly debate which one is better.
> > He clearly said, try to see if you can make multiple hypervisor model work.
> > And your series shows a clear ignorance of his guidance.
> 
> I think you mean "ignoring" :)
> 
I am sorry, yes, I meant ignoring.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-05 16:16       ` Michael S. Tsirkin
@ 2023-11-06  4:06         ` Zhu, Lingshan
  2023-11-06 10:22           ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  4:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/6/2023 12:16 AM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 10:32:59PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/3/2023 6:50 PM, Michael S. Tsirkin wrote:
>>
>>      On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>
>>          +\item[\field{bitmap_addr}]
>>          +       The driver use this to set the address of the bitmap which records the dirty pages
>>          +       caused by the device.
>>          +       Each bit in the bitmap represents one memory page, bit 0 in the bitmap
>>          +       reprsents page 0 at address 0, bit 1 represents page 1, and so on in a linear manner.
>>          +       When \field{enable} is set to 1 and the device writes to a memory page,
>>          +       the device MUST set the corresponding bit to 1 which indicating the page is dirty.
>>          +\item[\field{bitmap_length}]
>>          +       The driver use this to set the length in bytes of the bitmap.
>>          +\end{description}
>>          +
>>          +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker Capability}{Virtio Transport Options / Virtio Over PCI Bus / Memory Dirty Pages Tracker Capability}
>>          +
>>          +The device MUST NOT set any bits beyond bitmap_length when reporting dirty pages.
>>          +
>>          +To prevent a read-modify-write procedure, if a memory page is dirty,
>>          +optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1.
>>          +
>>          +The device MAY increase \field{gra_power} to reduce \field{bitmap_length}.
>>          +
>>          +The device must ignore any writes to \field{pasid} if PASID Extended Capability is absent or
>>          +the PASID functionality is disabled in PASID Extended Capability
>>
>>
>>      I have to say this is going to work very badly when the number of dirty
>>      pages is small: you will end up scanning and re-scanning all of bitmap.
>>
>> The driver needs to scan anyway,
> Not with e.g. Parav's proposal - device reports individual pages
> changed. This is analogous to PML.
In this proposal, the device uses DMA writes to the bitmap to record dirty pages;
this can easily be merged into the QEMU migration flow.
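
A minimal sketch of the two reporting styles being compared here, not taken from either proposal. With a bitmap the driver examines the whole bitmap however few pages are dirty, while a PML-style log is read in proportion to the number of reported pages. The names `scan_bitmap` and `read_page_log`, the 1 GiB guest size, and the LSB-first bit order are assumptions made for illustration.

```python
def scan_bitmap(bitmap: bytearray) -> tuple[list[int], int]:
    """Return (dirty page numbers, bytes examined) for a full bitmap scan."""
    dirty = []
    for byte_index, byte in enumerate(bitmap):
        if byte == 0:
            continue  # still counts as an examined byte
        for bit in range(8):
            if byte & (1 << bit):
                dirty.append(byte_index * 8 + bit)
    return dirty, len(bitmap)


def read_page_log(log: list[int]) -> tuple[list[int], int]:
    """Return (dirty page numbers, entries examined) for a PML-style log."""
    return list(log), len(log)


# Hypothetical guest: 1 GiB of memory tracked at 4 KiB pages, 3 dirty pages.
NUM_PAGES = (1 << 30) // 4096
bitmap = bytearray(NUM_PAGES // 8)
dirty_pages = [5, 70000, 262143]
for page in dirty_pages:
    bitmap[page // 8] |= 1 << (page % 8)  # assumed bit order: bit 0 is the LSB

found, bytes_examined = scan_bitmap(bitmap)
logged, entries_examined = read_page_log(dirty_pages)
```

The flip side, raised further down in this message, is that a log may carry repeated entries when the device keeps rewriting the same page, whereas a bitmap absorbs repeats for free.
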
>
>> Intel production work with similar bitmap
>> based dirty page tracking solution for years.
> and then VMs became bigger and PML was introduced.
So you agree we should track dirty pages through the platform facilities?
I am glad to hear that!
>
>> Otherwise the device should report PFN which is not very practical.
> Why not?
Really? The device reports PFNs?
What happens if the device keeps writing to a small piece of memory???
>
>>      And the resolution is apparently 8 pages? You have just multiplied
>>      the migration bandwidth by a factor of 8.
>>
>> No, as described in the comments, the tacking granularity is controlled by \
>> field{gra_power}, one bit represents a page with page_size = 2^(12 +
>> gra_power). This can also be used to reduce the size of the bitmap.
> .. at the cost of increasing migration bandwidth.
The device is very likely to write a neighboring page, and this happens
everywhere; for example, a CPU reads 64-byte aligned data.

This is a tradeoff.
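
The granularity rule quoted above (page_size = 2^(12 + gra_power)) can be worked through as a short sketch. It only restates that formula and the "bit N represents page N" layout from the patch; the helper names and the 4 GiB guest size are hypothetical, and the LSB-first position inside a byte is an assumption.

```python
def tracking_page_size(gra_power: int) -> int:
    """Bytes covered by one bitmap bit: 2^(12 + gra_power)."""
    return 1 << (12 + gra_power)


def bitmap_length_bytes(mem_bytes: int, gra_power: int) -> int:
    """Bitmap size in bytes needed to cover mem_bytes of guest memory."""
    num_pages = -(-mem_bytes // tracking_page_size(gra_power))  # ceiling division
    return -(-num_pages // 8)


def bit_position(addr: int, gra_power: int) -> tuple[int, int]:
    """Return (byte offset in bitmap, bit within that byte) for an address."""
    page_index = addr >> (12 + gra_power)
    return page_index // 8, page_index % 8


GUEST_MEM = 4 << 30  # hypothetical 4 GiB guest
fine = bitmap_length_bytes(GUEST_MEM, 0)    # 4 KiB granularity
coarse = bitmap_length_bytes(GUEST_MEM, 9)  # 2 MiB granularity
```

A larger gra_power shrinks the bitmap by the same factor by which it enlarges the region that a single dirty bit forces the hypervisor to resend.
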
>
>> "To prevent a read-modify-write procedure, if a memory page is dirty,
>> optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1."
>>
>> This is optional and DMA is very likely to write a neighbor page, and the device transmit a whole byte anyway
>> when a bit is dirty.
>>
>> How about we use platform dirty page tracking facility then implement this in virtio, as Jason suggested?
>>
> Without something like PML it likely won't scale either.
So that would be a platform issue which we don't need to take care of?
Intel VT-d can do this for sure.
>


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-06  3:29         ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-06  4:07           ` Parav Pandit
  2023-11-06  9:21             ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06  4:07 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 9:00 AM
> 
> On 11/3/2023 11:54 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, November 3, 2023 8:25 PM
> >>
> >> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>
> >>>> This patch introduces a new status bit in the device status: SUSPEND.
> >>>>
> >>>> This SUSPEND bit can be used by the driver to suspend a device, in
> >>>> order to stabilize the device states and virtqueue states.
> >>>>
> >>>> Its main use case is live migration.
> >>>>
> >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> >>> You constantly complained that whatever was proposed using admin
> >> commands method in [1] must work for passthrough and non-passthrough.
> >>> And halfway in the discussion you propose a method after learning
> >>> all the
> >> limitations of in-band, you propose a solution only works for
> >> non-passthrough mode.
> >>> You asked someone to have comprehensive proposal and when it comes
> >>> to
> >> you following it, you just don’t.
> >> not sure what you are talking about.
> >>> And have most shallow commit message to not even mention it.
> >>>
> >>> Please be consistent in design approach.
> >>> And if you don’t want to be, stop asking others.
> >> this SUSPEND/RESUME doesn't change since the RFC series, how can it
> >> not be inconsistent???
> >>> This is not the way TC collaboration works.
> >>> I probably shouldn’t even expect this from you.
> > Your proposal does not cover both the use cases of passthrough and non-
> passthrough.
> > Yet you kept demanding them for others.
> > This is just wrong.
> >
> > I am aware that both models as technical pros and cons.
> Why this doesn't work? the device status byte has been working for many
> years, and do you know when guest freeze, the hypervisor owns the device????

When the guest is not frozen, during the pre-copy phase, the hypervisor needs to access the device (context, dirty pages).
How does it work if the guest owns the device?

> >
> >>> [1]
> >>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472
> >>> .h
> >>> tml
> >> Please don't be so emotional and please be professional.
> >>
> >> Why this solution can not work for pass-through? Do you know the
> >> device ownership will be transferred to the hypervisor when guest
> >> suspended in live migration?
> > I explained 5 reasons why it does not work in previous reply.
> >
> > As the word indicates "live migration", the hypervisor needs to access the
> device when it is "live" (not just after).
> > Hence, passthrough mode must be able to capture the state of the device and
> dirty pages database when its live.
> > (and after the source is suspended).
> No, the hypervisor should only collect dirty pages when the device alive.

It is needed at both times:
while the device and guest are live during the pre-copy phase,
and after the device is frozen, to get the final round of pages.

> As you can see, the dirty page tracking facility has a PASID for isolation. But still,
> the question is, we should better use platform dirty page tracking
>
Nothing to do with PASID, as PASID is owned by the guest.
 
> Then suspend the device after guest freeze, to stabilize the device status, then
> read the status.
> 
> How can you say this does not work???
I explained above.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-06  3:31         ` Zhu, Lingshan
@ 2023-11-06  4:12           ` Parav Pandit
  2023-11-06  9:27             ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06  4:12 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 9:01 AM
> 
> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >> From: virtio-comment@lists.oasis-open.org
> >> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
> >> Sent: Friday, November 3, 2023 8:27 PM
> >>
> >> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>
> >>>> This patch adds two new le16 fields to common configuration
> >>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> >>>>
> >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>> ---
> >>>>    transport-pci.tex | 18 ++++++++++++++++++
> >>>>    1 file changed, 18 insertions(+)
> >>>>
> >>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>> a5c6719..3161519 100644
> >>>> --- a/transport-pci.tex
> >>>> +++ b/transport-pci.tex
> >>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> structure
> >>>> layout}\label{sec:Virtio Transport
> >>>>            /* About the administration virtqueue. */
> >>>>            le16 admin_queue_index;         /* read-only for driver */
> >>>>            le16 admin_queue_num;         /* read-only for driver */
> >>>> +
> >>>> +	/* Virtqueue state */
> >>>> +        le16 queue_avail_state;         /* read-write */
> >>>> +        le16 queue_used_state;          /* read-write */
> >>> This tiny interface for 128 virtio net queues through register read
> >>> writes, does
> >> not work effectively.
> >>> There are inflight out of order descriptors for block also.
> >>> Hence toy registers like this do not work.
> >> Do you know there is a queue_select? Why this does not work? Do you
> >> know how other queue related fields work?
> > :)
> > Yes. If you notice queue_reset related critical spec bug fix was done when it
> was introduced so that live migration can _actually_ work.
> >
> > When queue_select is done for 128 queues serially, it take a lot of time to
> read those slow register interface for this + inflight descriptors + more.
> interesting, virtio work in this pattern for many years, right?
In all these years, 400Gbps and 800Gbps virtio was not present, and such numbers of queues were not in hw.
Devices didn’t support LM.
Many limitations existed all these years, and the TC is improving on and expanding them.
So all these years do not matter.
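
A rough sketch of the serial access pattern under discussion: saving per-queue state through queue_select means one register write plus two register reads per virtqueue. `FakeCommonCfg` is a hypothetical stand-in that only counts accesses; the field names queue_select, queue_avail_state and queue_used_state come from the patch, everything else is assumed.

```python
class FakeCommonCfg:
    """Stand-in for the common configuration structure; counts register accesses."""

    def __init__(self, num_queues: int) -> None:
        self.num_queues = num_queues
        self.accesses = 0
        self._selected = 0
        # Placeholder per-queue state; a real device would report live indices.
        self._avail = [q % 65536 for q in range(num_queues)]
        self._used = [q % 65536 for q in range(num_queues)]

    def write_queue_select(self, queue: int) -> None:
        self.accesses += 1
        self._selected = queue

    def read_queue_avail_state(self) -> int:
        self.accesses += 1
        return self._avail[self._selected]

    def read_queue_used_state(self) -> int:
        self.accesses += 1
        return self._used[self._selected]


def save_all_vq_state(cfg: FakeCommonCfg) -> list[tuple[int, int]]:
    """Read (avail, used) state for every queue, one queue at a time."""
    states = []
    for queue in range(cfg.num_queues):
        cfg.write_queue_select(queue)
        states.append((cfg.read_queue_avail_state(), cfg.read_queue_used_state()))
    return states


cfg = FakeCommonCfg(num_queues=128)
states = save_all_vq_state(cfg)
```

The sketch says nothing about how long each access takes, and it does not cover the in-flight descriptors mentioned earlier in the thread.
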

> >
> >> Like how to set a queue size and enable it?
> > Those are meant to be used before DRIVER_OK stage as they are init time
> registers.
> > Not to keep abusing them..
> don't you need to set queue_size at the destination side?
No.
But the src/dst does not matter.
queue_size is to be set before DRIVER_OK like the rest of the registers, as all queues must be created before the DRIVER_OK phase.
queue_reset was a last-moment exception.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  3:52           ` Zhu, Lingshan
@ 2023-11-06  4:34             ` Parav Pandit
  2023-11-06  9:34               ` [virtio-comment] " Zhu, Lingshan
  2023-11-06 10:29               ` Michael S. Tsirkin
  0 siblings, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-06  4:34 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 9:22 AM
> 
> 
> On 11/3/2023 11:47 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, November 3, 2023 8:33 PM
> >>
> >> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>> Sent: Friday, November 3, 2023 4:20 PM
> >>>>
> >>>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> >>>>> +\item[\field{bitmap_addr}]
> >>>>> +	The driver use this to set the address of the bitmap which
> >>>>> +records the
> >>>> dirty pages
> >>>>> +	caused by the device.
> >>>>> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
> >>>>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so
> >>>>> +on in a
> >>>> linear manner.
> >>>>> +	When \field{enable} is set to 1 and the device writes to a memory page,
> >>>>> +	the device MUST set the corresponding bit to 1 which indicating
> >>>>> +the
> >>>> page is dirty.
> >>>>> +\item[\field{bitmap_length}]
> >>>>> +	The driver use this to set the length in bytes of the bitmap.
> >>>>> +\end{description}
> >>>>> +
> >>>>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
> >>>>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus /
> >>>>> +Memory Dirty Pages Tracker Capability}
> >>>>> +
> >>>>> +The device MUST NOT set any bits beyond bitmap_length when
> >>>>> +reporting
> >>>> dirty pages.
> >>>>> +
> >>>>> +To prevent a read-modify-write procedure, if a memory page is
> >>>>> +dirty,
> >>> It is not to prevent; it is just not possible to do racy RMW. 😊
> >> if you understand what is a atomic routine, you will not call it racy.
> >>> Hence to work around you propose to mark all pages dirty. Too bad.
> >>> This just does not work.
> >> why? and this is optional.
> > Because device cannot set individual bits in atomic way for same byte read by
> the cpu.
> > 1. device read the byte that had bit 0 and 4 set.
> > 2. cpu atomically clear these bits.
> > 3. device wrote bits 0, 4, and new bits 2 and 3.
> > 4. cpu now transferred page 0 and 4 again.
> >
> > Optional thing also needs to work. :)
> Do you know both CPU and device actually don't read bit, they read bytes????
Yes. This is why atomic_OR is not possible on PCIe.
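
The four-step interleaving quoted above can be replayed deterministically. This is a minimal single-threaded simulation of that scenario only, assuming the device updates the bitmap byte with a plain read, OR, and write; the variable names are made up for illustration.

```python
def dirty_pages(byte: int) -> list[int]:
    """List the page numbers whose bits are set in one bitmap byte."""
    return [bit for bit in range(8) if byte & (1 << bit)]


bitmap_byte = 0b00010001  # pages 0 and 4 are already marked dirty

# Step 1: the device reads the byte into a local copy.
device_copy = bitmap_byte

# Step 2: the CPU atomically fetches and clears the byte, then sends pages 0 and 4.
first_pass = dirty_pages(bitmap_byte)
bitmap_byte = 0

# Step 3: the device ORs in new pages 2 and 3 and writes its stale copy back.
device_copy |= 0b00001100
bitmap_byte = device_copy

# Step 4: the CPU scans again and sees pages 0 and 4 a second time.
second_pass = dirty_pages(bitmap_byte)
spurious = [page for page in second_pass if page in first_pass]

# For comparison: what an atomic OR into the cleared byte would have left behind.
atomic_or_result = dirty_pages(0 | 0b00001100)
```

In this interleaving the cost is two pages sent again without having been rewritten; the sketch does not explore other interleavings.
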

> Do you know RC connected to memory controller????
Yes.
> Do you know there are locked transaction and atomic operations in PCI???
Can you explain how PCI does RMW locked transaction?
Is it one TLP or multiple?

> Do you know there are atomic read/write/clear even read and clear and so on in
> CPU ISA????
A read is always atomic from the CPU.
I didn’t know about an atomic read_and_clear in the ISA. That would have to be combined with future PCI support for atomic_or.
If you already know a Linux kernel API for atomic_read_and_clear, please share.

> >
> >>> Secondly the bitmap array is function is for full guest memory size,
> >>> while
> >> there is lot of sparce region now and also in future.
> >>> This is the second problem.
> >> did you see gra_power and its comments?
> > gra_power says the page size.
> > Not the sparce multiple ranges of the guest memory.
> > Device endup tracking uninterested area as well.
> increase gra_power can reduce bitmap size, right?
> Totally up to the hypervisor, right?
Yes, and that can increase the amount of memory to transfer.
The way I understand it, if gra_power selects a 2MB granule, then the whole 2MB page is considered dirty even if only 8KB was dirty.
Did I understand it right?
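
The arithmetic behind this question, as a sketch under the formula quoted elsewhere in the thread (one bit covers 2^(12 + gra_power) bytes, so a 2MB granule corresponds to gra_power = 9). The helper name and the example address are hypothetical.

```python
def bytes_marked_dirty(dirty_ranges: list[tuple[int, int]], gra_power: int) -> int:
    """Total bytes the bitmap reports dirty for the given (start, length) writes."""
    shift = 12 + gra_power
    granules = set()
    for start, length in dirty_ranges:
        first = start >> shift
        last = (start + length - 1) >> shift
        granules.update(range(first, last + 1))
    return len(granules) << shift


# Hypothetical device write: 8 KiB starting at address 0x1000.
writes = [(0x1000, 8192)]
at_4k = bytes_marked_dirty(writes, 0)  # 4 KiB granules
at_2m = bytes_marked_dirty(writes, 9)  # 2 MiB granules
amplification = at_2m // at_4k
```
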

> >
> >>> This is exactly why I asked you to review the page write recording
> >>> series of
> >> admin commands and comment.
> >>> And you never commented with sheer ignorance.
> >>>
> >>> So clearly the start stop method for specific range and without
> >>> bandwidth
> >> explosion, admin commands of [1] stands better.
> >>> If you do [1] on the member device also using its AQ in future, it
> >>> will work for
> >> non-passthrough case.
> >>> If you build non-passthrough live migration using [1], also it will work.
> >>> So I don’t see any point of this series anymore.
> >> As Jason pointed out, there are many problems in your proposal, you
> >> should answer there. I don't need to repeat his words and duplicate the
> discussions.
> > Many are already addressed in v3.
> interesting, does your V3 support nested?
Not directly.
It is similar to CPU PML, which does not support nesting.
One can always implement nesting using some emulation.
The second option, for high performance, would be to allow the SR-IOV capability on the VF and support true nesting using the existing v3 proposal.

> >
> >>> [1]
> >>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475
> >>> .h
> >>> tml
> >> you still need to explain why this does not work for pass-through.
> > It does not work for following reasons.
> > 1. Because all the fields that put on the member device are not in direct
> control of the hypervisor.
> > The device is directly controlled by the guest including the device status and
> when it resets the device all the things stored in the device are lost.
> have you seen PASID? and if the device reset, it has to forget everything as
> expected, right?
PASID does not help with the reset because, as you said, reset resets everything.
PASID does not bifurcate the device's common control, which is not linked to any PASID.

> >
> > 2. the PCI FLR is clearing all the registers you exposed here.
> see above
> >
> > 3. Endless expansion of config registers of dirty tracking is not scalable, as they
> are not init time registers not following the Appendix B guidelines.
> endless expansion?? It is a complete set of dirty page tracking, right????
> have you see this cap only controls? The device DMA writes the bitmap, not by
> registers.
Device dirty page tracking is a start/stop command to be issued by the hypervisor.
So when the guest resets the device, it stops the DMA initiated by the hypervisor.
This fundamentally breaks things.

> 
> Again, if you want to fix Appendix B, OK.
> >
> > 4. bitmap based dirty tracking is not atomic between cpu and device.
> > Hence, it is racy.
> see above, the first reply.
> >
> > 5. All the device context needed for passthrough based hypervisor for a
> device type specific is missing.
> > All of those can be used for non-passthrough as well.
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.h
> > tml
> If you want to discuss this again, I don't want to wast time but only asking you
> whether you want to define virtio-fs device context
It will be defined in the future.
And if virtio-fs was not written with migration in mind, maybe someone will invent virtio-fs2.

> >
> >> And I
> >> remember this is a point-less topic as MST ever wants to mute another
> >> "pass- through" thread.
> > No. he did not say that.
> > He meant to not endlessly debate which one is better.
> > He clearly said, try to see if you can make multiple hypervisor model work.
> > And your series shows a clear ignorance of his guidance.
> 
> Let me quote MST's reply here:
> "I feel this discussion will keep meandering because the terminology is vague.
> There's no single thing that is called "passthrough" - vendors just build what is
> expedient with current hardware and software. Nvidia has a bunch of people
> working on vfio so they call that passthrough, Red Hat has people working on
> VDPA and they call that passthrough, etc.
> 
> 
> Before I mute this discussion for good, does anyone here have any feeling
> progress is made? What kind of progress? "
> 
> So please don't discuss on pass-through anymore.

I don’t want to discuss the pros and cons of passthrough vs. vdpa, as usual.
V3 covers the broader use case of passthrough; hence one can always implement trap+emulation instead of passthrough.
V3 already indicates that other variants of passthrough can be done as well, or that it can be extended.
So please explore whether that fits your vdpa need.

> It seems only you need to develop the knowledge

> >
> >
> >>>>> +optionally the device is permitted to set the entire byte, which
> >>>>> +encompasses
> >>>> the relevant bit, to 1.
> >>>>> +
> >>>>> +The device MAY increase \field{gra_power} to reduce
> >> \field{bitmap_length}.
> >>>>> +
> >>>>> +The device must ignore any writes to \field{pasid} if PASID
> >>>>> +Extended Capability is absent or the PASID functionality is
> >>>>> +disabled in PASID Extended Capability
> >>>> I have to say this is going to work very badly when the number of
> >>>> dirty pages is
> >>>> small: you will end up scanning and re-scanning all of bitmap.
> >>>> And the resolution is apparently 8 pages? You have just multiplied
> >>>> the migration bandwidth by a factor of 8.
> >>> Yeah.
> >>> And device does not even know previously reported pages are read by
> >>> driver
> >> or not. All guess work game for driver and device.
> >> see my reply to him
> > Please see above reply.
> see above


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-03 14:21     ` Zhu, Lingshan
@ 2023-11-06  9:16       ` Zhu, Lingshan
  2023-11-06 10:15         ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  9:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/3/2023 10:21 PM, Zhu, Lingshan wrote:
>
>
> On 11/3/2023 6:46 PM, Michael S. Tsirkin wrote:
>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>> +\begin{lstlisting}
>>> +struct virtio_pci_dity_page_track {
>>> +        u8 enable;               /* Read-Write */
>>> +        u8 gra_power;            /* Read-Write */
>>> +        u8 reserved[2];
>>> +        le32 {
>>> +            pasid: 20;           /* Read-Write */
>>> +            reserved: 12;
>>> +        };
>>> +        le64 bitmap_addr;        /* Read-Write */
>>> +        le64 bitmap_length;      /* Read-Write */
>>> +};
>>> +\end{lstlisting}
>> Okay, so it's a simple mailbox in config space.  Which by itself is
>> probably a very reasonable idea - more or less what I suggested.
>> However, using such a generic facility just for the dirty bitmap seems
>> too limited.  Please make it accept arbitrary commands. Reusing admin
>> command structure with a special "device itself" group sounds like one
>> way to do it.
> processing admin cmds in a cap may be too complex and overkill.
> we need to handle variable length of cmds, handle async returned 
> results, and so on.
>
> This struct seems easy and simple. And shall we use platform 
> facilities like vt-d
> to track dirty pages?
To demonstrate these issues, suppose we have a struct in a bar to 
process admin cmds:

struct virtio_admin_cmd {
         u64 in_data_length;
         u8 cmd_in_data[];
         u64 out_data_length;
         u8 cmd_out_data[];
         u8 ret;
};

The problems are:
1) cmd_in_data and cmd_out_data have variable length, so how
many HW resources should be reserved in the bar?
2) To process the cmds in the bar, the device MAY need to read many
registers in cmd_in_data[] and write many registers in cmd_out_data[],
which can be inefficient; this is not DMA.
3) a bar can only process one cmd at a time, and the driver can only
issue another cmd after receiving a ret.
This process has to be synchronous IO; one cmd blocks another.
4) A VF implementing a bar that processes admin cmds conflicts with the PF's admin vq.

So I think a bar or a cap processing admin cmds is way too complex and
overkill.
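
Problem 1) can be made concrete with a small layout calculation. This is only a sketch of the struct shown above, assuming its fields are packed back to back in the order listed with the widths given (u64 lengths, u8 ret); `admin_cmd_layout` is a hypothetical helper, not an interface from any proposal.

```python
import struct


def admin_cmd_layout(in_len: int, out_len: int) -> dict[str, int]:
    """Byte offsets of each field when the two data arrays have the given lengths."""
    u64 = struct.calcsize("<Q")  # 8 bytes, little-endian
    u8 = struct.calcsize("<B")   # 1 byte
    off_in_data = u64
    off_out_len = off_in_data + in_len
    off_out_data = off_out_len + u64
    off_ret = off_out_data + out_len
    return {
        "in_data_length": 0,
        "cmd_in_data": off_in_data,
        "out_data_length": off_out_len,
        "cmd_out_data": off_out_data,
        "ret": off_ret,
        "total": off_ret + u8,
    }


small = admin_cmd_layout(16, 32)
large = admin_cmd_layout(4096, 0)
```

The offsets of out_data_length and ret move with every command, so a fixed register window cannot be sized without also fixing a maximum for both lengths.
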

Thanks



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-06  4:07           ` [virtio-comment] " Parav Pandit
@ 2023-11-06  9:21             ` Zhu, Lingshan
  2023-11-06 10:52               ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  9:21 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/6/2023 12:07 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 9:00 AM
>>
>> On 11/3/2023 11:54 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, November 3, 2023 8:25 PM
>>>>
>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>
>>>>>> This patch introduces a new status bit in the device status: SUSPEND.
>>>>>>
>>>>>> This SUSPEND bit can be used by the driver to suspend a device, in
>>>>>> order to stabilize the device states and virtqueue states.
>>>>>>
>>>>>> Its main use case is live migration.
>>>>>>
>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>> You constantly complained that whatever was proposed using admin
>>>> commands method in [1] must work for passthrough and non-passthrough.
>>>>> And halfway in the discussion you propose a method after learning
>>>>> all the
>>>> limitations of in-band, you propose a solution only works for
>>>> non-passthrough mode.
>>>>> You asked someone to have comprehensive proposal and when it comes
>>>>> to
>>>> you following it, you just don’t.
>>>> not sure what you are talking about.
>>>>> And have most shallow commit message to not even mention it.
>>>>>
>>>>> Please be consistent in design approach.
>>>>> And if you don’t want to be, stop asking others.
>>>> this SUSPEND/RESUME doesn't change since the RFC series, how can it
>>>> not be inconsistent???
>>>>> This is not the way TC collaboration works.
>>>>> I probably shouldn’t even expect this from you.
>>> Your proposal does not cover both the use cases of passthrough and non-
>> passthrough.
>>> Yet you kept demanding them for others.
>>> This is just wrong.
>>>
>>> I am aware that both models as technical pros and cons.
>> Why this doesn't work? the device status byte has been working for many
>> years, and do you know when guest freeze, the hypervisor owns the device????
> When the guest is not frozen and during the pre-copy phase, hypervisor needs to access the device (context, dirty pages).
> How does it work if the guest owns the device?
Have you seen PASID there?
>
>>>>> [1]
>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472
>>>>> .h
>>>>> tml
>>>> Please don't be so emotional and please be professional.
>>>>
>>>> Why this solution can not work for pass-through? Do you know the
>>>> device ownership will be transferred to the hypervisor when guest
>>>> suspended in live migration?
>>> I explained 5 reasons why it does not work in previous reply.
>>>
>>> As the word indicates "live migration", the hypervisor needs to access the
>> device when it is "live" (not just after).
>>> Hence, passthrough mode must be able to capture the state of the device and
>> dirty pages database when its live.
>>> (and after the source is suspended).
>> No, the hypervisor should only collect dirty pages when the device alive.
> It is needed during both the times.
> When the device and guest is live during pre-copy phase.
> And after the device is frozen, to get the final round of pages.
With PASID, the dirty page tracking facility can be isolated from the guest,
which means the hypervisor owns this facility. So the hypervisor
can collect the dirty pages.

When the device is suspended, it should report the last round of dirty pages
through the dirty page tracking facility as expected.

This can work, right?
>
>> As you can see, the dirty page tracking facility has a PASID for isolation. But still,
>> the question is, we should better use platform dirty page tracking
>>
> Nothing to do with PASID, as PASID is owned by the guest.
It looks like you don't know how PASID works.

The host can set up PASID to isolate some facilities, right?
>   
>> Then suspend the device after guest freeze, to stabilize the device status, then
>> read the status.
>>
>> How can you say this does not work???
> I explained above.
see above


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-06  4:12           ` Parav Pandit
@ 2023-11-06  9:27             ` Zhu, Lingshan
  2023-11-06 10:52               ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  9:27 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/6/2023 12:12 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 9:01 AM
>>
>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>> From: virtio-comment@lists.oasis-open.org
>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>
>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>
>>>>>> This patch adds two new le16 fields to common configuration
>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>>>
>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> ---
>>>>>>     transport-pci.tex | 18 ++++++++++++++++++
>>>>>>     1 file changed, 18 insertions(+)
>>>>>>
>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>> a5c6719..3161519 100644
>>>>>> --- a/transport-pci.tex
>>>>>> +++ b/transport-pci.tex
>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>> structure
>>>>>> layout}\label{sec:Virtio Transport
>>>>>>             /* About the administration virtqueue. */
>>>>>>             le16 admin_queue_index;         /* read-only for driver */
>>>>>>             le16 admin_queue_num;         /* read-only for driver */
>>>>>> +
>>>>>> +	/* Virtqueue state */
>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>> This tiny interface for 128 virtio net queues through register read
>>>>> writes, does
>>>> not work effectively.
>>>>> There are inflight out of order descriptors for block also.
>>>>> Hence toy registers like this do not work.
>>>> Do you know there is a queue_select? Why this does not work? Do you
>>>> know how other queue related fields work?
>>> :)
>>> Yes. If you notice queue_reset related critical spec bug fix was done when it
>> was introduced so that live migration can _actually_ work.
>>> When queue_select is done for 128 queues serially, it take a lot of time to
>> read those slow register interface for this + inflight descriptors + more.
>> interesting, virtio work in this pattern for many years, right?
> All these years 400Gbps and 800Gbps virtio was not present, number of queues were not in hw.
These registers are control path in config space; how would 400G or 800G
affect them? See the virtio common cfg: the max number of vqs is already
there, in num_queues.
> Device didn’t support LM.
> Many limitations existed all these years and TC is improving and expanding them.
> So all these years do not matter.
Not sure what you are talking about. Haven't we initialized the device
and vqs through config space for years? What's wrong with this mechanism?
Are you questioning virtio-pci fundamentals?
>
>>>> Like how to set a queue size and enable it?
>>> Those are meant to be used before DRIVER_OK stage as they are init time
>> registers.
>>> Not to keep abusing them..
>> don't you need to set queue_size at the destination side?
> No.
> But the src/dst does not matter.
> Queue_size to be set before DRIVER_OK like rest of the registers, as all queues must be created before the driver_ok phase.
> Queue_reset was last moment exception.
Create a queue? Is that Nvidia specific?

For standard virtio, you need to read the number of enabled vqs at the
source side, then enable them at the dst, so queue_size matters. This is
enabling existing queues, not creating them.
>
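To make the access pattern under discussion concrete, here is a minimal
sketch of saving per-queue state through queue_select. The field names
queue_avail_state and queue_used_state come from the patch and queue_select
is the existing common cfg selector; struct mock_device, the helper names,
and the access counter are hypothetical stand-ins for a real device and are
not part of the proposal.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_QUEUES 128   /* queue count used as the example in the thread */

/* Hypothetical in-memory stand-in for a device: per-queue state that a
 * real device would expose through the common configuration window. */
struct mock_device {
    uint16_t queue_select;               /* existing common cfg selector */
    uint16_t avail_state[NUM_QUEUES];    /* backs queue_avail_state */
    uint16_t used_state[NUM_QUEUES];     /* backs queue_used_state */
    unsigned long accesses;              /* counts register reads/writes */
};

struct vq_state {
    uint16_t avail;
    uint16_t used;
};

/* Each helper models exactly one register access. */
static void write_queue_select(struct mock_device *d, uint16_t q)
{
    d->queue_select = q;
    d->accesses++;
}

static uint16_t read_queue_avail_state(struct mock_device *d)
{
    d->accesses++;
    return d->avail_state[d->queue_select];
}

static uint16_t read_queue_used_state(struct mock_device *d)
{
    d->accesses++;
    return d->used_state[d->queue_select];
}

/* Save the state of every queue: one select write plus two reads each. */
static void save_all_queue_state(struct mock_device *d,
                                 struct vq_state *out, uint16_t nqueues)
{
    for (uint16_t q = 0; q < nqueues; q++) {
        write_queue_select(d, q);
        out[q].avail = read_queue_avail_state(d);
        out[q].used = read_queue_used_state(d);
    }
}
```

For 128 queues this is 384 serialized register accesses. Whether that cost
is acceptable during the migration downtime window is the point being
argued above; the sketch only shows where the number comes from.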




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  4:34             ` [virtio-comment] " Parav Pandit
@ 2023-11-06  9:34               ` Zhu, Lingshan
  2023-11-06 10:52                 ` [virtio-comment] " Parav Pandit
  2023-11-06 11:13                 ` [virtio-comment] " Parav Pandit
  2023-11-06 10:29               ` Michael S. Tsirkin
  1 sibling, 2 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  9:34 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/6/2023 12:34 PM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 9:22 AM
>>
>>
>> On 11/3/2023 11:47 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, November 3, 2023 8:33 PM
>>>>
>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>> Sent: Friday, November 3, 2023 4:20 PM
>>>>>>
>>>>>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>>>>>> +\item[\field{bitmap_addr}]
>>>>>>> +	The driver use this to set the address of the bitmap which
>>>>>>> +records the
>>>>>> dirty pages
>>>>>>> +	caused by the device.
>>>>>>> +	Each bit in the bitmap represents one memory page, bit 0 in the bitmap
>>>>>>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so
>>>>>>> +on in a
>>>>>> linear manner.
>>>>>>> +	When \field{enable} is set to 1 and the device writes to a memory page,
>>>>>>> +	the device MUST set the corresponding bit to 1 which indicating
>>>>>>> +the
>>>>>> page is dirty.
>>>>>>> +\item[\field{bitmap_length}]
>>>>>>> +	The driver use this to set the length in bytes of the bitmap.
>>>>>>> +\end{description}
>>>>>>> +
>>>>>>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
>>>>>>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus /
>>>>>>> +Memory Dirty Pages Tracker Capability}
>>>>>>> +
>>>>>>> +The device MUST NOT set any bits beyond bitmap_length when
>>>>>>> +reporting
>>>>>> dirty pages.
>>>>>>> +
>>>>>>> +To prevent a read-modify-write procedure, if a memory page is
>>>>>>> +dirty,
>>>>> It is not to prevent; it is just not possible to do racy RMW. 😊
>>>> if you understand what is a atomic routine, you will not call it racy.
>>>>> Hence to work around you propose to mark all pages dirty. Too bad.
>>>>> This just does not work.
>>>> why? and this is optional.
>>> Because device cannot set individual bits in atomic way for same byte read by
>> the cpu.
>>> 1. device read the byte that had bit 0 and 4 set.
>>> 2. cpu atomically clear these bits.
>>> 3. device wrote bits 0, 4, and new bits 2 and 3.
>>> 4. cpu now transferred page 0 and 4 again.
>>>
>>> Optional thing also needs to work. :)
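The four steps listed above can be replayed deterministically. This is a
minimal single-threaded sketch, assuming the device updates the bitmap with
a plain read followed by a plain write while the CPU harvests bits with an
atomic exchange; replay_bitmap_race and struct race_result are hypothetical
names used only for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

struct race_result {
    uint8_t harvested;   /* bits the CPU collected and will migrate */
    uint8_t final_byte;  /* bitmap byte after the device write lands */
};

static struct race_result replay_bitmap_race(void)
{
    _Atomic uint8_t bitmap_byte = 0x11;   /* bits 0 and 4 already dirty */
    struct race_result r;

    /* 1. device reads the byte (its read-modify-write begins) */
    uint8_t device_copy = atomic_load(&bitmap_byte);

    /* 2. CPU atomically fetches and clears the byte */
    r.harvested = atomic_exchange(&bitmap_byte, 0);

    /* 3. device writes back its stale copy plus new bits 2 and 3 */
    atomic_store(&bitmap_byte, (uint8_t)(device_copy | 0x0C));

    /* 4. bits 0 and 4 are set again although the CPU already took them */
    r.final_byte = atomic_load(&bitmap_byte);
    return r;
}
```

In this interleaving the CPU harvests 0x11 and the byte ends up as 0x1D, so
pages 0 and 4 get transferred a second time. Each individual access is
atomic; the problem is that the device's read and write are two separate
operations with the CPU's exchange in between.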
>> Do you know both CPU and device actually don't read bit, they read bytes????
> Yes. this is why atomic_OR is not possible on pcie.
>
>> Do you know RC connected to memory controller????
> Yes.
>> Do you know there are locked transaction and atomic operations in PCI???
> Can you explain how PCI does RMW locked transaction?
> Is it one TLP or multiple?
>
>> Do you know there are atomic read/write/clear even read and clear and so on in
>> CPU ISA????
> Read is always atomic from cpu.
> I didn’t know about read_and_clear atomic ISA. This combined with pci future support for atomic_or.
> If you already know a Linux kernel api for atomic_read_and_clear, please share.
To answer all the questions above, you should read the PCI spec and the
CPU SDM. We don't copy and paste their content here; nobody develops
their knowledge this way.
>
>>>>> Secondly the bitmap array is function is for full guest memory size,
>>>>> while
>>>> there is lot of sparce region now and also in future.
>>>>> This is the second problem.
>>>> did you see gra_power and its comments?
>>> gra_power says the page size.
>>> Not the sparce multiple ranges of the guest memory.
>>> Device endup tracking uninterested area as well.
>> increase gra_power can reduce bitmap size, right?
>> Totally up to the hypervisor, right?
> Yes, and that can increase the amount of memory.
> The way I understood is, if gra_power is 2MB, than whole 2MB page to be considered dirty, even if 8KB was dirty.
> Did I understand it right?
Do you know that DMA is very likely to write to a neighboring page? Do you
know why huge pages were introduced?
Hint: not only to reduce TLB misses.
>
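The bitmap geometry being discussed can be sketched as follows. This
assumes gra_power is the log2 of the tracked page size in bytes and that
bit 0 of the bitmap is the least significant bit of byte 0; neither detail
is spelled out in the quoted patch text, and the function names are
hypothetical.

```c
#include <stdint.h>

/* Assumption: gra_power is log2 of the tracked page size in bytes. */
static uint64_t dirty_bit_index(uint64_t addr, unsigned gra_power)
{
    return addr >> gra_power;
}

/* Byte of the bitmap that holds the bit for addr. */
static uint64_t dirty_byte_offset(uint64_t addr, unsigned gra_power)
{
    return dirty_bit_index(addr, gra_power) / 8;
}

/* Mask within that byte (assumption: bit 0 is the least significant bit). */
static uint8_t dirty_bit_mask(uint64_t addr, unsigned gra_power)
{
    return (uint8_t)(1u << (dirty_bit_index(addr, gra_power) % 8));
}

/* Bytes of bitmap needed to cover mem_size bytes starting at address 0. */
static uint64_t bitmap_length_bytes(uint64_t mem_size, unsigned gra_power)
{
    uint64_t page = (uint64_t)1 << gra_power;
    uint64_t pages = (mem_size + page - 1) >> gra_power;
    return (pages + 7) / 8;
}
```

Under these assumptions, 4 GiB of guest memory needs a 128 KiB bitmap at
gra_power 12 (4 KiB pages) and 256 bytes at gra_power 21 (2 MiB pages). The
smaller bitmap comes at the price that a single small write marks a whole
2 MiB granule dirty, which is the trade-off raised above.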
>>>>> This is exactly why I asked you to review the page write recording
>>>>> series of
>>>> admin commands and comment.
>>>>> And you never commented with sheer ignorance.
>>>>>
>>>>> So clearly the start stop method for specific range and without
>>>>> bandwidth
>>>> explosion, admin commands of [1] stands better.
>>>>> If you do [1] on the member device also using its AQ in future, it
>>>>> will work for
>>>> non-passthrough case.
>>>>> If you build non-passthrough live migration using [1], also it will work.
>>>>> So I don’t see any point of this series anymore.
>>>> As Jason pointed out, there are many problems in your proposal, you
>>>> should answer there. I don't need to repeat his words and duplicate the
>> discussions.
>>> Many are already addressed in v3.
>> interesting, does your V3 support nested?
> Not directly.
> Is it similar to cpu PML which does not supported nested.
> One can always implement nested using some emulation.
> The second option for high performance would be allow SR-IOV cap on the VF and support true nesting using existing proposal of v3.
If your proposal does not support nested, then it is incomplete.
>
>>>>> [1]
>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>>> you still need to explain why this does not work for pass-through.
>>> It does not work for following reasons.
>>> 1. Because all the fields that put on the member device are not in direct
>> control of the hypervisor.
>>> The device is directly controlled by the guest including the device status and
>> when it resets the device all the things stored in the device are lost.
>> have you seen PASID? and if the device reset, it has to forget everything as
>> expected, right?
> PASID does not help with the reset. Because as you told reset, resets everything.
> PASID does not bifurcate the device common control which is not linked to any PASID.
PASID means some facilities can be isolated. When reset, the device
forgets everything.
>
>>> 2. the PCI FLR is clearing all the registers you exposed here.
>> see above
>>> 3. Endless expansion of config registers of dirty tracking is not scalable, as they
>> are not init time registers not following the Appendix B guidelines.
>> endless expansion?? It is a complete set of dirty page tracking, right????
>> have you see this cap only controls? The device DMA writes the bitmap, not by
>> registers.
> Device dirty page tracking is start/stop command to be done by the hypervisor.
> So when guest is resetting the device, it stopped the DMA initiated by the hypervisor.
> This fundamentally breaks things.
Why? When the device resets, do you want to keep tracking dirty pages?
>
>> Again, if you want to fix Appendix B, OK.
>>> 4. bitmap based dirty tracking is not atomic between cpu and device.
>>> Hence, it is racy.
>> see above, the first reply.
>>> 5. All the device context needed for passthrough based hypervisor for a
>> device type specific is missing.
>>> All of those can be used for non-passthrough as well.
>>> [1]
>>> https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
>> If you want to discuss this again, I don't want to wast time but only asking you
>> whether you want to define virtio-fs device context
> It will be defined in future.
> And if virtio-fs was not written with migration in mind, may be one will invent virtio-fs2.
don't say future, talk is cheap, show me the code.
>
>>>> And I
>>>> remember this is a point-less topic as MST ever wants to mute another
>>>> "pass- through" thread.
>>> No. he did not say that.
>>> He meant to not endlessly debate which one is better.
>>> He clearly said, try to see if you can make multiple hypervisor model work.
>>> And your series shows a clear ignorance of his guidance.
>> Let me quote MST's reply here:
>> "I feel this discussion will keep meandering because the terminology is vague.
>> There's no single thing that is called "passthrough" - vendors just build what is
>> expedient with current hardware and software. Nvidia has a bunch of people
>> working on vfio so they call that passthrough, Red Hat has people working on
>> VDPA and they call that passthrough, etc.
>>
>>
>> Before I mute this discussion for good, does anyone here have any feeling
>> progress is made? What kind of progress? "
>>
>> So please don't discuss on pass-through anymore.
> I don’t want to discuss the pros and cons of passthrough vs, vdpa, as usual.
> V3 covers broader use case of passthrough, hence once can always implement trap+emulation instead of passthrough.
> V3 already indicates that other variants of the passthrough can be done as well or can be extended.
> So please explore if that fits your vdpa need.
So, please no pass-through discussion anymore.
>
>> It seems only you need to develop the knowledge
>>>
>>>>>>> +optionally the device is permitted to set the entire byte, which
>>>>>>> +encompasses
>>>>>> the relevant bit, to 1.
>>>>>>> +
>>>>>>> +The device MAY increase \field{gra_power} to reduce
>>>> \field{bitmap_length}.
>>>>>>> +
>>>>>>> +The device must ignore any writes to \field{pasid} if PASID
>>>>>>> +Extended Capability is absent or the PASID functionality is
>>>>>>> +disabled in PASID Extended Capability
>>>>>> I have to say this is going to work very badly when the number of
>>>>>> dirty pages is
>>>>>> small: you will end up scanning and re-scanning all of bitmap.
>>>>>> And the resolution is apparently 8 pages? You have just multiplied
>>>>>> the migration bandwidth by a factor of 8.
>>>>> Yeah.
>>>>> And device does not even know previously reported pages are read by
>>>>> driver
>>>> or not. All guess work game for driver and device.
>>>> see my reply to him
>>> Please see above reply.
>> see above




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-03 14:49     ` Zhu, Lingshan
@ 2023-11-06  9:35       ` Michael S. Tsirkin
  2023-11-06  9:42         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06  9:35 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 10:49:42PM +0800, Zhu, Lingshan wrote:
>         +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
>         +in \field{Available State} field,
>         +and correspondingly restore the Available State of every enabled splited virtqueue
>         +from \field{Available State} field when DRIVER_OK is set.
>         +
>         +The device SHOULD reset \field{Available State} field upon a device reset.
> 
>     At this point I have no idea
>     - how can a state of a virtqueue at a random time be represented
>       by a 16 bit integer
> 
> not sure what is a random time, this is to request the device to reset
> its avail state, for example, it is "le16 queue_avail_state" in virtio-pci
> common cfg. Resetting this so the device will not recover from a wrong value of
> the last run.

You simply never bother to say what is "Available State" and what
does it mean to restore it.  Not to mention words like "splited"
which just adds to the confusion.

>     - if it's not at a random time then why do you even need an integer -
>       synchronize queue to memory and then all state is in memory
> 
> Not sure what is a sync queue, but for example, "le16 queue_avail_state" for
> PCI transport exists in a cap.

I just point out that normally a lot of ring state is in memory.
So you need to be much more specific about how you are augmenting that.
For example, if buffers are used exactly in order for a split ring
then used index seems to be exactly the same as last available index
you describe - it's a free running counter. OTOH if they are not
used in order then I don't see how is a single index sufficient to
describe which ones have been used and which not.
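A small sketch of the index arithmetic behind this point, assuming a split
ring where both the device's last available index and the used index are
free-running 16-bit counters. The helper names are hypothetical.

```c
#include <stdint.h>

/* Buffers fetched by the device but not yet returned as used.
 * Both indices are free-running 16-bit counters, so the subtraction
 * is done modulo 2^16. */
static uint16_t inflight_count(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}

/* Replay a run in which the device fetched `fetched` buffers and completed
 * those whose position is set in `done_mask` (bit i = i-th fetched buffer).
 * Returns the used index the device would publish. */
static uint16_t used_idx_after(uint16_t start, unsigned fetched,
                               uint32_t done_mask)
{
    uint16_t used = start;
    for (unsigned i = 0; i < fetched && i < 32; i++)
        if (done_mask & (1u << i))
            used++;
    return used;
}
```

With in-order completion and nothing in flight, the two indices are equal
and the extra field carries no new information. With out-of-order
completion, finishing buffers {1, 3} or buffers {0, 2} out of four fetched
both yield used index 2 and last available index 4, so the index pair alone
cannot say which two buffers are still in flight.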




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-06  9:35       ` Michael S. Tsirkin
@ 2023-11-06  9:42         ` Zhu, Lingshan
  2023-11-06  9:45           ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-06  9:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/6/2023 5:35 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 10:49:42PM +0800, Zhu, Lingshan wrote:
>>          +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
>>          +in \field{Available State} field,
>>          +and correspondingly restore the Available State of every enabled splited virtqueue
>>          +from \field{Available State} field when DRIVER_OK is set.
>>          +
>>          +The device SHOULD reset \field{Available State} field upon a device reset.
>>
>>      At this point I have no idea
>>      - how can a state of a virtqueue at a random time be represented
>>        by a 16 bit integer
>>
>> not sure what is a random time, this is to request the device to reset
>> its avail state, for example, it is "le16 queue_avail_state" in virtio-pci
>> common cfg. Resetting this so the device will not recover from a wrong value of
>> the last run.
> You simply never bother to say what is "Available State" and what
> does it mean to restore it.  Not to mention words like "splited"
> which just adds to the confusion.
It says:
+The available state field is two bytes of virtqueue state that is used by
+the device to read the next available buffer. It is presented in the 
following format:

Do you want me to add more descriptions?

>
>>      - if it's not at a random time then why do you even need an integer -
>>        synchronize queue to memory and then all state is in memory
>>
>> Not sure what is a sync queue, but for example, "le16 queue_avail_state" for
>> PCI transport exists in a cap.
> I just point out that normally a lot of ring state is in memory.
> So you need to be much more specific about how you are augmenting that.
> For example, if buffers are used exactly in order for a split ring
> then used index seems to be exactly the same as last available index
> you describe - it's a free running counter. OTOH if they are not
> used in order then I don't see how is a single index sufficient to
> describe which ones have been used and which not.
I am not sure I get it.

The used idx (unlike a packed vq, there is no overwriting of descriptors)
and other states are in guest memory, so they are migrated along with the
guest memory.
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-06  9:43   ` Michael S. Tsirkin
  2023-11-07  9:09     ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06  9:43 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:33PM +0800, Zhu Lingshan wrote:
> This patch introduces a new status bit in the device status: SUSPEND.
> 
> This SUSPEND bit can be used by the driver to suspend a device,
> in order to stabilize the device states and virtqueue states.
> 
> Its main use case is live migration.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  content.tex | 36 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/content.tex b/content.tex
> index 76813b5..bcc9d4b 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  
>  \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>    an error from which it can't recover.
> +
> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that the
> +  device has been suspended by the driver.
> +

what does this mean?

>  \end{description}
>  
>  The \field{device status} field starts out as 0, and is reinitialized to 0 by
> @@ -73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  recover by issuing a reset.
>  \end{note}
>  
> +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
> +
> +When setting SUSPEND, the driver MUST re-read \field{device status} to ensure the SUSPEND bit is set.
> +

and if it's not?

>  \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>  
>  The device MUST NOT consume buffers or send any used buffer
> @@ -82,6 +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>  that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>  MUST send a device configuration change notification to the driver.
>  
> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
> +
> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
> +
> +The device SHOULD allow settings to \field{device status} even when SUSPEND is set.

which settings?

> +
> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD clear SUSPEND
> +and resumes operation upon DRIVER_OK.
> +
> +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND,
> +the device SHOULD perform the following actions before presenting SUSPEND bit in the \field{device status}:
> +
> +\begin{itemize}
> +\item Stop consuming buffers of any virtqueues and mark all finished descritors as used.

descritors? and what does finished mean?

> +\item Wait until all descriptors that being processed to finish and mark them as used.

descriptors are not marked used. buffers are.

that being -> that are being maybe?

> +\item Flush all used buffer and send used buffer notifications to the driver.

used buffers?
what does Flush mean?

> +\item Record Virtqueue State of each enabled virtqueue, see section \ref{sec:Virtqueues / Virtqueue State}

except that that section unfortunately does not bother to say what this
means :(

> +\item Pause its operation except \field{device status} and preserve configurations in its Device Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}

How do you Pause? For example, consider a link state register. You set
SUSPEND, then link goes down. What is device supposed to do?
Record this somewhere internal but do not show it to driver?
And how exactly will this hidden internal state be migrated
since it is not visible?


> +\end{itemize}
> +
>  \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
>  
>  Each virtio device offers all the features it understands.  During
> @@ -99,10 +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
>  \begin{description}
>  \item[0 to 23, and 50 to 127] Feature bits for the specific device type
>  
> -\item[24 to 42] Feature bits reserved for extensions to the queue and
> +\item[24 to 43] Feature bits reserved for extensions to the queue and
>    feature negotiation mechanisms
>  
> -\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
> +\item[44 to 49, and 128 and above] Feature bits reserved for future extensions.
>  \end{description}
>  
>  \begin{note}
> @@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>    \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
>    to access its internal virtqueue state.
>  
> +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
> +   SUSPEND the device.

why is SUSPEND upper-case here?

> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
> +
>  \end{description}
>  
>  \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
> -- 
> 2.35.3
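The "re-read device status" rule in the patch, together with the open
question of what happens when the bit is not yet set, can be sketched as a
bounded poll. SUSPEND (16) is the value proposed by the patch and
FEATURES_OK (8) is the existing status bit; struct mock_dev, the helper
names, and the poll budget with a failure return are assumptions made for
illustration, since the patch does not define that behavior.

```c
#include <stdint.h>
#include <stdbool.h>

#define VIRTIO_STATUS_FEATURES_OK 8    /* existing status bit */
#define VIRTIO_STATUS_SUSPEND     16   /* value proposed by this patch */

/* Hypothetical device model: it presents SUSPEND only after
 * `reads_until_quiesced` further status reads, standing in for the time
 * needed to drain in-flight buffers. */
struct mock_dev {
    uint8_t status;
    bool suspend_requested;
    unsigned reads_until_quiesced;
};

static void write_status(struct mock_dev *d, uint8_t v)
{
    if (v & VIRTIO_STATUS_SUSPEND)
        d->suspend_requested = true;
    /* SUSPEND is not presented until the device has quiesced. */
    d->status = (uint8_t)(v & ~VIRTIO_STATUS_SUSPEND);
}

static uint8_t read_status(struct mock_dev *d)
{
    if (d->suspend_requested) {
        if (d->reads_until_quiesced == 0)
            d->status |= VIRTIO_STATUS_SUSPEND;
        else
            d->reads_until_quiesced--;
    }
    return d->status;
}

/* Returns true once SUSPEND is observed, false if FEATURES_OK is missing
 * or the (assumed) poll budget runs out. */
static bool driver_suspend(struct mock_dev *d, unsigned max_polls)
{
    uint8_t s = read_status(d);
    if (!(s & VIRTIO_STATUS_FEATURES_OK))
        return false;
    write_status(d, (uint8_t)(s | VIRTIO_STATUS_SUSPEND));
    for (unsigned i = 0; i < max_polls; i++)
        if (read_status(d) & VIRTIO_STATUS_SUSPEND)
            return true;
    return false;
}
```

The false return on an exhausted budget is exactly the case the patch
leaves open: the text would need to say whether the driver keeps polling,
resets the device, or treats the suspend as failed.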




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-06  9:42         ` Zhu, Lingshan
@ 2023-11-06  9:45           ` Michael S. Tsirkin
  2023-11-07  8:11             ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06  9:45 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Mon, Nov 06, 2023 at 05:42:10PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 5:35 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2023 at 10:49:42PM +0800, Zhu, Lingshan wrote:
> > >          +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
> > >          +in \field{Available State} field,
> > >          +and correspondingly restore the Available State of every enabled splited virtqueue
> > >          +from \field{Available State} field when DRIVER_OK is set.
> > >          +
> > >          +The device SHOULD reset \field{Available State} field upon a device reset.
> > > 
> > >      At this point I have no idea
> > >      - how can a state of a virtqueue at a random time be represented
> > >        by a 16 bit integer
> > > 
> > > not sure what is a random time, this is to request the device to reset
> > > its avail state, for example, it is "le16 queue_avail_state" in virtio-pci
> > > common cfg. Resetting this so the device will not recover from a wrong value of
> > > the last run.
> > You simply never bother to say what is "Available State" and what
> > does it mean to restore it.  Not to mention words like "splited"
> > which just adds to the confusion.
> It says:
> +The available state field is two bytes of virtqueue state that is used by
> +the device to read the next available buffer. It is presented in the
> following format:
> 
> Do you want me to add more descriptions?

maybe start with an example

> > 
> > >      - if it's not at a random time then why do you even need an integer -
> > >        synchronize queue to memory and then all state is in memory
> > > 
> > > Not sure what is a sync queue, but for example, "le16 queue_avail_state" for
> > > PCI transport exists in a cap.
> > I just point out that normally a lot of ring state is in memory.
> > So you need to be much more specific about how you are augmenting that.
> > For example, if buffers are used exactly in order for a split ring
> > then used index seems to be exactly the same as last available index
> > you describe - it's a free running counter. OTOH if they are not
> > used in order then I don't see how is a single index sufficient to
> > describe which ones have been used and which not.
> I am not sure I get it.
> 
> Used idx(not like packed vq, no over-writing descriptors) and other states
> are in guest memory, so migrated with guest migration.

yes and so? why is that not enough and what is this available state then?

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND Zhu Lingshan
@ 2023-11-06  9:49   ` Michael S. Tsirkin
  2023-11-07  9:27     ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06  9:49 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> When SUSPEND is set, device states and virtqueue states
> should be stablized, therefore the driver should not
> reset vqs when SUSPEND is set in device status.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>  content.tex | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/content.tex b/content.tex
> index bcc9d4b..060b5c2 100644
> --- a/content.tex
> +++ b/content.tex
> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic Facilities of a Virtio Device /
>  The device MUST reset any state of a virtqueue to the default state,
>  including the available state and the used state.
>  
> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in \field{device status},
> +the driver SHOULD NOT reset any virtqueues.
> +
>  \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>  
>  After the driver tells the device to reset a queue, the driver MUST verify that

Seems somewhat arbitrary and breaks the claim that the
feature is orthogonal and can have uses besides migration.



> -- 
> 2.35.3





* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  9:16       ` Zhu, Lingshan
@ 2023-11-06 10:15         ` Michael S. Tsirkin
  2023-11-07  9:43           ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 10:15 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Mon, Nov 06, 2023 at 05:16:43PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/3/2023 10:21 PM, Zhu, Lingshan wrote:
> > 
> > 
> > On 11/3/2023 6:46 PM, Michael S. Tsirkin wrote:
> > > On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> > > > +\begin{lstlisting}
> > > > +struct virtio_pci_dity_page_track {
> > > > +        u8 enable;               /* Read-Write */
> > > > +        u8 gra_power;            /* Read-Write */
> > > > +        u8 reserved[2];
> > > > +        le32 {
> > > > +            pasid: 20;           /* Read-Write */
> > > > +            reserved: 12;
> > > > +        };
> > > > +        le64 bitmap_addr;        /* Read-Write */
> > > > +        le64 bitmap_length;      /* Read-Write */
> > > > +};
> > > > +\end{lstlisting}
> > > Okay, so it's a simple mailbox in config space.  Which by itself is
> > > probably a very reasonable idea - more or less what I suggested.
> > > However, using such a generic facility just for the dirty bitmap seems
> > > too limited.  Please make it accept arbitrary commands. Reusing admin
> > > command structure with a special "device itself" group sounds like one
> > > way to do it.
> > processing admin cmds in a cap may be too complex and overkill.
> > we need to handle variable length of cmds, handle async returned
> > results, and so on.
> > 
> > This struct seems easy and simple. And shall we use platform facilities
> > like vt-d
> > to track dirty pages?
> To demonstrate these issues, suppose we have a struct in a bar to process
> admin cmds:
> 
> struct virtio_admin_cmd {
>         u64 in_data_length;
>         u8 cmd_in_data[];
>         u64 out_data_length;
>         u8 cmd_out_data[];
>         u8 ret;
> };

An alternative is to do the same as you did here, e.g.:
struct virtio_admin_cmd {
         u64 admin_cmd_pa; /* an out descriptor followed by an in descriptor */
         u32 pasid : 20;
         u32 reserved : 11;
         u32 hardware : 1;
};

or we can stick two lengths and addresses straight in the capability.


> The problems are:
> 1) command_in_data and command_out data have variable length, so how many HW
> resource should be reserved in the bar?

Actually, admin commands are truncated by the device, so just
set the length to one that the device understands.

> 2) To process the cmds in the bar, the device MAY need to read many
> registers in cmd_in_data[] and write many registers in cmd_out_data[],
> which can be ineffective, this is not DMA.

True. Again if you don't want to depend on pasid that's the
only option.


> 3) a bar can only process one cmd at a time, and the driver can only issue
> another cmd after received an ret.
> This process has to be synchronous IO, one cmd blocks another.

Exactly same as what you did though.

> 4) VF implementing a bar processing admin cmds conflicts with PF's admin vq.

So just don't create conflicts. It's really the same as multiple admin vqs,
which we already support.


> 
> So I think a bar or a cap processing admin cmds is way to complex and
> overkill.
> 
> Thanks

Sounds like a straw man argument.

-- 
MST





* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  4:06         ` Zhu, Lingshan
@ 2023-11-06 10:22           ` Michael S. Tsirkin
  2023-11-07 10:44             ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 10:22 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Mon, Nov 06, 2023 at 12:06:39PM +0800, Zhu, Lingshan wrote:
> > > Intel production work with similar bitmap
> > > based dirty page tracking solution for years.
> > and then VMs became bigger and PML was introduced.
> So you agree we should track dirty pages through the platform facilities?
> I am glad to hear that!

I just said that I thought there's no PML in platform facilities and that
might be a problem. Am I wrong?

> > 
> > > Otherwise the device should report PFN which is not very practical.
> > Why not?
> Really? the device report PFN?
> What can happen if the device keep writing a small piece of memory???

Then you just report the PFN once. It should work like PML, really -
IOW the device maintains a bit per page internally and reports the
PFN when the bit is first set.
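
For illustration, the PML-like scheme described here could be sketched as below. This is a minimal model under stated assumptions, not anything taken from the patch or the spec: `struct pfn_logger`, `TRACKED_PAGES`, `LOG_CAPACITY` and both functions are hypothetical names. The device keeps one internal bit per page and appends a PFN to a log only when that bit goes from 0 to 1, so a device that keeps rewriting the same small piece of memory produces a single log entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRACKED_PAGES 1024u  /* hypothetical number of tracked pages */
#define LOG_CAPACITY  64u    /* hypothetical PFN log size */

struct pfn_logger {
    uint8_t  seen[TRACKED_PAGES / 8]; /* device-internal, one bit per page */
    uint64_t log[LOG_CAPACITY];       /* PFNs waiting for the hypervisor */
    size_t   log_len;
};

/* Record a device write to `pfn`. Returns true if a new log entry was added. */
static bool pfn_logger_write(struct pfn_logger *l, uint64_t pfn)
{
    assert(pfn < TRACKED_PAGES);
    uint8_t mask = (uint8_t)(1u << (pfn % 8));

    if (l->seen[pfn / 8] & mask)
        return false;   /* already logged since the last drain */
    if (l->log_len == LOG_CAPACITY)
        return false;   /* log full: a real design must stall or spill here */
    l->seen[pfn / 8] |= mask;
    l->log[l->log_len++] = pfn;
    return true;
}

/* Hypervisor side: copy out up to `max` PFNs and re-arm their internal bits. */
static size_t pfn_logger_drain(struct pfn_logger *l, uint64_t *out, size_t max)
{
    size_t n = l->log_len < max ? l->log_len : max;

    for (size_t i = 0; i < n; i++) {
        uint64_t pfn = l->log[i];
        out[i] = pfn;
        l->seen[pfn / 8] &= (uint8_t)~(1u << (pfn % 8));
    }
    /* Keep any entries that did not fit into `out`. */
    memmove(l->log, l->log + n, (l->log_len - n) * sizeof(l->log[0]));
    l->log_len -= n;
    return n;
}
```

The sketch deliberately leaves the log-full case unresolved; how a real device would apply back-pressure is exactly the kind of detail the thread has not settled.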

> > 
> > >      And the resolution is apparently 8 pages? You have just multiplied
> > >      the migration bandwidth by a factor of 8.
> > > 
> > > No, as described in the comments, the tracking granularity is controlled by
> > > \field{gra_power}: one bit represents a page with page_size = 2^(12 +
> > > gra_power). This can also be used to reduce the size of the bitmap.
> > .. at the cost of increasing migration bandwidth.
> The device is very likely to write a neighbor page,

how likely? and e.g. with slab randomization too? please collect some
data and show it.
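
To make the granularity trade-off concrete, the arithmetic can be sketched as below. It assumes only the rule quoted above, page_size = 2^(12 + gra_power), with bit N covering page N starting at address 0; the function names are hypothetical and the upper bound on `gra_power` is chosen here only to keep the shift within 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes covered by one bitmap bit: 2^(12 + gra_power). */
static uint64_t track_page_size(uint8_t gra_power)
{
    assert(gra_power <= 51);    /* keep the shift within 64 bits */
    return UINT64_C(1) << (12 + gra_power);
}

/* Index of the bitmap bit that covers guest address `addr`. */
static uint64_t dirty_bit_index(uint64_t addr, uint8_t gra_power)
{
    assert(gra_power <= 51);
    return addr >> (12 + gra_power);
}

/* Bitmap length in bytes needed to cover `mem_bytes`, rounded up. */
static uint64_t bitmap_bytes(uint64_t mem_bytes, uint8_t gra_power)
{
    uint64_t psz   = track_page_size(gra_power);
    uint64_t pages = mem_bytes / psz + (mem_bytes % psz != 0);

    return pages / 8 + (pages % 8 != 0);
}
```

With these definitions a 1 TiB guest needs a 32 MiB bitmap at gra_power = 0 and a 64 KiB bitmap at gra_power = 9, but at gra_power = 9 a single dirtied 4 KiB page marks a whole 2 MiB region for resend. That is the bitmap-size versus migration-bandwidth cost being argued about.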

> and this happens
> everywhere for example CPU read 64 bytes aligned data.

CPUs don't need to send their cache across a bandwidth constrained
shared network.

> 
> This is a tradeoff

tradeoff between which two options?

> > 
> > > "To prevent a read-modify-write procedure, if a memory page is dirty,
> > > optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1."
> > > 
> > > This is optional and DMA is very likely to write a neighbor page, and the device transmit a whole byte anyway
> > > when a bit is dirty.
> > > 
> > > How about we use platform dirty page tracking facility then implement this in virtio, as Jason suggested?
> > > 
> > Without something like PML it likely won't scale either.
> So that would be platform issue which we don't need to take care of?
> Intel VT-d can do this for sure.

Intel VT-d supports PML from the IOMMU? I didn't realize. Could you help
me find it in the doc please? Which hardware supports this in the field?
What about other vendors?

-- 
MST





* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  4:34             ` [virtio-comment] " Parav Pandit
  2023-11-06  9:34               ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-06 10:29               ` Michael S. Tsirkin
  2023-11-06 11:21                 ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 10:29 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 06, 2023 at 04:34:44AM +0000, Parav Pandit wrote:
> > Do you know there are locked transaction and atomic operations in PCI???
> Can you explain how PCI does RMW locked transaction?
> Is it one TLP or multiple?

Parav, what are you asking about here? PCIe supports CAS and Swap, which
likely can work for this use case - these are non-posted writes. It's in
the PCIe spec. Zhu Lingshan, if your proposal relies on this then you should
include
- an explanation of how it's supposed to be implemented using AtomicOp
- some data on how common support is or is likely to be in the field

-- 
MST





* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  3:58             ` Zhu, Lingshan
@ 2023-11-06 10:33               ` Michael S. Tsirkin
  2023-11-07  9:48                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 10:33 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 06, 2023 at 11:58:03AM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 12:12 AM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
> > > > > [1]
> > > > > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> > > > you still need to explain why this does not work for pass-through.
> > > It does not work for following reasons.
> > > 1. Because all the fields that put on the member device are not in direct control of the hypervisor.
> > > The device is directly controlled by the guest including the device status and when it resets the device all the things stored in the device are lost.
> > I think the idea is that when this gateway is in the device then
> > device reset has to trap. At a high level, ok. But then what?
> No, when device reset, the device is expected to forget everything and
> re-intialize.

That's a problem then - memory that was already written will not be
detected as such.

> > Is a full scan of all memory required until device reset is complete?
> Who scan the memory? The device tracks its own dirty pages.

yes but reset erases this information.


> > Drivers currently tend to busy poll the reset register,
> > if this takes very long we might start seeing soft lockup
> > messages. What is the idea then? Maybe for this we need a separate
> > weaker reset that does not touch this capability?
> When reset, how can we expect the LM progress continue running.
> 
> For example, when the device DMA writes something, then reset before sending
> an interrupt,
> the DMA-ed pages should be lost as expected, right?

interrupt is going to guest, has nothing to do with it.

device writes data into memory
device sends interrupt
driver sees data
driver sends reset

meanwhile hypervisor did not see any dirty pages

now what? hypervisor must apparently retrieve all
dirty page data before it can reset the device.
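
The sequence above can be modelled in a few lines. This is a toy sketch, assuming the device holds dirty information internally that the hypervisor has not yet retrieved and that reset clears it; `struct dev_model` and all function names are hypothetical, and `hv_trapped_reset` shows the consequence drawn above rather than a mechanism defined anywhere in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPAGES 64u  /* hypothetical tracked range, in pages */

struct dev_model {
    bool    tracking;           /* models the `enable` field */
    uint8_t dirty[NPAGES / 8];  /* dirty info not yet seen by the hypervisor */
};

/* Device writes guest memory; the page is recorded only while tracking. */
static void dev_dma_write(struct dev_model *d, unsigned pfn)
{
    assert(pfn < NPAGES);
    if (d->tracking)
        d->dirty[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Hypervisor side: count the dirty pages, then clear them. */
static unsigned hv_harvest(struct dev_model *d)
{
    unsigned n = 0;

    for (unsigned pfn = 0; pfn < NPAGES; pfn++)
        if (d->dirty[pfn / 8] & (1u << (pfn % 8)))
            n++;
    memset(d->dirty, 0, sizeof(d->dirty));
    return n;
}

/* Guest-triggered reset: the device forgets everything, tracking included. */
static void dev_reset(struct dev_model *d)
{
    memset(d, 0, sizeof(*d));
}

/* Trapped reset: retrieve the dirty data first, then let the reset proceed. */
static unsigned hv_trapped_reset(struct dev_model *d)
{
    unsigned n = hv_harvest(d);

    dev_reset(d);
    return n;
}
```

In the untrapped case a page written just before the reset is never reported. In the trapped case it is, at the price of doing the retrieval inside the guest's reset path, which is where the soft-lockup concern comes from.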

-- 
MST





* RE: [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-06  9:21             ` Zhu, Lingshan
@ 2023-11-06 10:52               ` Parav Pandit
  2023-11-07  8:21                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 10:52 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 2:51 PM
> 
> On 11/6/2023 12:07 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 6, 2023 9:00 AM
> >>
> >> On 11/3/2023 11:54 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 3, 2023 8:25 PM
> >>>>
> >>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>
> >>>>>> This patch introduces a new status bit in the device status: SUSPEND.
> >>>>>>
> >>>>>> This SUSPEND bit can be used by the driver to suspend a device,
> >>>>>> in order to stabilize the device states and virtqueue states.
> >>>>>>
> >>>>>> Its main use case is live migration.
> >>>>>>
> >>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> >>>>> You constantly complained that whatever was proposed using admin
> >>>> commands method in [1] must work for passthrough and non-passthrough.
> >>>>> And halfway in the discussion you propose a method after learning
> >>>>> all the
> >>>> limitations of in-band, you propose a solution only works for
> >>>> non-passthrough mode.
> >>>>> You asked someone to have comprehensive proposal and when it comes
> >>>>> to
> >>>> you following it, you just don’t.
> >>>> not sure what you are talking about.
> >>>>> And have most shallow commit message to not even mention it.
> >>>>>
> >>>>> Please be consistent in design approach.
> >>>>> And if you don’t want to be, stop asking others.
> >>>> this SUSPEND/RESUME doesn't change since the RFC series, how can it
> >>>> not be inconsistent???
> >>>>> This is not the way TC collaboration works.
> >>>>> I probably shouldn’t even expect this from you.
> >>> Your proposal does not cover both the use cases of passthrough and
> >>> non-
> >> passthrough.
> >>> Yet you kept demanding them for others.
> >>> This is just wrong.
> >>>
> >>> I am aware that both models as technical pros and cons.
> >> Why this doesn't work? the device status byte has been working for
> >> many years, and do you know when guest freeze, the hypervisor owns the
> device????
> > When the guest is not frozen and during the pre-copy phase, hypervisor needs
> to access the device (context, dirty pages).
> > How does it work if the guest owns the device?
> Have you seen PASID there?
PASID does not help because, as explained, the virtio common config space and the device-specific config space are owned by the guest driver.

Secondly, the PASID space is also owned by the guest driver.

> >
> >>>>> [1]
> >>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
> >>>> Please don't be so emotional and please be professional.
> >>>>
> >>>> Why this solution can not work for pass-through? Do you know the
> >>>> device ownership will be transferred to the hypervisor when guest
> >>>> suspended in live migration?
> >>> I explained 5 reasons why it does not work in previous reply.
> >>>
> >>> As the word indicates "live migration", the hypervisor needs to
> >>> access the
> >> device when it is "live" (not just after).
> >>> Hence, passthrough mode must be able to capture the state of the
> >>> device and
> >> dirty pages database when its live.
> >>> (and after the source is suspended).
> >> No, the hypervisor should only collect dirty pages when the device alive.
> > It is needed during both the times.
> > When the device and guest is live during pre-copy phase.
> > And after the device is frozen, to get the final round of pages.
> With PASID, dirty page tracking facility can be isolated from the guest, means
> the hypervisor owns this facility. So the hypervisor can collect the dirty pages.
> 
> When the device suspended, it should report the last round of dirty pages
> through dirty page tracking facility as expected.
> 
> This can work, right?
Unfortunately no, as a non-atomic bitmap cannot reside in host memory,
and whatever is in the device gets reset on device reset and/or FLR. So the dirty map detail is lost.
Similarly, the device context is also lost on these two events triggered by the guest.

> >
> >> As you can see, the dirty page tracking facility has a PASID for
> >> isolation. But still, the question is, we should better use platform
> >> dirty page tracking
> >>
> > Nothing to do with PASID, as PASID is owned by the guest.
> It looks you don't know how PASID work.

> 
> Host can setup PASID to isolate some facilities, right?
There are a few limitations with PASID:
a. Not all platforms have PASID.
b. As I explained above, PASID does not always work, because PASID only bifurcates DMA, not the device _functionality_.
c. PASID has to be available to the guest as-is, as it is present on the device.

> >
> >> Then suspend the device after guest freeze, to stabilize the device
> >> status, then read the status.
> >>
> >> How can you say this does not work???
> > I explained above.
> see above



* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-06  9:27             ` Zhu, Lingshan
@ 2023-11-06 10:52               ` Parav Pandit
  2023-11-07  9:31                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 10:52 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 2:57 PM
> 
> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 6, 2023 9:01 AM
> >>
> >> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>> From: virtio-comment@lists.oasis-open.org
> >>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
> >>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>
> >>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>
> >>>>>> This patch adds two new le16 fields to common configuration
> >>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> >>>>>>
> >>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>> ---
> >>>>>>     transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>     1 file changed, 18 insertions(+)
> >>>>>>
> >>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>>>> a5c6719..3161519 100644
> >>>>>> --- a/transport-pci.tex
> >>>>>> +++ b/transport-pci.tex
> >>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> >> structure
> >>>>>> layout}\label{sec:Virtio Transport
> >>>>>>             /* About the administration virtqueue. */
> >>>>>>             le16 admin_queue_index;         /* read-only for driver */
> >>>>>>             le16 admin_queue_num;         /* read-only for driver */
> >>>>>> +
> >>>>>> +	/* Virtqueue state */
> >>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>> This tiny interface for 128 virtio net queues through register
> >>>>> read writes, does
> >>>> not work effectively.
> >>>>> There are inflight out of order descriptors for block also.
> >>>>> Hence toy registers like this do not work.
> >>>> Do you know there is a queue_select? Why this does not work? Do you
> >>>> know how other queue related fields work?
> >>> :)
> >>> Yes. If you notice queue_reset related critical spec bug fix was
> >>> done when it
> >> was introduced so that live migration can _actually_ work.
> >>> When queue_select is done for 128 queues serially, it take a lot of
> >>> time to
> >> read those slow register interface for this + inflight descriptors + more.
> >> interesting, virtio work in this pattern for many years, right?
> > All these years 400Gbps and 800Gbps virtio was not present, number of
> queues were not in hw.
> The registers are control path in config space, how 400G or 800G affect??
Because those are the ones that in practice require a large number of VQs.

You are asking for per-VQ register commands that modify things dynamically, one VQ at a time, serializing all the operations.
It does not scale well with a high queue count.
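
The serialization being described can be sketched with a mock of the proposed registers. The field names `queue_select`, `queue_avail_state` and `queue_used_state` come from the patch; everything else (`struct mock_dev`, the accessor functions, the access counter, the 128-queue figure taken from the example earlier in this thread) is a hypothetical stand-in for real MMIO accesses.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_QUEUES 128u  /* the queue count used as an example in the thread */

struct vq_state {
    uint16_t avail;  /* queue_avail_state */
    uint16_t used;   /* queue_used_state */
};

/* Mock device: each accessor below stands for one config-space access. */
struct mock_dev {
    uint16_t        queue_select;
    struct vq_state vqs[NUM_QUEUES];
    unsigned        reg_accesses;
};

static void write_queue_select(struct mock_dev *d, uint16_t q)
{
    assert(q < NUM_QUEUES);
    d->queue_select = q;
    d->reg_accesses++;
}

static uint16_t read_queue_avail_state(struct mock_dev *d)
{
    d->reg_accesses++;
    return d->vqs[d->queue_select].avail;
}

static uint16_t read_queue_used_state(struct mock_dev *d)
{
    d->reg_accesses++;
    return d->vqs[d->queue_select].used;
}

/* Save the state of the first `n` queues; returns the register accesses used. */
static unsigned save_all_vq_state(struct mock_dev *d, struct vq_state *out,
                                  uint16_t n)
{
    unsigned before = d->reg_accesses;

    for (uint16_t q = 0; q < n; q++) {
        write_queue_select(d, q);   /* one queue at a time */
        out[q].avail = read_queue_avail_state(d);
        out[q].used  = read_queue_used_state(d);
    }
    return d->reg_accesses - before;
}
```

In this model saving 128 queues costs 384 strictly ordered register accesses, and restoring them on the destination costs the same again. Whether that is acceptable for a control-path operation is the point in dispute.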
> See the virtio common cfg, you will find the max number of vqs is there,
> num_queues.

:)
Sure. Those values matter at a high queue count.

> > Device didn’t support LM.
> > Many limitations existed all these years and TC is improving and expanding
> them.
> > So all these years do not matter.
> Not sure what are you talking about, haven't we initialize the device and vqs in
> config space for years?????? What's wrong with this mechanism?
> Are you questioning virito-pci fundamentals???
Don’t point to an inefficient past to justify a similarly inefficient future.

> >
> >>>> Like how to set a queue size and enable it?
> >>> Those are meant to be used before DRIVER_OK stage as they are init
> >>> time
> >> registers.
> >>> Not to keep abusing them..
> >> don't you need to set queue_size at the destination side?
> > No.
> > But the src/dst does not matter.
> > Queue_size to be set before DRIVER_OK like rest of the registers, as all
> queues must be created before the driver_ok phase.
> > Queue_reset was last moment exception.
> create a queue? Nvidia specific?
> 
Huh. No.
Do git log and realize what happened with queue_reset.

> For standard virtio, you need to read the number of enabled vqs at the source
> side, then enable them at the dst, so queue_size matters, not to create.
All that happens in the pre-copy phase.


* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  9:34               ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-06 10:52                 ` Parav Pandit
  2023-11-06 11:05                   ` [virtio-comment] " Michael S. Tsirkin
  2023-11-07  9:52                   ` Zhu, Lingshan
  2023-11-06 11:13                 ` [virtio-comment] " Parav Pandit
  1 sibling, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 10:52 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 3:04 PM
> 
> On 11/6/2023 12:34 PM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 6, 2023 9:22 AM
> >>
> >>
> >> On 11/3/2023 11:47 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 3, 2023 8:33 PM
> >>>>
> >>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>> From: Michael S. Tsirkin <mst@redhat.com>
> >>>>>> Sent: Friday, November 3, 2023 4:20 PM
> >>>>>>
> >>>>>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> >>>>>>> +\item[\field{bitmap_addr}]
> >>>>>>> +	The driver use this to set the address of the bitmap which
> >>>>>>> +records the
> >>>>>> dirty pages
> >>>>>>> +	caused by the device.
> >>>>>>> +	Each bit in the bitmap represents one memory page, bit 0 in
> the bitmap
> >>>>>>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so
> >>>>>>> +on in a
> >>>>>> linear manner.
> >>>>>>> +	When \field{enable} is set to 1 and the device writes to a
> memory page,
> >>>>>>> +	the device MUST set the corresponding bit to 1 which
> >>>>>>> +indicating the
> >>>>>> page is dirty.
> >>>>>>> +\item[\field{bitmap_length}]
> >>>>>>> +	The driver use this to set the length in bytes of the bitmap.
> >>>>>>> +\end{description}
> >>>>>>> +
> >>>>>>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
> >>>>>>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus /
> >>>>>>> +Memory Dirty Pages Tracker Capability}
> >>>>>>> +
> >>>>>>> +The device MUST NOT set any bits beyond bitmap_length when
> >>>>>>> +reporting
> >>>>>> dirty pages.
> >>>>>>> +
> >>>>>>> +To prevent a read-modify-write procedure, if a memory page is
> >>>>>>> +dirty,
> >>>>> It is not to prevent; it is just not possible to do racy RMW. 😊
> >>>> if you understand what is a atomic routine, you will not call it racy.
> >>>>> Hence to work around you propose to mark all pages dirty. Too bad.
> >>>>> This just does not work.
> >>>> why? and this is optional.
> >>> Because device cannot set individual bits in atomic way for same
> >>> byte read by
> >> the cpu.
> >>> 1. device read the byte that had bit 0 and 4 set.
> >>> 2. cpu atomically clear these bits.
> >>> 3. device wrote bits 0, 4, and new bits 2 and 3.
> >>> 4. cpu now transferred page 0 and 4 again.
> >>>
> >>> Optional thing also needs to work. :)
> >> Do you know both CPU and device actually don't read bit, they read
> bytes????
> > Yes. this is why atomic_OR is not possible on pcie.
> >
> >> Do you know RC connected to memory controller????
> > Yes.
> >> Do you know there are locked transaction and atomic operations in PCI???
> > Can you explain how PCI does RMW locked transaction?
> > Is it one TLP or multiple?
> >
> >> Do you know there are atomic read/write/clear even read and clear and
> >> so on in CPU ISA????
> > Read is always atomic from cpu.
> > I didn’t know about read_and_clear atomic ISA. This combined with pci future
> support for atomic_or.
> > If you already know a Linux kernel api for atomic_read_and_clear, please
> share.
> To answer all questions above, you should read PCI spec and CPU SDM, we don't
> copy and paste the content here, nobody develop their knowledge this way.

I read the PCI spec and I see only 3 atomic operations, none of which is an atomic OR.
I will try to find the CPU instruction for read_and_clear that you suggested.
Thanks for the suggestion.
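
The four-step interleaving quoted earlier in this message can be replayed deterministically. The sketch below is a single-threaded CPU-side model using C11 atomics, with hypothetical function names; it models the logic only, not PCIe TLP formats or AtomicOp operand sizes. The first function is the plain read-modify-write from the quoted steps. The second shows how a compare-and-swap retry loop would behave under the same interleaving, on the assumption that the device can issue such an operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Plain device read-modify-write with a CPU read-and-clear racing in. */
static uint8_t replay_plain_rmw(uint8_t initial, uint8_t new_bits,
                                uint8_t *cpu_harvest)
{
    _Atomic uint8_t byte;

    assert(cpu_harvest);
    atomic_init(&byte, initial);
    uint8_t dev_copy = atomic_load(&byte);               /* 1. device reads   */
    *cpu_harvest = atomic_exchange(&byte, 0);            /* 2. CPU clears     */
    atomic_store(&byte, (uint8_t)(dev_copy | new_bits)); /* 3. device writes  */
    return atomic_load(&byte);                           /* 4. CPU sees this  */
}

/* Same interleaving, but the device updates the byte with a CAS retry loop. */
static uint8_t replay_cas_retry(uint8_t initial, uint8_t new_bits,
                                uint8_t *cpu_harvest)
{
    _Atomic uint8_t byte;

    assert(cpu_harvest);
    atomic_init(&byte, initial);
    uint8_t expected = atomic_load(&byte);               /* device reads      */
    *cpu_harvest = atomic_exchange(&byte, 0);            /* CPU clears        */
    /* The first CAS fails because the byte changed; `expected` is refreshed
     * with the current value and the loop retries with the new baseline. */
    while (!atomic_compare_exchange_strong(&byte, &expected,
                                           (uint8_t)(expected | new_bits)))
        ;
    return atomic_load(&byte);
}
```

Starting from bits 0 and 4 set (0x11) and adding bits 2 and 3 (0x0C), the plain variant leaves 0x1D in the byte, so pages 0 and 4 are transferred a second time. The CAS variant leaves only 0x0C.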

> >
> >>>>> Secondly the bitmap array is function is for full guest memory
> >>>>> size, while
> >>>> there is lot of sparce region now and also in future.
> >>>>> This is the second problem.
> >>>> did you see gra_power and its comments?
> >>> gra_power says the page size.
> >>> Not the sparce multiple ranges of the guest memory.
> >>> Device endup tracking uninterested area as well.
> >> increase gra_power can reduce bitmap size, right?
> >> Totally up to the hypervisor, right?
> > Yes, and that can increase the amount of memory.
> > The way I understood is, if gra_power is 2MB, than whole 2MB page to be
> considered dirty, even if 8KB was dirty.
> > Did I understand it right?
> Do you know DMA are very likely to write a neighbor page? Do you know why
> huge page is introduced?
> Hint: not only for reduce TLB miss.
> >
> >>>>> This is exactly why I asked you to review the page write recording
> >>>>> series of
> >>>> admin commands and comment.
> >>>>> And you never commented with sheer ignorance.
> >>>>>
> >>>>> So clearly the start stop method for specific range and without
> >>>>> bandwidth
> >>>> explosion, admin commands of [1] stands better.
> >>>>> If you do [1] on the member device also using its AQ in future, it
> >>>>> will work for
> >>>> non-passthrough case.
> >>>>> If you build non-passthrough live migration using [1], also it will work.
> >>>>> So I don’t see any point of this series anymore.
> >>>> As Jason pointed out, there are many problems in your proposal, you
> >>>> should answer there. I don't need to repeat his words and duplicate
> >>>> the
> >> discussions.
> >>> Many are already addressed in v3.
> >> interesting, does your V3 support nested?
> > Not directly.
> > Is it similar to cpu PML which does not supported nested.
> > One can always implement nested using some emulation.
> > The second option for high performance would be allow SR-IOV cap on the VF
> and support true nesting using existing proposal of v3.
> If your proposal does not support nested, then it is incomplete.
> >
> >>>>> [1]
> >>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> >>>> you still need to explain why this does not work for pass-through.
> >>> It does not work for following reasons.
> >>> 1. Because all the fields that put on the member device are not in
> >>> direct
> >> control of the hypervisor.
> >>> The device is directly controlled by the guest including the device
> >>> status and
> >> when it resets the device all the things stored in the device are lost.
> >> have you seen PASID? and if the device reset, it has to forget
> >> everything as expected, right?
> > PASID does not help with the reset. Because as you told reset, resets
> everything.
> > PASID does not bifurcate the device common control which is not linked to any
> PASID.
> PASID means some facilities can be isolated. When reset, the device forget
> everything.
> >
> >>> 2. the PCI FLR is clearing all the registers you exposed here.
> >> see above
> >>> 3. Endless expansion of config registers of dirty tracking is not
> >>> scalable, as they
> >> are not init time registers not following the Appendix B guidelines.
> >> endless expansion?? It is a complete set of dirty page tracking, right????
> >> have you see this cap only controls? The device DMA writes the
> >> bitmap, not by registers.
> > Device dirty page tracking is start/stop command to be done by the
> hypervisor.
> > So when guest is resetting the device, it stopped the DMA initiated by the
> hypervisor.
> > This fundamentally breaks things.
> Why? When device resets, do you want to keep tracking dirty pages????
Yes. All the pages that were dirtied before the device reset must still be migrated.
And page tracking must continue for new pages after the reset.

> >

> >> Again, if you want to fix Appendix B, OK.
> >>> 4. bitmap based dirty tracking is not atomic between cpu and device.
> >>> Hence, it is racy.
> >> see above, the first reply.
> >>> 5. All the device context needed for passthrough based hypervisor
> >>> for a
> >> device type specific is missing.
> >>> All of those can be used for non-passthrough as well.
> >>> [1]
> >>> https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
> >> If you want to discuss this again, I don't want to waste time; I am only
> >> asking you whether you want to define the virtio-fs device context
> > It will be defined in future.
> > And if virtio-fs was not written with migration in mind, may be one will invent
> virtio-fs2.
> don't say future, talk is cheap, show me the code.
> >
> >>>> And I
> >>>> remember this is a point-less topic as MST ever wants to mute
> >>>> another
> >>>> "pass- through" thread.
> >>> No. he did not say that.
> >>> He meant to not endlessly debate which one is better.
> >>> He clearly said, try to see if you can make multiple hypervisor model work.
> >>> And your series shows a clear ignorance of his guidance.
> >> Let me quote MST's reply here:
> >> "I feel this discussion will keep meandering because the terminology is
> vague.
> >> There's no single thing that is called "passthrough" - vendors just
> >> build what is expedient with current hardware and software. Nvidia
> >> has a bunch of people working on vfio so they call that passthrough,
> >> Red Hat has people working on VDPA and they call that passthrough, etc.
> >>
> >>
> >> Before I mute this discussion for good, does anyone here have any
> >> feeling progress is made? What kind of progress? "
> >>
> >> So please don't discuss on pass-through anymore.
> > I don’t want to discuss the pros and cons of passthrough vs, vdpa, as usual.
> > V3 covers broader use case of passthrough, hence once can always implement
> trap+emulation instead of passthrough.
> > V3 already indicates that other variants of the passthrough can be done as
> well or can be extended.
> > So please explore if that fits your vdpa need.
> So, please no pass-through discussion anymore.
> >
> >> It seems only you need to develop the knowledge
> >>>
> >>>>>>> +optionally the device is permitted to set the entire byte,
> >>>>>>> +which encompasses
> >>>>>> the relevant bit, to 1.
> >>>>>>> +
> >>>>>>> +The device MAY increase \field{gra_power} to reduce
> >>>> \field{bitmap_length}.
> >>>>>>> +
> >>>>>>> +The device must ignore any writes to \field{pasid} if PASID
> >>>>>>> +Extended Capability is absent or the PASID functionality is
> >>>>>>> +disabled in PASID Extended Capability
> >>>>>> I have to say this is going to work very badly when the number of
> >>>>>> dirty pages is
> >>>>>> small: you will end up scanning and re-scanning all of bitmap.
> >>>>>> And the resolution is apparently 8 pages? You have just
> >>>>>> multiplied the migration bandwidth by a factor of 8.
> >>>>> Yeah.
> >>>>> And device does not even know previously reported pages are read
> >>>>> by driver
> >>>> or not. All guess work game for driver and device.
> >>>> see my reply to him
> >>> Please see above reply.
> >> see above


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 10:52                 ` [virtio-comment] " Parav Pandit
@ 2023-11-06 11:05                   ` Michael S. Tsirkin
  2023-11-06 11:07                     ` [virtio-comment] " Parav Pandit
  2023-11-07  9:52                   ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 11:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 06, 2023 at 10:52:46AM +0000, Parav Pandit wrote:
> I read the pci spec and I see only 3 operations which does not have atomic or.

CAS+retry can be used to implement any atomics including atomic or.
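
As a rough illustration of the CAS+retry idea above, here is a CPU-side sketch in C11. It is not device or PCIe code, and the function name fetch_or_via_cas is made up for this example. It only shows how an atomic OR can be built from compare-and-swap:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Emulate an atomic fetch-OR using compare-and-swap plus retry.
 * Returns the value the word held just before the OR took effect. */
static uint32_t fetch_or_via_cas(_Atomic uint32_t *word, uint32_t mask)
{
    assert(word != NULL);

    uint32_t old = atomic_load(word);

    /* On failure, atomic_compare_exchange_weak stores the current value
     * of *word into `old`, so every retry recomputes old | mask from
     * fresh data. */
    while (!atomic_compare_exchange_weak(word, &old, old | mask)) {
        /* Another writer changed the word in between; try again. */
    }
    return old;
}
```

A device would issue PCIe CAS AtomicOp requests instead of calling into stdatomic.h, and each failed compare costs another round trip on the link, which is the retry cost being debated in this thread.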

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 11:05                   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-06 11:07                     ` Parav Pandit
  2023-11-06 11:21                       ` [virtio-comment] " Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 11:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, November 6, 2023 4:36 PM
> On Mon, Nov 06, 2023 at 10:52:46AM +0000, Parav Pandit wrote:
> > I read the pci spec and I see only 3 operations which does not have atomic or.
> 
> CAS+retry can be used to implement any atomics including atomic or.

Yes, we considered it. The PCI backpressure is large enough that it negates the value of such retries for such small transactions.



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  9:34               ` [virtio-comment] " Zhu, Lingshan
  2023-11-06 10:52                 ` [virtio-comment] " Parav Pandit
@ 2023-11-06 11:13                 ` Parav Pandit
  2023-11-07 10:01                   ` [virtio-comment] " Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 11:13 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 6, 2023 3:04 PM

> So, please no pass-through discussion anymore.

If you comment like this, nothing can progress.

What you are implying with the above language is:
"A virtio device can do live migration ONLY by creating a vdpa device on top of an existing virtio device, so that after running through 3 layers of the stack you get another virtio device on the other side!"

If so, I disagree 100% with such a single-minded design.

I am at least trying to propose a solution that can work for generic passthrough, where the least amount of hypervisor mediation is done.

It can also be extended so that the hypervisor has the choice to add more mediation layers as it finds suitable.
If there are technical issues, maybe two different interfaces or more admin commands are needed for the two modes.
The idea is to attempt to converge and discuss those details, not the opposite.

Your above comment shows a clear lack of collaboration toward making both modes work.
At some point I will probably stop responding to comments of yours that repeatedly say:

"Go read QEMU code, Do you know what is PASID?, Do you know num_queues, Go read PCI spec"...

Taking a deep breath now to do some productive work in the TC...

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 10:29               ` Michael S. Tsirkin
@ 2023-11-06 11:21                 ` Parav Pandit
  2023-11-06 11:27                   ` [virtio-comment] " Michael S. Tsirkin
  2023-11-07 10:02                   ` [virtio-comment] " Zhu, Lingshan
  0 siblings, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 11:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, November 6, 2023 4:00 PM
> 
> On Mon, Nov 06, 2023 at 04:34:44AM +0000, Parav Pandit wrote:
> > > Do you know there are locked transaction and atomic operations in PCI???
> > Can you explain how PCI does RMW locked transaction?
> > Is it one TLP or multiple?
> 
> Parav what are you asking about here? 

> pcie supports CAS and Swap which likely
> can work for this use-case - these are non posted writes. It's in the pcie spec.
The PCI spec does not have an atomic OR operation.
Lingshan, in the above comment, suggested some unspecified locked transaction and atomic operation.
So I was asking him which atomic operation that is and how PCI does it.
I don't know of any that can do a PCI atomic OR without a workaround.



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 11:07                     ` [virtio-comment] " Parav Pandit
@ 2023-11-06 11:21                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 11:21 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 06, 2023 at 11:07:24AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, November 6, 2023 4:36 PM
> > On Mon, Nov 06, 2023 at 10:52:46AM +0000, Parav Pandit wrote:
> > > I read the pci spec and I see only 3 operations which does not have atomic or.
> > 
> > CAS+retry can be used to implement any atomics including atomic or.
> 
> Yes, we considered it. the PCI backpressure is lot that negates the value of such retries for such small transactions.

If there's some locality, the device could maybe buffer these up,
though of course the chances of a failure increase then.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 11:21                 ` [virtio-comment] " Parav Pandit
@ 2023-11-06 11:27                   ` Michael S. Tsirkin
  2023-11-06 11:31                     ` [virtio-comment] " Parav Pandit
  2023-11-07 10:02                   ` [virtio-comment] " Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-06 11:27 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 06, 2023 at 11:21:12AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Monday, November 6, 2023 4:00 PM
> > 
> > On Mon, Nov 06, 2023 at 04:34:44AM +0000, Parav Pandit wrote:
> > > > Do you know there are locked transaction and atomic operations in PCI???
> > > Can you explain how PCI does RMW locked transaction?
> > > Is it one TLP or multiple?
> > 
> > Parav what are you asking about here? 
> 
> > pcie supports CAS and Swap which likely
> > can work for this use-case - these are non posted writes. It's in the pcie spec.
> PCI spec do not have atomic OR operation.
> Lingshan in above comment suggested some unknown locked transaction and atomic operation.
> So I was asking him which is that atomic operation and how PCI does it?
> I don't know if any that can do PCI atomic OR without a workaround.

Well this is not what you wrote - you asked about RMW and that's exactly what CAS does.
Maybe you should take a bit more care writing then. Generally this
discussion would benefit a lot if people stop shooting from the hip.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 11:27                   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-06 11:31                     ` Parav Pandit
  0 siblings, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-06 11:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, November 6, 2023 4:58 PM
> 
> On Mon, Nov 06, 2023 at 11:21:12AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, November 6, 2023 4:00 PM
> > >
> > > On Mon, Nov 06, 2023 at 04:34:44AM +0000, Parav Pandit wrote:
> > > > > Do you know there are locked transaction and atomic operations in
> PCI???
> > > > Can you explain how PCI does RMW locked transaction?
> > > > Is it one TLP or multiple?
> > >
> > > Parav what are you asking about here?
> >
> > > pcie supports CAS and Swap which likely can work for this use-case -
> > > these are non posted writes. It's in the pcie spec.
> > PCI spec do not have atomic OR operation.
> > Lingshan in above comment suggested some unknown locked transaction and
> atomic operation.
> > So I was asking him which is that atomic operation and how PCI does it?
> > I don't know if any that can do PCI atomic OR without a workaround.
> 
> Well this is not what you wrote - you asked about RMW and that's exactly what
> CAS does.
> Maybe you should take a bit more care writing then. Generally this discussion
> would benefit a lot if people stop shooting from the hip.

Yes. I will take more care. Thanks.



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 0/6] introduce basic facilities for virito live migration
  2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
                   ` (5 preceding siblings ...)
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 6/6] virtio-pci: implement dirty page tracking Zhu Lingshan
@ 2023-11-07  8:01 ` Michael S. Tsirkin
  2023-11-08 10:19   ` Zhu, Lingshan
  6 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07  8:01 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:31PM +0800, Zhu Lingshan wrote:
> This series introduces basic facilities to support
> virtio live migration, includes:
> 
> 1)a new SUSPEND bit in the device status
> Which is used to suspend the device, so that the device states
> and virtqueue states are stabilized.
> 
> 2)virtqueue state and its accessor, to get and set last_avail_idx
> and last_used_idx of virtqueues.
> 
> 3)dirty page tracking


So looking at this from 100ft:
- SUSPEND bit looks like something that might have value as a generic
  component. For example, maybe for NUMA balancing we could suspend,
  quickly copy ring to a different node and resume.  However current
  restrictions make it very limited, e.g.  apparently you can't change
  config space while suspended.
  As another example, changing config while suspended might be
  needed e.g. for net announcements.
  Also, do we want to suspend individual
  queues then? It is also unclear what exactly happens to config changes
  that would otherwise occur while suspended. Also, as is, the proposal is
  very light on detail. Other patches in the series make it look like
  there are more assumptions made about e.g. how a vq enters the
  suspended state.

- virtqueue state proposal looks very vague. A couple of 16 bit indices
  are insufficient to fully describe internal vq state at an arbitrary
  time. Some assumptions seem to be made that make this possible and
  yes, these would need to be stated and/or lifted.
  Preferably lifted since another use-case proposed was debugging -
  you do not, while debugging, want to depend on device following
  a complex set of assumptions.
  
- dirty page tracking as described does not seem practical for
  many systems.  Increasing the effective page size x8 is just being nasty
  towards other network users. CAS + retry could be a solution,
  but this would then need to be documented thoroughly, and it appears
  this is not what the author expects to implement
  anyway - instead, there's an assumption that the platform itself
  will support dirty tracking. By itself, this is not
  an impossible assumption - it will possibly result in a cheaper,
  slower device. Why not have an option like this?
  I would probably just drop it from this proposal completely.
  Also, tracking memory on the device means we'll lose state
  around reset. Solving that could be tricky. Finally,
  dependence on PASID can not be removed apparently.
  So maybe, people who want to track memory changes on the
  device itself should just bite the bullet and use
  admin vq in the PF.
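
To make the x8 concern above concrete, here is a minimal sketch in C. It assumes 4 KiB pages and one bit per page, with bit i of byte n covering page 8n+i. That layout, the page size, and all function names are assumptions for illustration and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* assumption: 4 KiB pages */

/* Precise marking: set only the bit of the page that contains addr. */
static void mark_dirty_bit(uint8_t *bitmap, uint64_t addr)
{
    assert(bitmap != NULL);
    uint64_t pfn = addr >> PAGE_SHIFT;
    bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Coarse marking: set the whole byte that contains the page's bit,
 * which avoids a read-modify-write but flags 8 pages at once. */
static void mark_dirty_byte(uint8_t *bitmap, uint64_t addr)
{
    assert(bitmap != NULL);
    uint64_t pfn = addr >> PAGE_SHIFT;
    bitmap[pfn / 8] = 0xff;
}

/* Count the pages a migration pass would have to resend. */
static size_t count_dirty_pages(const uint8_t *bitmap, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        for (uint8_t b = bitmap[i]; b != 0; b &= (uint8_t)(b - 1))
            n++; /* each iteration clears the lowest set bit */
    return n;
}
```

With a single write to one page, the bit-granular bitmap reports one dirty page while the byte-granular one reports eight, so up to eight times as much memory may be resent for sparse writes.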




-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-06  9:45           ` Michael S. Tsirkin
@ 2023-11-07  8:11             ` Zhu, Lingshan
  2023-11-07  8:22               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  8:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/6/2023 5:45 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 06, 2023 at 05:42:10PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 5:35 PM, Michael S. Tsirkin wrote:
>>> On Fri, Nov 03, 2023 at 10:49:42PM +0800, Zhu, Lingshan wrote:
>>>>           +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
>>>>           +in \field{Available State} field,
>>>>           +and correspondingly restore the Available State of every enabled splited virtqueue
>>>>           +from \field{Available State} field when DRIVER_OK is set.
>>>>           +
>>>>           +The device SHOULD reset \field{Available State} field upon a device reset.
>>>>
>>>>       At this point I have no idea
>>>>       - how can a state of a virtqueue at a random time be represented
>>>>         by a 16 bit integer
>>>>
>>>> not sure what is a random time, this is to request the device to reset
>>>> its avail state, for example, it is "le16 queue_avail_state" in virtio-pci
>>>> common cfg. Resetting this so the device will not recover from a wrong value of
>>>> the last run.
>>> You simply never bother to say what is "Available State" and what
>>> does it mean to restore it.  Not to mention words like "splited"
>>> which just adds to the confusion.
>> It says:
>> +The available state field is two bytes of virtqueue state that is used by
>> +the device to read the next available buffer. It is presented in the
>> following format:
>>
>> Do you want me to add more descriptions?
> maybe start with an example
I think they are already in the spec, I can add:
see also "2.7.6 The Virtqueue Available Ring" and "2.7.13.1 Placing 
Buffers Into The Descriptor Table"
>
>>>>       - if it's not at a random time then why do you even need an integer -
>>>>         synchronize queue to memory and then all state is in memory
>>>>
>>>> Not sure what is a sync queue, but for example, "le16 queue_avail_state" for
>>>> PCI transport exists in a cap.
>>> I just point out that normally a lot of ring state is in memory.
>>> So you need to be much more specific about how you are augmenting that.
>>> For example, if buffers are used exactly in order for a split ring
>>> then used index seems to be exactly the same as last available index
>>> you describe - it's a free running counter. OTOH if they are not
>>> used in order then I don't see how is a single index sufficient to
>>> describe which ones have been used and which not.
>> I am not sure I get it.
>>
>> Used idx(not like packed vq, no over-writing descriptors) and other states
>> are in guest memory, so migrated with guest migration.
> yes and so? why is that not enough and what is this available state then?
The spec already illustrates how the available index works and gives an
example (see the sections cited above).
And this patch gives an even clearer description of it.

Other states are in guest memory and migrated with guest memory.
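
For readers following the index discussion, here is a small sketch in C of how a free-running 16-bit index is typically consumed on a split ring. The helper names are hypothetical, and this is only an illustration of the wrap-around arithmetic, not the state format proposed in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of buffers the driver has made available that the device has
 * not fetched yet. Both indices are free-running 16-bit counters, so
 * the subtraction is done modulo 2^16 and survives wrap-around. */
static uint16_t pending_avail(uint16_t avail_idx, uint16_t last_avail_idx)
{
    return (uint16_t)(avail_idx - last_avail_idx);
}

/* Ring slot that a free-running index refers to. Split virtqueue sizes
 * are powers of two, so masking is equivalent to a modulo. */
static uint16_t ring_slot(uint16_t idx, uint16_t queue_size)
{
    assert(queue_size != 0 && (queue_size & (queue_size - 1)) == 0);
    return (uint16_t)(idx & (queue_size - 1));
}
```

Note that this captures only where the device should resume fetching. Which already-fetched buffers are still in flight when they complete out of order is not expressed by a single index, which is the point under discussion above.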
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-06 10:52               ` Parav Pandit
@ 2023-11-07  8:21                 ` Zhu, Lingshan
  2023-11-07  8:33                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  8:21 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/6/2023 6:52 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 2:51 PM
>>
>> On 11/6/2023 12:07 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 6, 2023 9:00 AM
>>>>
>>>> On 11/3/2023 11:54 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, November 3, 2023 8:25 PM
>>>>>>
>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>
>>>>>>>> This patch introduces a new status bit in the device status: SUSPEND.
>>>>>>>>
>>>>>>>> This SUSPEND bit can be used by the driver to suspend a device,
>>>>>>>> in order to stabilize the device states and virtqueue states.
>>>>>>>>
>>>>>>>> Its main use case is live migration.
>>>>>>>>
>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>>>>>>> You constantly complained that whatever was proposed using admin
>>>>>> commands method in [1] must work for passthrough and non-passthrough.
>>>>>>> And halfway in the discussion you propose a method after learning
>>>>>>> all the
>>>>>> limitations of in-band, you propose a solution only works for
>>>>>> non-passthrough mode.
>>>>>>> You asked someone to have comprehensive proposal and when it comes
>>>>>>> to
>>>>>> you following it, you just don’t.
>>>>>> not sure what you are talking about.
>>>>>>> And have most shallow commit message to not even mention it.
>>>>>>>
>>>>>>> Please be consistent in design approach.
>>>>>>> And if you don’t want to be, stop asking others.
>>>>>> this SUSPEND/RESUME doesn't change since the RFC series, how can it
>>>>>> not be inconsistent???
>>>>>>> This is not the way TC collaboration works.
>>>>>>> I probably shouldn’t even expect this from you.
>>>>> Your proposal does not cover both the use cases of passthrough and
>>>>> non-
>>>> passthrough.
>>>>> Yet you kept demanding them for others.
>>>>> This is just wrong.
>>>>>
>>>>> I am aware that both models as technical pros and cons.
>>>> Why this doesn't work? the device status byte has been working for
>>>> many years, and do you know when guest freeze, the hypervisor owns the
>> device????
>>> When the guest is not frozen and during the pre-copy phase, hypervisor needs
>> to access the device (context, dirty pages).
>>> How does it work if the guest owns the device?
>> Have you seen PASID there?
> PASID does not help because as explained virtio common config space and device specific config space is owned by the guest driver.
>
> Secondly PASID space is also owned by the guest driver.
The hypervisor sets a PASID to isolate the capability.
>
>>>>>>> [1]
>>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00472.html
>>>>>> Please don't be so emotional and please be professional.
>>>>>>
>>>>>> Why this solution can not work for pass-through? Do you know the
>>>>>> device ownership will be transferred to the hypervisor when guest
>>>>>> suspended in live migration?
>>>>> I explained 5 reasons why it does not work in previous reply.
>>>>>
>>>>> As the word indicates "live migration", the hypervisor needs to
>>>>> access the
>>>> device when it is "live" (not just after).
>>>>> Hence, passthrough mode must be able to capture the state of the
>>>>> device and
>>>> dirty pages database when its live.
>>>>> (and after the source is suspended).
>>>> No, the hypervisor should only collect dirty pages when the device alive.
>>> It is needed during both the times.
>>> When the device and guest is live during pre-copy phase.
>>> And after the device is frozen, to get the final round of pages.
>> With PASID, dirty page tracking facility can be isolated from the guest, means
>> the hypervisor owns this facility. So the hypervisor can collect the dirty pages.
>>
>> When the device suspended, it should report the last round of dirty pages
>> through dirty page tracking facility as expected.
>>
>> This can work, right?
> Unfortunately no, as non atomic bitmap cannot reside in the host memory,
As explained before, both PCI and the CPU support atomic read/write. Please
refer to the PCI spec and the CPU ISA.
> And whatever is in the device gets reset on device reset and/or FLR. So the dirty map detail is lost.
> Similarly the device context is also lost on these two events triggered by guest.
As we explained before, the device should clear everything when it is reset.
>
>>>> As you can see, the dirty page tracking facility has a PASID for
>>>> isolation. But still, the question is, we should better use platform
>>>> dirty page tracking
>>>>
>>> Nothing to do with PASID, as PASID is owned by the guest.
>> It looks you don't know how PASID work.
>> Host can setup PASID to isolate some facilities, right?
> There are few limitations with PASID.
> a. All platforms do not have PASID and
As we have explained many times, this is a basic facility,
and the implementation is transport-specific.

We have given an example PCI implementation, and PCI supports PASID, right?
> b. I explained above PASID do not work always as PASID only bifurcates DMA not the device _functionality_.
With a PASID, a cap can be considered to be placed in another logical 
address space, which is not accessible to the guest.
> c. PASID to be available to guest as_is what is present on the device
The host hypervisor sets the PASID, transparently to the guest.
>
>>>> Then suspend the device after guest freeze, to stabilize the device
>>>> status, then read the status.
>>>>
>>>> How can you say this does not work???
>>> I explained above.
>> see above




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-07  8:11             ` Zhu, Lingshan
@ 2023-11-07  8:22               ` Michael S. Tsirkin
  2023-11-08  4:08                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07  8:22 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Tue, Nov 07, 2023 at 04:11:25PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 5:45 PM, Michael S. Tsirkin wrote:
> > On Mon, Nov 06, 2023 at 05:42:10PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/6/2023 5:35 PM, Michael S. Tsirkin wrote:
> > > > On Fri, Nov 03, 2023 at 10:49:42PM +0800, Zhu, Lingshan wrote:
> > > > >           +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
> > > > >           +in \field{Available State} field,
> > > > >           +and correspondingly restore the Available State of every enabled splited virtqueue
> > > > >           +from \field{Available State} field when DRIVER_OK is set.
> > > > >           +
> > > > >           +The device SHOULD reset \field{Available State} field upon a device reset.
> > > > > 
> > > > >       At this point I have no idea
> > > > >       - how can a state of a virtqueue at a random time be represented
> > > > >         by a 16 bit integer
> > > > > 
> > > > > not sure what is a random time, this is to request the device to reset
> > > > > its avail state, for example, it is "le16 queue_avail_state" in virtio-pci
> > > > > common cfg. Resetting this so the device will not recover from a wrong value of
> > > > > the last run.
> > > > You simply never bother to say what is "Available State" and what
> > > > does it mean to restore it.  Not to mention words like "splited"
> > > > which just adds to the confusion.
> > > It says:
> > > +The available state field is two bytes of virtqueue state that is used by
> > > +the device to read the next available buffer. It is presented in the
> > > following format:
> > > 
> > > Do you want me to add more descriptions?
> > maybe start with an example
> I think they are already in the spec, I can add:
> see also "2.7.6 The Virtqueue Available Ring" and "2.7.13.1 Placing Buffers
> Into The Descriptor Table"
> > 
> > > > >       - if it's not at a random time then why do you even need an integer -
> > > > >         synchronize queue to memory and then all state is in memory
> > > > > 
> > > > > Not sure what is a sync queue, but for example, "le16 queue_avail_state" for
> > > > > PCI transport exists in a cap.
> > > > I just point out that normally a lot of ring state is in memory.
> > > > So you need to be much more specific about how you are augmenting that.
> > > > For example, if buffers are used exactly in order for a split ring
> > > > then used index seems to be exactly the same as last available index
> > > > you describe - it's a free running counter. OTOH if they are not
> > > > used in order then I don't see how is a single index sufficient to
> > > > describe which ones have been used and which not.
> > > I am not sure I get it.
> > > 
> > > Used idx(not like packed vq, no over-writing descriptors) and other states
> > > are in guest memory, so migrated with guest migration.
> > yes and so? why is that not enough and what is this available state then?
> The spec has illustrated how available index work and has given an
> example(see above cited sections)
> And this patch even has given a more clear description for it.
> 
> Other states are in guest memory and migrated with guest memory.

Yea I wrote large parts of it and I know how the available index works.

And sorry no idea what you are talking about.

At any time, there can be up to 2^16 buffers that have been made
available, and a random subset of these have been used. There is no
chance in the world a single 16 bit index describes even that part of
state, never mind device type specific processing that might be going
on.

As a wild guess this proposal is making a bunch of unstated assumptions
about device being in a very specific state where this *is* possible.
For people to be able to implement devices and drivers these
need to be spelled out.
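
The point about a single index can be illustrated with a small model. This
is only a sketch under stated assumptions, not anything from the patch:
struct vq_model, QUEUE_SIZE and the vq_* helpers are hypothetical names, and
the per-buffer in_flight array stands in for whatever the device tracks
internally. Two queues that completed different buffers out of order end up
with the same (last_avail_idx, used_idx) pair but different sets of buffers
still in flight; the pair describes the queue fully only once nothing is in
flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE 8u /* hypothetical ring size for this sketch */

/* Hypothetical model of what a device knows about one split virtqueue. */
struct vq_model {
    uint16_t last_avail_idx;        /* free-running count of buffers fetched */
    uint16_t used_idx;              /* free-running count of buffers returned */
    bool     in_flight[QUEUE_SIZE]; /* per head: fetched but not yet used */
};

/* Device takes the buffer with descriptor head `head` from the avail ring. */
static void vq_fetch(struct vq_model *vq, uint16_t head)
{
    vq->in_flight[head % QUEUE_SIZE] = true;
    vq->last_avail_idx++;
}

/* Device returns the buffer `head` through the used ring, in any order. */
static void vq_complete(struct vq_model *vq, uint16_t head)
{
    assert(vq->in_flight[head % QUEUE_SIZE]);
    vq->in_flight[head % QUEUE_SIZE] = false;
    vq->used_idx++;
}

/* Both queues report the same pair of 16-bit indices. */
static bool vq_same_indices(const struct vq_model *a, const struct vq_model *b)
{
    return a->last_avail_idx == b->last_avail_idx &&
           a->used_idx == b->used_idx;
}

/* Both queues have exactly the same buffers outstanding. */
static bool vq_same_in_flight(const struct vq_model *a, const struct vq_model *b)
{
    return memcmp(a->in_flight, b->in_flight, sizeof a->in_flight) == 0;
}

/* Only in this state do the two indices describe the queue completely. */
static bool vq_is_quiesced(const struct vq_model *vq)
{
    return vq->last_avail_idx == vq->used_idx;
}
```

If both models fetch heads 0, 1 and 2, and one completes head 0 while the
other completes head 2, vq_same_indices() holds while vq_same_in_flight()
does not. That is the kind of assumption (for example "no buffers in flight
when the state is read") that would need to be written down.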


-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-07  8:21                 ` Zhu, Lingshan
@ 2023-11-07  8:33                   ` Michael S. Tsirkin
  2023-11-07  9:24                     ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07  8:33 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 04:21:13PM +0800, Zhu, Lingshan wrote:
> > > This can work, right?
> > Unfortunately no, as non atomic bitmap cannot reside in the host memory,
> as explained before, PCI and CPU supports atomic read/write. Please refer to
> PCI spec and CPU ISA.

I don't see how atomic read or write does anything useful here but maybe.
You need to explain how you are using atomics in your proposal then.


> > And whatever is in the device gets reset on device reset and/or FLR. So the dirty map detail is lost.
> > Similarly the device context is also lost on these two events triggered by guest.
> we explained before, when reset, the device should clear everything.

then migration will corrupt memory. Not great.



> > 
> > > > > As you can see, the dirty page tracking facility has a PASID for
> > > > > isolation. But still, the question is, we should better use platform
> > > > > dirty page tracking
> > > > > 
> > > > Nothing to do with PASID, as PASID is owned by the guest.
> > > It looks you don't know how PASID work.
> > > Host can setup PASID to isolate some facilities, right?
> > There are few limitations with PASID.
> > a. All platforms do not have PASID and
> As we have explained for many times, this is a basic facility,
> and the implementation is transport-specific.
> 
> We given an example of PCI implementation, and PCI support PASID, right?

Yes it's a limitation but maybe one we can live with
for this feature.  It does mean that we might need solutions
for systems without this support. virtio use is not limited
to servers or high end systems.


> > b. I explained above PASID do not work always as PASID only bifurcates DMA not the device _functionality_.
> With a PASID, a cap can be considered to be placed in another logical
> address space, which is not accessible to the guest.
> > c. PASID to be available to guest as_is what is present on the device
> host hypervisor sets the PASID, transparent to the guest.

Lingshan, whenever people ask you a ton of questions in response to
your spec proposal, the response should not be to simply
answer on the mailing list and then repost without a lot of changes,
since spec readers will likely have questions exactly like these
and we cannot make them go and read this flame war.
And frankly, most of this TC stopped following this thread a while ago,
it seems to be going nowhere.
The response should be to add the explanation in the spec.
Look at Parav's live migration proposals with "theory of operation" chapters
for an example of how this can be done.

> > 
> > > > > Then suspend the device after guest freeze, to stabilize the device
> > > > > status, then read the status.
> > > > > 
> > > > > How can you say this does not work???
> > > > I explained above.
> > > see above




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-06  9:43   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-07  9:09     ` Zhu, Lingshan
  2023-11-08 17:55       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav




On 11/6/2023 5:43 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:33PM +0800, Zhu Lingshan wrote:
>> This patch introduces a new status bit in the device status: SUSPEND.
>>
>> This SUSPEND bit can be used by the driver to suspend a device,
>> in order to stabilize the device states and virtqueue states.
>>
>> Its main use case is live migration.
>>
>> Signed-off-by: Zhu Lingshan<lingshan.zhu@intel.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>> ---
>>   content.tex | 36 ++++++++++++++++++++++++++++++++++--
>>   1 file changed, 34 insertions(+), 2 deletions(-)
>>
>> diff --git a/content.tex b/content.tex
>> index 76813b5..bcc9d4b 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   
>>   \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>>     an error from which it can't recover.
>> +
>> +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that the
>> +  device has been suspended by the driver.
>> +
> what does this mean?
When the driver sets SUSPEND and the device presents SUSPEND, it means
the device has been suspended by the driver.

Do you suggest removing "When VIRTIO_F_SUSPEND is negotiated"?
>
>>   \end{description}
>>   
>>   The \field{device status} field starts out as 0, and is reinitialized to 0 by
>> @@ -73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   recover by issuing a reset.
>>   \end{note}
>>   
>> +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
>> +
>> +When setting SUSPEND, the driver MUST re-read \field{device status} to ensure the SUSPEND bit is set.
>> +
> and if it's not?
Then the device may have run into an error, or may just need more time
to suspend.

This is how we handle FEATURES_OK: "Re-read device status to ensure the
FEATURES_OK bit is still set"
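
A driver-side reading of the "set SUSPEND, then re-read device status" rule
could look like the sketch below. It is an illustration only: the status
accessors, the result codes, the poll limit and the toy fake_dev are all
hypothetical, the SUSPEND value of 16 is the one proposed in this patch, and
how long a driver should wait is not something the patch defines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_FEATURES_OK         8u
#define VIRTIO_STATUS_SUSPEND            16u /* value proposed in this patch */
#define VIRTIO_STATUS_DEVICE_NEEDS_RESET 64u

/* Hypothetical transport accessors for the device status field. */
struct status_ops {
    uint8_t (*read)(void *dev);
    void    (*write)(void *dev, uint8_t value);
};

enum suspend_result {
    SUSPEND_OK,           /* device presented SUSPEND */
    SUSPEND_TIMEOUT,      /* device did not present SUSPEND within max_polls */
    SUSPEND_DEVICE_ERROR, /* device raised DEVICE_NEEDS_RESET instead */
    SUSPEND_NOT_ALLOWED   /* FEATURES_OK not set, so SUSPEND is not written */
};

/* Set SUSPEND, then re-read device status until the device presents it. */
static enum suspend_result virtio_try_suspend(const struct status_ops *ops,
                                              void *dev, unsigned max_polls)
{
    assert(ops && ops->read && ops->write);

    uint8_t status = ops->read(dev);
    if (!(status & VIRTIO_STATUS_FEATURES_OK))
        return SUSPEND_NOT_ALLOWED;

    ops->write(dev, (uint8_t)(status | VIRTIO_STATUS_SUSPEND));

    for (unsigned i = 0; i < max_polls; i++) {
        status = ops->read(dev);
        if (status & VIRTIO_STATUS_DEVICE_NEEDS_RESET)
            return SUSPEND_DEVICE_ERROR;
        if (status & VIRTIO_STATUS_SUSPEND)
            return SUSPEND_OK;
    }
    return SUSPEND_TIMEOUT;
}

/* Toy device for illustration: presents SUSPEND after `delay` status reads. */
struct fake_dev {
    uint8_t  status;
    bool     suspend_requested;
    unsigned delay;
};

static uint8_t fake_read(void *p)
{
    struct fake_dev *d = p;
    if (d->suspend_requested) {
        if (d->delay == 0)
            d->status |= VIRTIO_STATUS_SUSPEND;
        else
            d->delay--;
    }
    return d->status;
}

static void fake_write(void *p, uint8_t value)
{
    struct fake_dev *d = p;
    if (value & VIRTIO_STATUS_SUSPEND)
        d->suspend_requested = true;
    /* The device has not finished suspending yet, so it does not show the bit. */
    d->status = (uint8_t)(value & ~VIRTIO_STATUS_SUSPEND);
}
```

The sketch separates "needs more time" (SUSPEND_TIMEOUT) from "ran into an
error" (SUSPEND_DEVICE_ERROR), which are the two outcomes mentioned above.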
>
>>   \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>>   
>>   The device MUST NOT consume buffers or send any used buffer
>> @@ -82,6 +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>   that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>>   MUST send a device configuration change notification to the driver.
>>   
>> +The device MUST ignore SUSPEND if FEATURES_OK is not set.
>> +
>> +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
>> +
>> +The device SHOULD allow settings to \field{device status} even when SUSPEND is set.
> which settings?
any legit writing to the device status, like DRIVER_OK
>
>> +
>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD clear SUSPEND
>> +and resumes operation upon DRIVER_OK.
>> +
>> +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND,
>> +the device SHOULD perform the following actions before presenting SUSPEND bit in the \field{device status}:
>> +
>> +\begin{itemize}
>> +\item Stop consuming buffers of any virtqueues and mark all finished descritors as used.
> descritors? and what does finished mean?
Sorry my typo.

Finished means done processing it.

Like the spec words: When the device has finished a buffer, it writes 
the descriptor index into the used ring, and sends a used buffer 
notification.
>
>> +\item Wait until all descriptors that being processed to finish and mark them as used.
> descriptors are not marked used. buffers are.
>
> that being -> that are being maybe?
Will fix
>
>> +\item Flush all used buffer and send used buffer notifications to the driver.
> used buffers?
Here it means the buffer marked as used.
shall I use finished buffer or any other suggestions?
> what does Flush mean?
Flush means send all of them out. Like 5.19.7.1 Device Requirements: 
Device Operation: Virtqueue flush

>
>> +\item Record Virtqueue State of each enabled virtqueue, see section \ref{sec:Virtqueues / Virtqueue State}
> execpt that one unfortunately does not bother to say what does this mean
> :(
The virtqueue state has been defined in this series, in 
packed/split-ring.tex.
And a PCI implementation of the interfaces is included.

Do you suggest any supplementary materials?
>
>> +\item Pause its operation except \field{device status} and preserve configurations in its Device Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
> How do you Pause? For example, consider a link state register. You set
The device pauses itself.
> SUSPEND, then link goes down. What is device supposed to do?
Once the device is suspended, it should not respond to the link_down
until it is alive again. This is to preserve the device state: just record
whatever it is when SUSPEND-ed, and process the signal on resume or
when alive at the destination side. The destination also needs a
live announce, which requires an active link.
> Record this somewhere internal but do not show it to driver?
> And how exactly will this hidden internal state be migrated
> since it is not visible?
May I know what kind of internal state you mean?
This series migrates stateless devices; it is hard to define a device
context for something like virtio-fs.
>
>
>> +\end{itemize}
>> +
>>   \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
>>   
>>   Each virtio device offers all the features it understands.  During
>> @@ -99,10 +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
>>   \begin{description}
>>   \item[0 to 23, and 50 to 127] Feature bits for the specific device type
>>   
>> -\item[24 to 42] Feature bits reserved for extensions to the queue and
>> +\item[24 to 43] Feature bits reserved for extensions to the queue and
>>     feature negotiation mechanisms
>>   
>> -\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
>> +\item[44 to 49, and 128 and above] Feature bits reserved for future extensions.
>>   \end{description}
>>   
>>   \begin{note}
>> @@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>     \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
>>     to access its internal virtqueue state.
>>   
>> +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
>> +   SUSPEND the device.
> why is SUSPEND upper-case here?
will be lower in V3.

Thanks
>
>> +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
>> +
>>   \end{description}
>>   
>>   \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>> -- 
>> 2.35.3
>


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-07  8:33                   ` Michael S. Tsirkin
@ 2023-11-07  9:24                     ` Zhu, Lingshan
  2023-11-08  7:42                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/7/2023 4:33 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 04:21:13PM +0800, Zhu, Lingshan wrote:
>>>> This can work, right?
>>> Unfortunately no, as non atomic bitmap cannot reside in the host memory,
>> as explained before, PCI and CPU supports atomic read/write. Please refer to
>> PCI spec and CPU ISA.
> I don't see how atomic read or write does anything useful here but maybe.
Because the device writes the bitmap and the driver "reads and clears"
the bitmap, the operations need to be atomic, or they can race.
> You need to explain how you are using atomics in your proposal then.
I am not sure we should say much about how atomics work. As explained above,
the operations should be atomic to avoid race conditions or losing
information.
For example:

1) Driver Read
2) Device Write
3) Driver Clear

Here we lost the bitmap information.
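
The lost-update sequence listed above can be sketched on the CPU side alone.
This is a minimal illustration using C11 atomics, not the mechanism in the
patch: dirty_mark, dirty_read_and_clear and dirty_read_then_clear_racy are
hypothetical names, one 64-bit word stands in for the whole bitmap, and
whether a given device can actually issue atomic operations against host
memory depends on the platform and is not shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Writer side: mark one page dirty without disturbing the other bits. */
static void dirty_mark(_Atomic uint64_t *word, unsigned bit)
{
    assert(bit < 64);
    atomic_fetch_or(word, UINT64_C(1) << bit);
}

/* Reader side: fetch and clear in a single atomic step, so a bit set by
 * the writer lands either in this snapshot or in the next one. */
static uint64_t dirty_read_and_clear(_Atomic uint64_t *word)
{
    return atomic_exchange(word, 0);
}

/* Racy variant: separate load and store. `late_bit` simulates a write that
 * arrives between the two steps; it is wiped by the store and never seen. */
static uint64_t dirty_read_then_clear_racy(_Atomic uint64_t *word,
                                           unsigned late_bit)
{
    uint64_t seen = atomic_load(word);
    dirty_mark(word, late_bit);
    atomic_store(word, 0);
    return seen;
}
```

With the racy variant, a page dirtied between the read and the clear never
reaches the reader; with the exchange, it shows up in the following round.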
>
>
>>> And whatever is in the device gets reset on device reset and/or FLR. So the dirty map detail is lost.
>>> Similarly the device context is also lost on these two events triggered by guest.
>> we explained before, when reset, the device should clear everything.
> then migration will corrupt memory. Not great.
I think the device should clear everything on reset, and therefore the driver
should clear the stale data as well. I don't see how this corrupts memory.
>
>
>
>>>>>> As you can see, the dirty page tracking facility has a PASID for
>>>>>> isolation. But still, the question is, we should better use platform
>>>>>> dirty page tracking
>>>>>>
>>>>> Nothing to do with PASID, as PASID is owned by the guest.
>>>> It looks you don't know how PASID work.
>>>> Host can setup PASID to isolate some facilities, right?
>>> There are few limitations with PASID.
>>> a. All platforms do not have PASID and
>> As we have explained for many times, this is a basic facility,
>> and the implementation is transport-specific.
>>
>> We given an example of PCI implementation, and PCI support PASID, right?
> Yes it's a limitation but maybe one we can live with
> for this feature.  It does mean that we might need solutions
> for systems without this support. virtio use is not limited
> to servers or high end systems.
PASID was introduced years ago, and I know some vendors have implemented
onboard IOMMUs that can also do the isolation.

And this is a basic facility, the implementation is transport specific.
>
>
>>> b. I explained above PASID do not work always as PASID only bifurcates DMA not the device _functionality_.
>> With a PASID, a cap can be considered to be placed in another logical
>> address space, which is not accessible to the guest.
>>> c. PASID to be available to guest as_is what is present on the device
>> host hypervisor sets the PASID, transparent to the guest.
> Lingshan whenever people ask you a ton of questions in response to
> your spec proposal then respose should not be to simply
> answer on the mailing list and then repost without a lot of changes
> since spec readers will likely have questions exactly like these
> and we can not make them go and read this flame war.
Well, I should say, I have repeated the same answers too many times.
> And frankly, most of this TC stopped following this thread a while ago,
> it seems to be going nowhere.
I still believe we should release the best quality
of spec as we can.
> The response should be to add the explanation in the spec.
> Look at Parav's live migration proposals with "theory of operation" chapters
> for an example of how this can be done.
I am not sure we should describe how PCI works in the virtio spec.
But I can add "PASID for isolation".

These facilities are not only used for live migration,
can also work for debugging. Like suspend then read vq state.

I can add more explanation in the cover letter
>
>>>>>> Then suspend the device after guest freeze, to stabilize the device
>>>>>> status, then read the status.
>>>>>>
>>>>>> How can you say this does not work???
>>>>> I explained above.
>>>> see above
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-06  9:49   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-07  9:27     ` Zhu, Lingshan
  2023-11-08 17:46       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>> When SUSPEND is set, device states and virtqueue states
>> should be stablized, therefore the driver should not
>> reset vqs when SUSPEND is set in device status.
>>
>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>> ---
>>   content.tex | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/content.tex b/content.tex
>> index bcc9d4b..060b5c2 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic Facilities of a Virtio Device /
>>   The device MUST reset any state of a virtqueue to the default state,
>>   including the available state and the used state.
>>   
>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in \field{device status},
>> +the driver SHOULD NOT reset any virtqueues.
>> +
>>   \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>   
>>   After the driver tells the device to reset a queue, the driver MUST verify that
> Seems somewhat arbitrary and breaks the claim that the
> feature is orthogonal and can have uses besides migration.
when suspended, the device is frozen.
The driver is aware of this process and so should not reset the vqs I think.
>
>
>
>> -- 
>> 2.35.3




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-06 10:52               ` Parav Pandit
@ 2023-11-07  9:31                 ` Zhu, Lingshan
  2023-11-08 17:44                   ` Michael S. Tsirkin
  2023-11-09  6:28                   ` Parav Pandit
  0 siblings, 2 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:31 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/6/2023 6:52 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 2:57 PM
>>
>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>
>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>
>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>
>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>>>>>
>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>> ---
>>>>>>>>      transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>      1 file changed, 18 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>> a5c6719..3161519 100644
>>>>>>>> --- a/transport-pci.tex
>>>>>>>> +++ b/transport-pci.tex
>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>> structure
>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>              /* About the administration virtqueue. */
>>>>>>>>              le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>              le16 admin_queue_num;         /* read-only for driver */
>>>>>>>> +
>>>>>>>> +	/* Virtqueue state */
>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>> This tiny interface for 128 virtio net queues through register
>>>>>>> read writes, does
>>>>>> not work effectively.
>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>> Hence toy registers like this do not work.
>>>>>> Do you know there is a queue_select? Why this does not work? Do you
>>>>>> know how other queue related fields work?
>>>>> :)
>>>>> Yes. If you notice queue_reset related critical spec bug fix was
>>>>> done when it
>>>> was introduced so that live migration can _actually_ work.
>>>>> When queue_select is done for 128 queues serially, it take a lot of
>>>>> time to
>>>> read those slow register interface for this + inflight descriptors + more.
>>>> interesting, virtio work in this pattern for many years, right?
>>> All these years 400Gbps and 800Gbps virtio was not present, number of
>> queues were not in hw.
>> The registers are control path in config space, how 400G or 800G affect??
> Because those are the one in practice requires large number of VQs.
>
> You are asking per VQ register commands to modify things dynamically via this one vq at a time, serializing all the operations.
> It does not scale well with high q count.
This is not dynamic, it only happens on SUSPEND and RESUME.
This is the same mechanism virtio uses to initialize a virtqueue, and it
has been working for many years.
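
For reference, the per-queue access pattern under discussion can be sketched
as below. This is a toy model, not driver code: fake_common_cfg and the cfg_*
accessors are hypothetical stand-ins for register accesses to the common
configuration structure, only queue_select and the two proposed state fields
are modelled, and endianness is ignored. The counter makes the serial cost
visible: one access for num_queues plus three per queue.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_QUEUES 128u /* hypothetical upper bound for this sketch */

/* Toy stand-in for the device side of the common configuration structure. */
struct fake_common_cfg {
    uint16_t num_queues;
    uint16_t queue_select;
    uint16_t avail_state[MAX_QUEUES]; /* backs queue_avail_state */
    uint16_t used_state[MAX_QUEUES];  /* backs queue_used_state */
    unsigned reg_accesses;            /* counts simulated register accesses */
};

struct vq_state {
    uint16_t avail;
    uint16_t used;
};

static uint16_t cfg_read_num_queues(struct fake_common_cfg *c)
{
    c->reg_accesses++;
    return c->num_queues;
}

static void cfg_write_queue_select(struct fake_common_cfg *c, uint16_t q)
{
    c->reg_accesses++;
    c->queue_select = q;
}

static uint16_t cfg_read_queue_avail_state(struct fake_common_cfg *c)
{
    assert(c->queue_select < MAX_QUEUES);
    c->reg_accesses++;
    return c->avail_state[c->queue_select];
}

static uint16_t cfg_read_queue_used_state(struct fake_common_cfg *c)
{
    assert(c->queue_select < MAX_QUEUES);
    c->reg_accesses++;
    return c->used_state[c->queue_select];
}

/* Select each queue in turn and read its two state fields.
 * Returns the number of queues saved into `out`. */
static uint16_t save_vq_states(struct fake_common_cfg *c,
                               struct vq_state *out, uint16_t out_len)
{
    uint16_t n = cfg_read_num_queues(c);
    if (n > out_len)
        n = out_len;
    for (uint16_t q = 0; q < n; q++) {
        cfg_write_queue_select(c, q);
        out[q].avail = cfg_read_queue_avail_state(c);
        out[q].used = cfg_read_queue_used_state(c);
    }
    return n;
}
```

With 128 queues the loop performs 385 simulated accesses. Whether that is
acceptable as a one-time cost at suspend, or too slow at high queue counts,
is exactly the disagreement in this sub-thread; the sketch takes no side.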
>> See the virtio common cfg, you will find the max number of vqs is there,
>> num_queues.
> :)
> Sure. those values at high q count affects.
The driver needs to initialize them anyway.
>
>>> Device didn’t support LM.
>>> Many limitations existed all these years and TC is improving and expanding
>> them.
>>> So all these years do not matter.
>> Not sure what are you talking about, haven't we initialize the device and vqs in
>> config space for years?????? What's wrong with this mechanism?
>> Are you questioning virito-pci fundamentals???
> Don’t point to in-efficient past to establish similar in-efficient future.
Interesting. You know this is a one-time thing, right?
And you are aware that this has been there for years.
>
>>>>>> Like how to set a queue size and enable it?
>>>>> Those are meant to be used before DRIVER_OK stage as they are init
>>>>> time
>>>> registers.
>>>>> Not to keep abusing them..
>>>> don't you need to set queue_size at the destination side?
>>> No.
>>> But the src/dst does not matter.
>>> Queue_size to be set before DRIVER_OK like rest of the registers, as all
>> queues must be created before the driver_ok phase.
>>> Queue_reset was last moment exception.
>> create a queue? Nvidia specific?
>>
> Huh. No.
> Do git log and realize what happened with queue_reset.
You didn't answer the question: does the spec even define "create a
vq"?
>
>> For standard virtio, you need to read the number of enabled vqs at the source
>> side, then enable them at the dst, so queue_size matters, not to create.
> All that happens in the pre-copy phase.
Yes, and how is your answer related to this discussion?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 10:15         ` Michael S. Tsirkin
@ 2023-11-07  9:43           ` Zhu, Lingshan
  2023-11-07 10:43             ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/6/2023 6:15 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 06, 2023 at 05:16:43PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/3/2023 10:21 PM, Zhu, Lingshan wrote:
>>>
>>> On 11/3/2023 6:46 PM, Michael S. Tsirkin wrote:
>>>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>>>> +\begin{lstlisting}
>>>>> +struct virtio_pci_dity_page_track {
>>>>> +        u8 enable;               /* Read-Write */
>>>>> +        u8 gra_power;            /* Read-Write */
>>>>> +        u8 reserved[2];
>>>>> +        le32 {
>>>>> +            pasid: 20;           /* Read-Write */
>>>>> +            reserved: 12;
>>>>> +        };
>>>>> +        le64 bitmap_addr;        /* Read-Write */
>>>>> +        le64 bitmap_length;      /* Read-Write */
>>>>> +};
>>>>> +\end{lstlisting}
>>>> Okay, so it's a simple mailbox in config space.  Which by itself is
>>>> probably a very reasonable idea - more or less what I suggested.
>>>> However, using such a generic facility just for the dirty bitmap seems
>>>> too limited.  Please make it accept arbitrary commands. Reusing admin
>>>> command structure with a special "device itself" group sounds like one
>>>> way to do it.
>>> processing admin cmds in a cap may be too complex and overkill.
>>> we need to handle variable length of cmds, handle async returned
>>> results, and so on.
>>>
>>> This struct seems easy and simple. And shall we use platform facilities
>>> like vt-d
>>> to track dirty pages?
>> To demonstrate these issues, suppose we have a struct in a bar to process
>> admin cmds:
>>
>> struct virtio_admin_cmd {
>>          u64 in_data_length;
>>          u8 cmd_in_data[];
>>          u64 out_data_length;
>>          u8 cmd_out_data[];
>>          u8 ret;
>> };
> An alternative is do same as you did here, e.g.:
> struct virtio_admin_cmd {
>           u64 admin_cmd_pa; /* an out descriptor followed by an in descriptor */
PA depends on IOMMU and ATS
>           u32 pasid : 20;
> 	 u8 reserved : 11;
> 	 u8 hardware : 1;
> };
>
> or we can stick two lengths and addresses straight in the capability.
It can still process only one cmd at a time, and the others are blocked.
>
>
>> The problems are:
>> 1) command_in_data and command_out data have variable length, so how many HW
>> resource should be reserved in the bar?
> actually admin commands are truncated by device so just
> set to length that device understands.
How does the framework know the length? How can the vendor know how many
HW resources they should place here? I am not sure it is a good
idea to guess.
>
>> 2) To process the cmds in the bar, the device MAY need to read many
>> registers in cmd_in_data[] and write many registers in cmd_out_data[],
>> which can be ineffective, this is not DMA.
> True. Again if you don't want to depend on pasid that's the
> only option.
Yes, this shows exactly how complex and overkill it is to use
a cap to handle admin cmds. And maybe admin cmds are not a must.
>
>
>> 3) a bar can only process one cmd at a time, and the driver can only issue
>> another cmd after received an ret.
>> This process has to be synchronous IO, one cmd blocks another.
> Exactly same as what you did though.
We have implemented a cap in PCI for dirty page tracking, there are no 
admin cmds.

And for dirty page tracking, we still believe it is better to use 
platform facilities,
as Jason explained.
>
>> 4) VF implementing a bar processing admin cmds conflicts with PF's admin vq.
> So just don't create conflicts. It's same as multiple admin vqs
> really which we already support.
How do we tell the SW not to create conflicts?
>
>
>> So I think a bar or a cap processing admin cmds is way to complex and
>> overkill.
>>
>> Thanks
> Sounds like a straw man argument.
>
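The "stick two lengths and addresses straight in the capability" alternative can be pictured with a small model. This is only a sketch under stated assumptions: `struct admin_mailbox`, its field names, and the `mailbox_submit`/`mailbox_complete` helpers are hypothetical and do not come from any posted patch. It shows that the register block stays fixed in size however long the command is (problem 1 above), while still accepting only one outstanding command (problem 3 above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register block; not taken from any posted patch. */
struct admin_mailbox {
    uint64_t cmd_addr;    /* address of the command buffer */
    uint64_t cmd_len;     /* command length in bytes */
    uint64_t resp_addr;   /* address of the response buffer */
    uint64_t resp_len;    /* response buffer length in bytes */
    uint32_t pasid;       /* low 20 bits used, upper bits reserved */
    uint8_t  busy;        /* 1 while a command is outstanding */
    uint8_t  status;      /* result of the last completed command */
    uint8_t  reserved[2];
};

/* The block is the same size for every command, long or short. */
static_assert(sizeof(struct admin_mailbox) == 40, "fixed-size mailbox");

/* Driver side: returns false if a command is still outstanding. */
static bool mailbox_submit(struct admin_mailbox *mb,
                           uint64_t cmd_addr, uint64_t cmd_len,
                           uint64_t resp_addr, uint64_t resp_len)
{
    if (mb->busy)
        return false;     /* only one command at a time */
    mb->cmd_addr = cmd_addr;
    mb->cmd_len = cmd_len;
    mb->resp_addr = resp_addr;
    mb->resp_len = resp_len;
    mb->busy = 1;
    return true;
}

/* Device side (modelled): finish the outstanding command. */
static void mailbox_complete(struct admin_mailbox *mb, uint8_t status)
{
    mb->status = status;
    mb->busy = 0;
}
```

In this model a second `mailbox_submit` fails until `mailbox_complete` has run, which is the synchronous behaviour both sides of the thread describe.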



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 10:33               ` Michael S. Tsirkin
@ 2023-11-07  9:48                 ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/6/2023 6:33 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 06, 2023 at 11:58:03AM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 12:12 AM, Michael S. Tsirkin wrote:
>>> On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
>>>>>> [1]
>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>>>> you still need to explain why this does not work for pass-through.
>>>> It does not work for following reasons.
>>>> 1. Because all the fields that put on the member device are not in direct control of the hypervisor.
>>>> The device is directly controlled by the guest including the device status and when it resets the device all the things stored in the device are lost.
>>> I think the idea is that when this gateway is in the device then
>>> device reset has to trap. At a high level, ok. But then what?
>> No, when device reset, the device is expected to forget everything and
>> re-intialize.
> That's a problem then - memory that was already written will not be
> detected as such.
Two cases:
1) memory was written and an interrupt has been sent. The CPU has then
acknowledged the data and should process it.
2) memory was written but no interrupt was sent; this should be lost as expected.
>
>>> Is a full scan of all memory required until device reset is complete?
>> Who scan the memory? The device tracks its own dirty pages.
> yes but reset erases this information.
It should be, or the hypervisor will copy invalid information and cause a
conflict.

I remember Jason has given an example of this.
>
>
>>> Drivers currently tend to busy poll the reset register,
>>> if this takes very long we might start seeing soft lockup
>>> messages. What is the idea then? Maybe for this we need a separate
>>> weaker reset that does not touch this capability?
>> When reset, how can we expect the LM progress continue running.
>>
>> For example, when the device DMA writes something, then reset before sending
>> an interrupt,
>> the DMA-ed pages should be lost as expected, right?
> interrupt is going to guest, has nothing to do with it.
>
> device writes data into memory
> device sends interrupt
> driver sees data
At this point, the guest CPU owns the dirty pages, owns the data
> driver sends reset
At this point, the device should forget everything
>
> meanwhile hypervisor did not see any dirty pages
>
> now what? hypervisor must apparently retrieve all
> dirty page data before it can reset the device.
It is not the hypervisor; once an interrupt has been sent,
the guest takes over the pages.
>



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 10:52                 ` [virtio-comment] " Parav Pandit
  2023-11-06 11:05                   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-07  9:52                   ` Zhu, Lingshan
  2023-11-07 11:33                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07  9:52 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/6/2023 6:52 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 3:04 PM
>>
>> On 11/6/2023 12:34 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 6, 2023 9:22 AM
>>>>
>>>>
>>>> On 11/3/2023 11:47 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, November 3, 2023 8:33 PM
>>>>>>
>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>> Sent: Friday, November 3, 2023 4:20 PM
>>>>>>>>
>>>>>>>> On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
>>>>>>>>> +\item[\field{bitmap_addr}]
>>>>>>>>> +	The driver use this to set the address of the bitmap which
>>>>>>>>> +records the
>>>>>>>> dirty pages
>>>>>>>>> +	caused by the device.
>>>>>>>>> +	Each bit in the bitmap represents one memory page, bit 0 in
>> the bitmap
>>>>>>>>> +	reprsents page 0 at address 0, bit 1 represents page 1, and so
>>>>>>>>> +on in a
>>>>>>>> linear manner.
>>>>>>>>> +	When \field{enable} is set to 1 and the device writes to a
>> memory page,
>>>>>>>>> +	the device MUST set the corresponding bit to 1 which
>>>>>>>>> +indicating the
>>>>>>>> page is dirty.
>>>>>>>>> +\item[\field{bitmap_length}]
>>>>>>>>> +	The driver use this to set the length in bytes of the bitmap.
>>>>>>>>> +\end{description}
>>>>>>>>> +
>>>>>>>>> +\devicenormative{\subsubsection}{Memory Dirty Pages Tracker
>>>>>>>>> +Capability}{Virtio Transport Options / Virtio Over PCI Bus /
>>>>>>>>> +Memory Dirty Pages Tracker Capability}
>>>>>>>>> +
>>>>>>>>> +The device MUST NOT set any bits beyond bitmap_length when
>>>>>>>>> +reporting
>>>>>>>> dirty pages.
>>>>>>>>> +
>>>>>>>>> +To prevent a read-modify-write procedure, if a memory page is
>>>>>>>>> +dirty,
>>>>>>> It is not to prevent; it is just not possible to do racy RMW. 😊
>>>>>> if you understand what is a atomic routine, you will not call it racy.
>>>>>>> Hence to work around you propose to mark all pages dirty. Too bad.
>>>>>>> This just does not work.
>>>>>> why? and this is optional.
>>>>> Because device cannot set individual bits in atomic way for same
>>>>> byte read by
>>>> the cpu.
>>>>> 1. device read the byte that had bit 0 and 4 set.
>>>>> 2. cpu atomically clear these bits.
>>>>> 3. device wrote bits 0, 4, and new bits 2 and 3.
>>>>> 4. cpu now transferred page 0 and 4 again.
>>>>>
>>>>> Optional thing also needs to work. :)
>>>> Do you know both CPU and device actually don't read bit, they read
>> bytes????
>>> Yes. this is why atomic_OR is not possible on pcie.
>>>
>>>> Do you know RC connected to memory controller????
>>> Yes.
>>>> Do you know there are locked transaction and atomic operations in PCI???
>>> Can you explain how PCI does RMW locked transaction?
>>> Is it one TLP or multiple?
>>>
>>>> Do you know there are atomic read/write/clear even read and clear and
>>>> so on in CPU ISA????
>>> Read is always atomic from cpu.
>>> I didn’t know about read_and_clear atomic ISA. This combined with pci future
>> support for atomic_or.
>>> If you already know a Linux kernel api for atomic_read_and_clear, please
>> share.
>> To answer all questions above, you should read PCI spec and CPU SDM, we don't
>> copy and paste the content here, nobody develop their knowledge this way.
> I read the pci spec and I see only 3 operations which does not have atomic or.
> I will try to find for the CPU instruction on read_and_clear that you suggested.
> Thanks for the suggestion.
>
>>>>>>> Secondly the bitmap array is function is for full guest memory
>>>>>>> size, while
>>>>>> there is lot of sparce region now and also in future.
>>>>>>> This is the second problem.
>>>>>> did you see gra_power and its comments?
>>>>> gra_power says the page size.
>>>>> Not the sparce multiple ranges of the guest memory.
>>>>> Device endup tracking uninterested area as well.
>>>> increase gra_power can reduce bitmap size, right?
>>>> Totally up to the hypervisor, right?
>>> Yes, and that can increase the amount of memory.
>>> The way I understood is, if gra_power is 2MB, than whole 2MB page to be
>> considered dirty, even if 8KB was dirty.
>>> Did I understand it right?
>> Do you know DMA are very likely to write a neighbor page? Do you know why
>> huge page is introduced?
>> Hint: not only for reduce TLB miss.
>>>>>>> This is exactly why I asked you to review the page write recording
>>>>>>> series of
>>>>>> admin commands and comment.
>>>>>>> And you never commented with sheer ignorance.
>>>>>>>
>>>>>>> So clearly the start stop method for specific range and without
>>>>>>> bandwidth
>>>>>> explosion, admin commands of [1] stands better.
>>>>>>> If you do [1] on the member device also using its AQ in future, it
>>>>>>> will work for
>>>>>> non-passthrough case.
>>>>>>> If you build non-passthrough live migration using [1], also it will work.
>>>>>>> So I don’t see any point of this series anymore.
>>>>>> As Jason pointed out, there are many problems in your proposal, you
>>>>>> should answer there. I don't need to repeat his words and duplicate
>>>>>> the
>>>> discussions.
>>>>> Many are already addressed in v3.
>>>> interesting, does your V3 support nested?
>>> Not directly.
>>> Is it similar to cpu PML which does not supported nested.
>>> One can always implement nested using some emulation.
>>> The second option for high performance would be allow SR-IOV cap on the VF
>> and support true nesting using existing proposal of v3.
>> If your proposal does not support nested, then it is incomplete.
>>>>>>> [1]
>>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>>>>> you still need to explain why this does not work for pass-through.
>>>>> It does not work for following reasons.
>>>>> 1. Because all the fields that put on the member device are not in
>>>>> direct
>>>> control of the hypervisor.
>>>>> The device is directly controlled by the guest including the device
>>>>> status and
>>>> when it resets the device all the things stored in the device are lost.
>>>> have you seen PASID? and if the device reset, it has to forget
>>>> everything as expected, right?
>>> PASID does not help with the reset. Because as you told reset, resets
>> everything.
>>> PASID does not bifurcate the device common control which is not linked to any
>> PASID.
>> PASID means some facilities can be isolated. When reset, the device forget
>> everything.
>>>>> 2. the PCI FLR is clearing all the registers you exposed here.
>>>> see above
>>>>> 3. Endless expansion of config registers of dirty tracking is not
>>>>> scalable, as they
>>>> are not init time registers not following the Appendix B guidelines.
>>>> endless expansion?? It is a complete set of dirty page tracking, right????
>>>> have you see this cap only controls? The device DMA writes the
>>>> bitmap, not by registers.
>>> Device dirty page tracking is start/stop command to be done by the
>> hypervisor.
>>> So when guest is resetting the device, it stopped the DMA initiated by the
>> hypervisor.
>>> This fundamentally breaks things.
>> Why? When device resets, do you want to keep tracking dirty pages????
> Yes, when the device resets, before that event occurred, all the pages which were dirtied, must be migrated.
> And after reset also new page tracking to continue.
That depends on whether there is an interrupt for the dirty pages.
If there is an interrupt, then the guest owns the pages.
>
>>>> Again, if you want to fix Appendix B, OK.
>>>>> 4. bitmap based dirty tracking is not atomic between cpu and device.
>>>>> Hence, it is racy.
>>>> see above, the first reply.
>>>>> 5. All the device context needed for passthrough based hypervisor
>>>>> for a
>>>> device type specific is missing.
>>>>> All of those can be used for non-passthrough as well.
>>>>> [1]
>>>>> https://lists.oasis-open.org/archives/virtio-comment/202311/msg00085.html
>>>> If you want to discuss this again, I don't want to wast time but only
>>>> asking you whether you want to define virtio-fs device context
>>> It will be defined in future.
>>> And if virtio-fs was not written with migration in mind, may be one will invent
>> virtio-fs2.
>> don't say future, talk is cheap, show me the code.
>>>>>> And I
>>>>>> remember this is a point-less topic as MST ever wants to mute
>>>>>> another
>>>>>> "pass- through" thread.
>>>>> No. he did not say that.
>>>>> He meant to not endlessly debate which one is better.
>>>>> He clearly said, try to see if you can make multiple hypervisor model work.
>>>>> And your series shows a clear ignorance of his guidance.
>>>> Let me quote MST's reply here:
>>>> "I feel this discussion will keep meandering because the terminology is
>> vague.
>>>> There's no single thing that is called "passthrough" - vendors just
>>>> build what is expedient with current hardware and software. Nvidia
>>>> has a bunch of people working on vfio so they call that passthrough,
>>>> Red Hat has people working on VDPA and they call that passthrough, etc.
>>>>
>>>>
>>>> Before I mute this discussion for good, does anyone here have any
>>>> feeling progress is made? What kind of progress? "
>>>>
>>>> So please don't discuss on pass-through anymore.
>>> I don’t want to discuss the pros and cons of passthrough vs, vdpa, as usual.
>>> V3 covers broader use case of passthrough, hence once can always implement
>> trap+emulation instead of passthrough.
>>> V3 already indicates that other variants of the passthrough can be done as
>> well or can be extended.
>>> So please explore if that fits your vdpa need.
>> So, please no pass-through discussion anymore.
>>>> It seems only you need to develop the knowledge
>>>>>>>>> +optionally the device is permitted to set the entire byte,
>>>>>>>>> +which encompasses
>>>>>>>> the relevant bit, to 1.
>>>>>>>>> +
>>>>>>>>> +The device MAY increase \field{gra_power} to reduce
>>>>>> \field{bitmap_length}.
>>>>>>>>> +
>>>>>>>>> +The device must ignore any writes to \field{pasid} if PASID
>>>>>>>>> +Extended Capability is absent or the PASID functionality is
>>>>>>>>> +disabled in PASID Extended Capability
>>>>>>>> I have to say this is going to work very badly when the number of
>>>>>>>> dirty pages is
>>>>>>>> small: you will end up scanning and re-scanning all of bitmap.
>>>>>>>> And the resolution is apparently 8 pages? You have just
>>>>>>>> multiplied the migration bandwidth by a factor of 8.
>>>>>>> Yeah.
>>>>>>> And device does not even know previously reported pages are read
>>>>>>> by driver
>>>>>> or not. All guess work game for driver and device.
>>>>>> see my reply to him
>>>>> Please see above reply.
>>>> see above



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 11:13                 ` [virtio-comment] " Parav Pandit
@ 2023-11-07 10:01                   ` Zhu, Lingshan
  2023-11-07 10:25                     ` Michael S. Tsirkin
  2023-11-07 12:00                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07 10:01 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/6/2023 7:13 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 6, 2023 3:04 PM
>> So, please no pass-through discussion anymore.
> If you comment like this, nothing can progress.
>
> What you are implying with above language is:
> "hey a virtio can do live migration ONLY by creating vdpa device on top of ALREADY virtio device, and you get another virtio device by running through 3 layers of stack you get virtio device on other side!".
I never said that, right? I keep explaining how pass-through and "trap and
emulate" work; I even explained how PASID works.
>
> Then for sure, I disagree to it for 100% for such a single-minded design.
>
> At least I am trying to propose if a solution can work for generic passthrough where least amount of hypervisor mediation is done.
>
> And an extension where hypervisor has choice to more medication layers as it finds suitable.
> And if there are technical issues, may be two different interfaces or more admin commands needed for two modes.
> The idea is to attempt to converge and discuss those details, not the opposite.
>
> Your above comment shows a clear sign of non-collaboration to make both mode works.
Well, I see you are emotional. Please take a deep breath and calm down,
be professional,
and give yourself a break; it is really not necessary to be mad at me.

As you know, I am just a Junior Engineer at Intel, unlike you, a Senior
Principal Engineer who has spent many years and
has developed knowledge in this area. So I am quite technically focused;
these have all been technical discussions so far.

We always welcome collaboration. Remember, Jason has proposed a solution
to build the admin vq based on these basic facilities,
and I fully agree with his proposal.
> At one point I may probably stop responding to your comments that repeatedly says:
>
> "Go read QEMU code, Do you know what is PASID?, Do you know num_queues, Go read PCI spec"...
With all respect, you should do these because they are textbook knowledge.
>
> Taking deep breath now to do some productive work in TC...



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 11:21                 ` [virtio-comment] " Parav Pandit
  2023-11-06 11:27                   ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-07 10:02                   ` Zhu, Lingshan
  2023-11-07 11:36                     ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07 10:02 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/6/2023 7:21 PM, Parav Pandit wrote:
>> From: Michael S. Tsirkin <mst@redhat.com>
>> Sent: Monday, November 6, 2023 4:00 PM
>>
>> On Mon, Nov 06, 2023 at 04:34:44AM +0000, Parav Pandit wrote:
>>>> Do you know there are locked transaction and atomic operations in PCI???
>>> Can you explain how PCI does RMW locked transaction?
>>> Is it one TLP or multiple?
>> Parav what are you asking about here?
>> pcie supports CAS and Swap which likely
>> can work for this use-case - these are non posted writes. It's in the pcie spec.
> PCI spec do not have atomic OR operation.
> Lingshan in above comment suggested some unknown locked transaction and atomic operation.
> So I was asking him which is that atomic operation and how PCI does it?
> I don't know if any that can do PCI atomic OR without a workaround.
6.15 Atomic Operations (AtomicOps)
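The disagreement above is whether a device can set dirty bits without losing a concurrent CPU clear when no atomic OR exists but CAS and Swap do. The following sketch models the idea with C11 atomics on ordinary host memory; it is not PCIe code and says nothing about how a real device would issue AtomicOp requests. The function names are hypothetical. `or_via_cas` builds an OR out of compare-and-swap, and `read_and_clear` is the CPU-side "read and clear" done as an atomic exchange with zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Device side (modelled): OR `bits` into *word using only compare-and-swap.
 * Returns the value that was in *word just before the successful update. */
static uint32_t or_via_cas(_Atomic uint32_t *word, uint32_t bits)
{
    assert(word != NULL);
    uint32_t old = atomic_load(word);
    /* On failure the call stores the current value into `old`, so the next
     * attempt ORs into whatever the CPU left behind (possibly 0). */
    while (!atomic_compare_exchange_weak(word, &old, old | bits))
        ;
    return old;
}

/* CPU side: fetch the dirty bits and clear them in one atomic step. */
static uint32_t read_and_clear(_Atomic uint32_t *word)
{
    assert(word != NULL);
    return atomic_exchange(word, 0);
}
```

Replaying the four-step scenario quoted earlier in this message: the word holds bits 0 and 4 (0x11), the CPU calls `read_and_clear` and gets 0x11, and the device then adds bits 2 and 3 with `or_via_cas`. Because the compare fails against a stale 0x11 and retries against 0, the word ends up as 0x0c, and pages 0 and 4 are not reported a second time. A plain read-modify-write by the device would instead write back 0x1d.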



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 10:01                   ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-07 10:25                     ` Michael S. Tsirkin
  2023-11-07 11:12                       ` [virtio-comment] " Parav Pandit
  2023-11-08  9:36                       ` Zhu, Lingshan
  2023-11-07 12:00                     ` Michael S. Tsirkin
  1 sibling, 2 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 10:25 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 06:01:27PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 7:13 PM, Parav Pandit wrote:
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Monday, November 6, 2023 3:04 PM
> > > So, please no pass-through discussion anymore.
> > If you comment like this, nothing can progress.
> > 
> > What you are implying with above language is:
> > "hey a virtio can do live migration ONLY by creating vdpa device on top of ALREADY virtio device, and you get another virtio device by running through 3 layers of stack you get virtio device on other side!".
> I never said that, right? I keep explaining how pass-through and "trap and
> emulate" work; I even explained how PASID works.

Parav, Lingshan can we please stop the "what is pass through" arguments?

I think that the term is vague, as
(almost?) no hypervisor passes all accesses through without exception.
And the fact that you have been speaking past each other on this subject
for how long now? seems to demonstrate I'm right.

Describing migration in the spec as opposed to leaving it up to
hypervisors seems valuable at least to me since historically hypervisors
did such a bad job of it. So I personally feel it's nice if it's there,
and the SUSPEND bit only works after DRIVER_OK. So that's an example
argument that makes sense to me.  But number of layers involved in control
path seems completely irrelevant to most people. *That* is an nvidia
thing, something very specific about vfio and vdpa and whatnot.
Nothing to do with the spec, wrong list for this.


> > 
> > Then for sure, I disagree to it for 100% for such a single-minded design.
> > 
> > At least I am trying to propose if a solution can work for generic passthrough where least amount of hypervisor mediation is done.
> > 
> > And an extension where hypervisor has choice to more medication layers as it finds suitable.
> > And if there are technical issues, may be two different interfaces or more admin commands needed for two modes.
> > The idea is to attempt to converge and discuss those details, not the opposite.
> > 
> > Your above comment shows a clear sign of non-collaboration to make both mode works.
> Well, I see you are emotional. Please take a deep breath and calm down,
> be professional,
> and give yourself a break; it is really not necessary to be mad at me.
> 
> As you know, I am just a Junior Engineer at Intel, unlike you, a Senior
> Principal Engineer who has spent many years and
> has developed knowledge in this area. So I am quite technically focused;
> these have all been technical discussions so far.

It looks more like a passive-aggressive flamewar from the side.
So maybe try to see others' point of view. I asked what the advantage of the
admin vq thing is for migration and you said "it's an nvidia thing".
And when people try to point the advantages out to you, you go "well, tough".
Maybe, but we are wasting time here.

> We always welcome collaboration. Remember, Jason has proposed a solution
> to build the admin vq based on these basic facilities,
> and I fully agree with his proposal.

I didn't see anything specific frankly, I can easily see how Parav could
get mad if he posts a reasonably fleshed out patchset (which admittedly,
needs work with wording etc) and instead of review gets back
"rework this on top of these basic facilities which we don't yet know
how they will work but maybe will". We'll be stuck in this loop for
how long?


> > At one point I may probably stop responding to your comments that repeatedly says:
> > 
> > "Go read QEMU code, Do you know what is PASID?, Do you know num_queues, Go read PCI spec"...
> With all respect, you should do these because they are textbook knowledge.
> > 
> > Taking deep breath now to do some productive work in TC...



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07  9:43           ` Zhu, Lingshan
@ 2023-11-07 10:43             ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 10:43 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Tue, Nov 07, 2023 at 05:43:44PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 6:15 PM, Michael S. Tsirkin wrote:
> > On Mon, Nov 06, 2023 at 05:16:43PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/3/2023 10:21 PM, Zhu, Lingshan wrote:
> > > > 
> > > > On 11/3/2023 6:46 PM, Michael S. Tsirkin wrote:
> > > > > On Fri, Nov 03, 2023 at 06:34:37PM +0800, Zhu Lingshan wrote:
> > > > > > +\begin{lstlisting}
> > > > > > +struct virtio_pci_dity_page_track {
> > > > > > +        u8 enable;               /* Read-Write */
> > > > > > +        u8 gra_power;            /* Read-Write */
> > > > > > +        u8 reserved[2];
> > > > > > +        le32 {
> > > > > > +            pasid: 20;           /* Read-Write */
> > > > > > +            reserved: 12;
> > > > > > +        };
> > > > > > +        le64 bitmap_addr;        /* Read-Write */
> > > > > > +        le64 bitmap_length;      /* Read-Write */
> > > > > > +};
> > > > > > +\end{lstlisting}
> > > > > Okay, so it's a simple mailbox in config space.  Which by itself is
> > > > > probably a very reasonable idea - more or less what I suggested.
> > > > > However, using such a generic facility just for the dirty bitmap seems
> > > > > too limited.  Please make it accept arbitrary commands. Reusing admin
> > > > > command structure with a special "device itself" group sounds like one
> > > > > way to do it.
> > > > processing admin cmds in a cap may be too complex and overkill.
> > > > we need to handle variable length of cmds, handle async returned
> > > > results, and so on.
> > > > 
> > > > This struct seems easy and simple. And shall we use platform facilities
> > > > like vt-d
> > > > to track dirty pages?
> > > To demonstrate these issues, suppose we have a struct in a bar to process
> > > admin cmds:
> > > 
> > > struct virtio_admin_cmd {
> > >          u64 in_data_length;
> > >          u8 cmd_in_data[];
> > >          u64 out_data_length;
> > >          u8 cmd_out_data[];
> > >          u8 ret;
> > > };
> > An alternative is do same as you did here, e.g.:
> > struct virtio_admin_cmd {
> >           u64 admin_cmd_pa; /* an out descriptor followed by an in descriptor */
> PA depends on IOMMU and ATS

Not how virtio spec uses it. grep spec for physical address.


> >           u32 pasid : 20;
> > 	 u8 reserved : 11;
> > 	 u8 hardware : 1;
> > };
> > 
> > or we can stick two lengths and addresses straight in the capability.
> Still can only process only one cmd at a time and others are blocked.

Point being? Your patch is the same - thought you wanted simplicity?
We could stick a full queue there for sure - does this seem like
a good idea to you?
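
For illustration, the 32-bit word that follows admin_cmd_pa in the struct
quoted above could be packed with explicit shifts instead of C bit-fields,
whose layout is implementation-defined. This is only a sketch: the helper
names and the bit positions (pasid in bits 0..19, the hardware flag in
bit 31, reserved bits in between) are assumptions, not something the thread
or the spec fixes, and conversion to little-endian wire order is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PASID_BITS     20u
#define PASID_MASK     ((UINT32_C(1) << PASID_BITS) - 1)
#define HARDWARE_SHIFT 31u   /* assumed position of the 1-bit flag */

/* Pack a 20-bit PASID and a 1-bit flag; reserved bits 20..30 stay zero. */
static uint32_t pack_pasid_word(uint32_t pasid, int hardware)
{
    assert(pasid <= PASID_MASK);
    return (pasid & PASID_MASK) |
           ((uint32_t)(hardware != 0) << HARDWARE_SHIFT);
}

/* Extract the PASID from a packed word. */
static uint32_t unpack_pasid(uint32_t word)
{
    return word & PASID_MASK;
}

/* Extract the hardware flag from a packed word. */
static int unpack_hardware(uint32_t word)
{
    return (int)(word >> HARDWARE_SHIFT);
}
```

Explicit masks make the 20 + 11 + 1 split visible and keep the reserved
bits zero regardless of compiler.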

> > 
> > 
> > > The problems are:
> > > 1) command_in_data and command_out data have variable length, so how many HW
> > > resource should be reserved in the bar?
> > actually admin commands are truncated by device so just
> > set to length that device understands.
> How the framework know the length? How can the vendor know how many
> HW resource they should place here? I am not sure it is a good
> idea to guess.

I really don't want to repeat the whole admin command infrastructure
explanation; I thought it was pretty clear. The device knows the length
and truncates. Let's not get started on frameworks please.


> > 
> > > 2) To process the cmds in the bar, the device MAY need to read many
> > > registers in cmd_in_data[] and write many registers in cmd_out_data[],
> > > which can be ineffective, this is not DMA.
> > True. Again if you don't want to depend on pasid that's the
> > only option.
> Yes, this exactly shows how complex and overkill to use
> a cap to handle admin cmds. And maybe admin cmds is not a must.

I don't know. But the patch in question is really just impractical.
We can't implement silly one-off gateways for each new random thing.
You work on this thing so maybe it looks like the most important feature
to you, but virtio works very well for most people, thank you
very much.


> > 
> > 
> > > 3) a bar can only process one cmd at a time, and the driver can only issue
> > > another cmd after received an ret.
> > > This process has to be synchronous IO, one cmd blocks another.
> > Exactly same as what you did though.
> We have implemented a cap in PCI for dirty page tracking, there are no admin
> cmds.

"implemented" as in "posted v1 of a very rough draft", with tradeoffs
like increasing migration payload by a factor of 8. Or try to use
AtomicOps: if you try, I believe you will see that commands can actually
fail, and so you need to report errors.

So maybe it's great for CXL. Should we be able to use platform
capabilities if there?  Of course. This is right in the charter.  Must
we limit ourselves to specific platforms with specific capabilities? If
there are TC members interested in supporting limited platforms without
said capabilities, I don't see why we must.


> And for dirty page tracking, we still believe it is better to use platform
> facilities,
> as Jason explained.

So pls don't waste everyone's time trying to review stuff you don't think can work.


> > 
> > > 4) VF implementing a bar processing admin cmds conflicts with PF's admin vq.
> > So just don't create conflicts. It's same as multiple admin vqs
> > really which we already support.
> How to tell the SW don't create conflict?

It's like a single mutex in software or something no?
We already say 
	It is the responsibility of the driver to ensure
	strict request ordering for commands, because they will be
	consumed with no order constraints. 
seems enough for me.


> > 
> > 
> > > So I think a bar or a cap processing admin cmds is way to complex and
> > > overkill.
> > > 
> > > Thanks
> > Sounds like a straw man argument.
> > 




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06 10:22           ` Michael S. Tsirkin
@ 2023-11-07 10:44             ` Zhu, Lingshan
  2023-11-07 11:29               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-07 10:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/6/2023 6:22 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 06, 2023 at 12:06:39PM +0800, Zhu, Lingshan wrote:
>>>> Intel production work with similar bitmap
>>>> based dirty page tracking solution for years.
>>> and then VMs became bigger and PML was introduced.
>> So you agree we should track dirty pages through the platform facilities?
>> I am glad to hear that!
> I just said that I thought there's no PML in platform facilities and that
> might be a problem. Am I wrong?
That's true, some platforms may not have PML.
The thing is, most platforms have PML, so this is a tradeoff:
1) don't implement dirty page tracking in virtio and use the platform
instead; then some platforms cannot track dirty pages in HW
2) implement dirty page tracking in virtio, but most platforms won't use it.

As you can see, I have posted this virtio dirty page tracking as a backup here,
so this should be your call anyway.
>
>>>> Otherwise the device should report PFN which is not very practical.
>>> Why not?
>> Really? the device report PFN?
>> What can happen if the device keep writing a small piece of memory???
> then you just report the PFN once. Should work like PML really -
> IOW devices maintains a bit per page internally and reports
> PFN when bit is set.
Yes, it is an A/D bit. With a bitmap, when the device keeps writing a small
region of memory, it only needs to mark the bits as dirty once.

When reporting PFNs, the device needs to repeatedly report the same bunch
of PFNs (0x1234abcd..), which is not very efficient.

And we need to merge the device dirty pages into the QEMU dirty page bitmap
anyway.
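
The "report the PFN once" scheme described in the quoted reply (an internal
bit per page, with a PFN emitted only when that bit goes from 0 to 1) can be
sketched as below. This is a toy model under stated assumptions: the
struct pfn_log layout, the NPAGES window size and the log_write() helper are
all hypothetical and do not come from the patch or the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPAGES 1024u   /* assumed size of the tracked window, in pages */

struct pfn_log {
    uint8_t  seen[NPAGES / 8];   /* internal bit per page */
    uint64_t pfns[NPAGES];       /* reported PFNs, each at most once */
    size_t   count;              /* number of valid entries in pfns[] */
};

/* Record a device write to pfn; returns 1 if a new log entry was emitted. */
static int log_write(struct pfn_log *log, uint64_t pfn)
{
    assert(pfn < NPAGES);
    uint8_t mask = (uint8_t)(1u << (pfn % 8));

    if (log->seen[pfn / 8] & mask)
        return 0;                /* already reported, no duplicate entry */

    log->seen[pfn / 8] |= mask;
    log->pfns[log->count++] = pfn;
    return 1;
}
```

With this bookkeeping, repeated writes to the same small region add no new
entries until whoever harvests the log clears the internal bits.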
>
>>>>       And the resolution is apparently 8 pages? You have just multiplied
>>>>       the migration bandwidth by a factor of 8.
>>>>
>>>> No, as described in the comments, the tacking granularity is controlled by \
>>>> field{gra_power}, one bit represents a page with page_size = 2^(12 +
>>>> gra_power). This can also be used to reduce the size of the bitmap.
>>> .. at the cost of increasing migration bandwidth.
>> The device is very likely to write a neighbor page,
> how likely? and e.g. with slab randomization too? please collect some
> data and show it.
DMA writes contiguous memory. Take the example of DMA writing the ring
buffer: it is likely to write a neighboring page. This is called memory locality.
>
>> and this happens
>> everywhere for example CPU read 64 bytes aligned data.
> CPUs don't need to send their cache across a bandwidth constrained
> shared network.
This is not about cacheline size, just saying it is 64bytes aligned.
>
>> This is a tradeoff
> tradeoff between which two options?
1) small tracking granularity and big bitmap
2) big tracking granularity and smaller bitmap(with memory locality)
>
>>>> "To prevent a read-modify-write procedure, if a memory page is dirty,
>>>> optionally the device is permitted to set the entire byte, which encompasses the relevant bit, to 1."
>>>>
>>>> This is optional and DMA is very likely to write a neighbor page, and the device transmit a whole byte anyway
>>>> when a bit is dirty.
>>>>
>>>> How about we use platform dirty page tracking facility then implement this in virtio, as Jason suggested?
>>>>
>>> Without something like PML it likely won't scale either.
>> So that would be platform issue which we don't need to take care of?
>> Intel VT-d can do this for sure.
> Intel VT-d supports PML from the IOMMU? I didn't realize. Could you help
> me find it in the doc please? Which hardware supports this in the field?
> What about other vendors?
Please refer to the Intel VT-d spec:
https://cdrdv2-public.intel.com/774206/vt-directed-io-spec%20.pdf

As far as I know, AMD and ARM support this too.

Anyway, as said above, it should be your call whether to implement dirty
page tracking in virtio.
>
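
The granularity tradeoff argued over above follows directly from the quoted
formula page_size = 2^(12 + gra_power). A minimal sketch of the arithmetic,
assuming one bit per tracked page and a bitmap covering the whole guest
memory (the function names are hypothetical, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes covered by one bitmap bit: page_size = 2^(12 + gra_power). */
static uint64_t tracked_page_size(unsigned gra_power)
{
    assert(gra_power < 52);   /* keep the shift within 64 bits */
    return UINT64_C(1) << (12 + gra_power);
}

/* Bitmap bytes needed to cover mem_bytes of guest memory. */
static uint64_t bitmap_bytes(uint64_t mem_bytes, unsigned gra_power)
{
    uint64_t page = tracked_page_size(gra_power);
    uint64_t bits = (mem_bytes + page - 1) / page;   /* round up to pages */
    return (bits + 7) / 8;                           /* round up to bytes */
}
```

For a 1 GiB guest, gra_power = 0 needs a 32 KiB bitmap and gra_power = 3
needs 4 KiB, but with gra_power = 3 a single set bit marks 32 KiB as dirty.
In the worst case, where only one 4 KiB page in that range was written, that
is the factor of 8 in migration traffic raised in the review.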




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 10:25                     ` Michael S. Tsirkin
@ 2023-11-07 11:12                       ` Parav Pandit
  2023-11-07 11:24                         ` Parav Pandit
  2023-11-07 11:31                         ` [virtio-comment] " Michael S. Tsirkin
  2023-11-08  9:36                       ` Zhu, Lingshan
  1 sibling, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-07 11:12 UTC (permalink / raw)
  To: Michael S. Tsirkin, Zhu, Lingshan
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment

> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Tuesday, November 7, 2023 3:55 PM
> 
> On Tue, Nov 07, 2023 at 06:01:27PM +0800, Zhu, Lingshan wrote:
> >
> >
> > On 11/6/2023 7:13 PM, Parav Pandit wrote:
> > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > Sent: Monday, November 6, 2023 3:04 PM
> > > > So, please no pass-through discussion anymore.
> > > If you comment like this, nothing can progress.
> > >
> > > What you are implying with above language is:
> > > "hey a virtio can do live migration ONLY by creating vdpa device on top of
> ALREADY virtio device, and you get another virtio device by running through 3
> layers of stack you get virtio device on other side!".
> > I never say that right? I keep explaining how pass-through and "trap
> > and emulate work", I even explained how PASID work.
> 
> Parav, Lingshan can we please stop the "what is pass through" arguments?
> 
I am not discussing it at all anymore.
The use case is well defined; I am looking at how we can or cannot have one proposal.

> I think that the term is vague, as
> (almost?) no hypervisor passes all accesses without exemption through.
> And the fact you are speaking past each other on this subject for how long
> now? seems to demonstrate I'm right.
> 
In v3 I acknowledged both the use cases in commit log unlike the other side.

> Describing migration in the spec as opposed to leaving it up to hypervisors
> seems valuable at least to me since historically hypervisors did such a bad job
> of it. 
It is done in v3.

> So I personally feel it's nice if it's there, and the SUSPEND bit only works
> after DRIVER_OK. So that's an example argument that makes sense to me.  
As discussed, suspend is useful and should be controlled by the guest anyway for power management.
The hypervisor is not supposed to use it during LM.

> But
> number of layers involved in control path seems completely irrelevant to most
> people. *That* is an nvidia thing, something very specific about vfio and vdpa
> and whatnot.
> Nothing to do with the spec, wrong list for this.
> 
Certainly, it is not an Nvidia thing.
Multiple device vendors would like to do this and equally users too.
So, I disagree.
I don't want to debate it.

Didn't you ask at start of this email to stop debating on what is passthrough?

> 
> > >
> > > Then for sure, I disagree to it for 100% for such a single-minded design.
> > >
> > > At least I am trying to propose if a solution can work for generic
> passthrough where least amount of hypervisor mediation is done.
> > >
> > > And an extension where hypervisor has choice to more medication layers as
> it finds suitable.
> > > And if there are technical issues, may be two different interfaces or more
> admin commands needed for two modes.
> > > The idea is to attempt to converge and discuss those details, not the
> opposite.
> > >
> > > Your above comment shows a clear sign of non-collaboration to make both
> mode works.
> > Well, I see you are emotional, please take a deep breath and calm
> > down, to be professional, give yourself a break, and really not
> > necessary to be mad at me.
> >
> > As you know I am just a Junior Engineer in Intel, not like you a
> > Senior Principle Engineer who has spent many years and have developed
> > knowledge in this area. So I am quite technical focusing, they are all
> > technical discussions till now.
> 
> It looks more like a passive-agressive flamewar from the side.
> So maybe try to see other's point of view. I asked what's the advantage of
> admin vq thing for migration and you said "it's an nvidia thing".
Huh, really you have to say this?
There are two TC sign-offs on the patches.

Since I don't have the link to previously listed advantages I have to repeat here.

1. admin vq is a must (it is not about an advantage) to support device passthrough to the guest.

Passthrough definition: the following things are not trapped by the hypervisor in one use case.
(a) virtio common and device config space
(b) cvqs for 6 and more device types
(c) hypervisor not involved in mixing PCI specific FLRs with virtio specific logic.

> And when people try to point them out to you, you go well tough.
> Maybe but we are wasting time here.
The only thing Lingshan pointed out is some QoS on the AQ.
He never responded on why he cannot use dirty page tracking.
He never responded on why the device context cannot be used.

> 
> > We always welcome collaboration, remember Jason has proposed a
> > solution to build admin vq based on these basic facilities, and I am
> > fully agree on his proposal.
> 
> I didn't see anything specific frankly, I can easily see how Parav could get mad if
> he posts a reasonably fleshed out patchset (which admittedly, needs work with
> wording etc) and instead of review gets back "rework this on top of these basic
> facilities which we don't yet know how they will work but maybe will". We'll be
> stuck in this loop for how long?
> 
The series from Lingshan clearly does not address the requirements listed in v3.
And he is not open to converge it either.

My humble input is:
1. Accept the two listed use cases, vfio and vdpa, as practical to support for existing stacks
2. Try to converge the two cases, if there is a common virtio spec framework they can use
3. If they can, great, let's use it.
4. If not, the two use cases need different infrastructure, so build two.

Do you have any better suggestions to support both use cases?



^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-06  4:03             ` [virtio-comment] " Parav Pandit
@ 2023-11-07 11:13               ` Michael S. Tsirkin
  2023-11-08  9:29                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 11:13 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 06, 2023 at 04:03:42AM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Sunday, November 5, 2023 9:42 PM
> > 
> > On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
> > > > > [1]
> > > > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> > > > you still need to explain why this does not work for pass-through.
> > > It does not work for following reasons.
> > > 1. Because all the fields that put on the member device are not in direct
> > control of the hypervisor.
> > > The device is directly controlled by the guest including the device status and
> > when it resets the device all the things stored in the device are lost.
> > 
> > I think the idea is that when this gateway is in the device then device reset has
> > to trap. At a high level, ok. But then what?
> > Is a full scan of all memory required until device reset is complete?
> > Drivers currently tend to busy poll the reset register, if this takes very long we
> > might start seeing soft lockup messages. What is the idea then? Maybe for this
> > we need a separate weaker reset that does not touch this capability?
> >
> You meant the gateway is not in the device, right?
> 
> I likely didn't understand. I don't see a relation to timing.
> 
> When the device reset is not trapped by the hypervisor, most things does not work, it requires trapping other things to like cvq, device registers and more.
> It may be fine for those use case, but it does not fullfill the requirement of passthrough mode of hw.

I wish we'd just stop using the term, it just confuses everyone.

I feel the point worth making is that currently, all this job is done
by hypervisors. And they manage fine! vdpa really truly does not need
the SUSPEND bit because it knows about devices and it
can just use whatever it wants in any vendor specific way it wants.

Where all this migration work comes in handy is if we say that
we want our devices to all just do what the
spec says. No vendor specific tricks. And I find it exciting that
there are people who want to work on this instead of
each vendor wasting man hours on their own almost the same but
slightly different driver.

I personally think this patch is not great for the trap use-case either.
Why? For example if device is somewhat slow then it will take it
hundreds of milliseconds to synchronize the whole guest memory, and
blocking reset means blocking e.g. guest boot.  I was wrong about soft
lockup btw - linux does msleep which I think means no soft lockups. But boot is
blocked and modules are not loaded.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 11:12                       ` [virtio-comment] " Parav Pandit
@ 2023-11-07 11:24                         ` Parav Pandit
  2023-11-08  7:11                           ` [virtio-comment] " Jason Wang
  2023-11-07 11:31                         ` [virtio-comment] " Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-07 11:24 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Zhu, Lingshan
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment

> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Parav Pandit
> Sent: Tuesday, November 7, 2023 4:43 PM

> My humble input is:
> 1. Accept the two use cases listed of vfio and vdpa being practical to support
> existing stacks
> 2. Try to converge two cases; if there is common virtio spec framework it can use
> 3. If they can, great lets use it.
> 4. If not, both use cases need different infrastructure, so build two.
> 
> Do you have any better suggestions to support both use cases?

And both use cases can be serviced by proposed admin commands with different hypervisor layers to use them.
If something is missing, we can extend these admin commands further.



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 10:44             ` Zhu, Lingshan
@ 2023-11-07 11:29               ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 11:29 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Tue, Nov 07, 2023 at 06:44:46PM +0800, Zhu, Lingshan wrote:
> > Intel VT-d supports PML from the IOMMU? I didn't realize. Could you help
> > me find it in the doc please? Which hardware supports this in the field?
> > What about other vendors?
> Please refer to intel vt-d spec:
> https://cdrdv2-public.intel.com/774206/vt-directed-io-spec%20.pdf

Yes I know where the VTD doc is. But what chapter do you refer to
when you say that VTD supports page modification log?

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 11:12                       ` [virtio-comment] " Parav Pandit
  2023-11-07 11:24                         ` Parav Pandit
@ 2023-11-07 11:31                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 11:31 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 11:12:47AM +0000, Parav Pandit wrote:
> Didn't you ask at start of this email to stop debating on what is passthrough?

And I'm suggesting some other terms that hopefully will describe the
use-case in generic terms as opposed to whatever is in Linus' tree
as of Nov 2023.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07  9:52                   ` Zhu, Lingshan
@ 2023-11-07 11:33                     ` Michael S. Tsirkin
  2023-11-08  9:30                       ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 11:33 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 05:52:41PM +0800, Zhu, Lingshan wrote:
> > > > > > 2. the PCI FLR is clearing all the registers you exposed here.
> > > > > see above
> > > > > > 3. Endless expansion of config registers of dirty tracking is not
> > > > > > scalable, as they
> > > > > are not init time registers not following the Appendix B guidelines.
> > > > > endless expansion?? It is a complete set of dirty page tracking, right????
> > > > > have you see this cap only controls? The device DMA writes the
> > > > > bitmap, not by registers.
> > > > Device dirty page tracking is start/stop command to be done by the
> > > hypervisor.
> > > > So when guest is resetting the device, it stopped the DMA initiated by the
> > > hypervisor.
> > > > This fundamentally breaks things.
> > > Why? When device resets, do you want to keep tracking dirty pages????
> > Yes, when the device resets, before that event occurred, all the pages which were dirtied, must be migrated.
> > And after reset also new page tracking to continue.
> That depends on whether there is an interrupt for the dirty pages.
> If there is an interrupt, then the guest owns the pages

Not in the virtio model: the guest owns the memory once the buffer has been used.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 10:02                   ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-07 11:36                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 11:36 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 06:02:21PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 7:21 PM, Parav Pandit wrote:
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Monday, November 6, 2023 4:00 PM
> > > 
> > > On Mon, Nov 06, 2023 at 04:34:44AM +0000, Parav Pandit wrote:
> > > > > Do you know there are locked transaction and atomic operations in PCI???
> > > > Can you explain how PCI does RMW locked transaction?
> > > > Is it one TLP or multiple?
> > > Parav what are you asking about here?
> > > pcie supports CAS and Swap which likely
> > > can work for this use-case - these are non posted writes. It's in the pcie spec.
> > PCI spec does not have an atomic OR operation.
> > Lingshan in the above comment suggested some unknown locked transaction and atomic operation.
> > So I was asking him which atomic operation that is and how PCI does it.
> > I don't know of any that can do PCI atomic OR without a workaround.
> 6.15 Atomic Operations (AtomicOps)

There's no atomic OR there. I guess you could use CAS. What are you
going to do if CAS fails? Error out or retry? Retry can fail
indefinitely and then what does the device do? Error out might be ok -
have driver slow down. You need error reporting though which you have
omitted because overkill.
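
To make the CAS question concrete, here is a minimal sketch of emulating an
atomic OR with compare-and-swap and a bounded number of retries. It is modelled
with C11 atomics on a CPU, not with PCIe AtomicOps; the function name
bitmap_or_via_cas and the max_retries parameter are made up for illustration
and do not come from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: OR `bits` into a shared bitmap word using only
 * compare-and-swap. Gives up after `max_retries` attempts.
 * Returns true if the bits are known to be set, false if the caller
 * has to take an error path (slow down, report, ...).
 */
static bool bitmap_or_via_cas(_Atomic uint64_t *word, uint64_t bits,
                              unsigned max_retries)
{
    uint64_t old = atomic_load(word);

    for (unsigned attempt = 0; attempt < max_retries; attempt++) {
        if ((old & bits) == bits)
            return true;        /* already set, nothing to write */
        /* On failure, `old` is refreshed with the current value. */
        if (atomic_compare_exchange_strong(word, &old, old | bits))
            return true;
    }
    /* Retries exhausted: succeed only if someone else set the bits. */
    return (old & bits) == bits;
}
```

The false return is the case discussed above: without some error reporting
channel the caller has nowhere to send it.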

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 10:01                   ` [virtio-comment] " Zhu, Lingshan
  2023-11-07 10:25                     ` Michael S. Tsirkin
@ 2023-11-07 12:00                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-07 12:00 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 06:01:27PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 7:13 PM, Parav Pandit wrote:
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Monday, November 6, 2023 3:04 PM
> > > So, please no pass-through discussion anymore.
> > If you comment like this, nothing can progress.
> > 
> > What you are implying with above language is:
> > "hey a virtio can do live migration ONLY by creating vdpa device on top of ALREADY virtio device, and you get another virtio device by running through 3 layers of stack you get virtio device on other side!".
> I never said that, right? I keep explaining how pass-through and "trap and
> emulate" work; I even explained how PASID works.
> > 
> > Then for sure, I disagree to it for 100% for such a single-minded design.
> > 
> > At least I am trying to propose if a solution can work for generic passthrough where least amount of hypervisor mediation is done.
> > 
> > And an extension where hypervisor has choice to more mediation layers as it finds suitable.
> > And if there are technical issues, maybe two different interfaces or more admin commands are needed for the two modes.
> > The idea is to attempt to converge and discuss those details, not the opposite.
> > 
> > Your above comment shows a clear sign of non-collaboration to make both modes work.
> Well, I see you are emotional, please take a deep breath and calm down, to
> be professional,
> give yourself a break, and really not necessary to be mad at me.

You aren't just going to stop people being mad by telling them not to.
Try to figure out why that happened.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 1/6] virtio: introduce virtqueue state
  2023-11-07  8:22               ` Michael S. Tsirkin
@ 2023-11-08  4:08                 ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-08  4:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/7/2023 4:22 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 04:11:25PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 5:45 PM, Michael S. Tsirkin wrote:
>>> On Mon, Nov 06, 2023 at 05:42:10PM +0800, Zhu, Lingshan wrote:
>>>> On 11/6/2023 5:35 PM, Michael S. Tsirkin wrote:
>>>>> On Fri, Nov 03, 2023 at 10:49:42PM +0800, Zhu, Lingshan wrote:
>>>>>>            +When SUSPEND is set, the device MUST record the Available State of every enabled splited virtqueue
>>>>>>            +in \field{Available State} field,
>>>>>>            +and correspondingly restore the Available State of every enabled splited virtqueue
>>>>>>            +from \field{Available State} field when DRIVER_OK is set.
>>>>>>            +
>>>>>>            +The device SHOULD reset \field{Available State} field upon a device reset.
>>>>>>
>>>>>>        At this point I have no idea
>>>>>>        - how can a state of a virtqueue at a random time be represented
>>>>>>          by a 16 bit integer
>>>>>>
>>>>>> not sure what is a random time, this is to request the device to reset
>>>>>> its avail state, for example, it is "le16 queue_avail_state" in virtio-pci
>>>>>> common cfg. Resetting this so the device will not recover from a wrong value of
>>>>>> the last run.
>>>>> You simply never bother to say what is "Available State" and what
>>>>> does it mean to restore it.  Not to mention words like "splited"
>>>>> which just adds to the confusion.
>>>> It says:
>>>> +The available state field is two bytes of virtqueue state that is used by
>>>> +the device to read the next available buffer. It is presented in the
>>>> following format:
>>>>
>>>> Do you want me to add more descriptions?
>>> maybe start with an example
>> I think they are already in the spec, I can add:
>> see also "2.7.6 The Virtqueue Available Ring" and "2.7.13.1 Placing Buffers
>> Into The Descriptor Table"
>>>>>>        - if it's not at a random time then why do you even need an integer -
>>>>>>          synchronize queue to memory and then all state is in memory
>>>>>>
>>>>>> Not sure what is a sync queue, but for example, "le16 queue_avail_state" for
>>>>>> PCI transport exists in a cap.
>>>>> I just point out that normally a lot of ring state is in memory.
>>>>> So you need to be much more specific about how you are augmenting that.
>>>>> For example, if buffers are used exactly in order for a split ring
>>>>> then used index seems to be exactly the same as last available index
>>>>> you describe - it's a free running counter. OTOH if they are not
>>>>> used in order then I don't see how is a single index sufficient to
>>>>> describe which ones have been used and which not.
>>>> I am not sure I get it.
>>>>
>>>> Used idx(not like packed vq, no over-writing descriptors) and other states
>>>> are in guest memory, so migrated with guest migration.
>>> yes and so? why is that not enough and what is this available state then?
>> The spec has illustrated how available index work and has given an
>> example(see above cited sections)
>> And this patch even has given a more clear description for it.
>>
>> Other states are in guest memory and migrated with guest memory.
> Yea I wrote large parts of it and I know how the available index works.
>
> And sorry no idea what you are talking about.
>
> At any time, there can be up to 2^16 buffers that have been made
> available, and a random subset of these have been used. There is no
> chance in the world a single 16 bit index describes even that part of
> state, never mind device type specific processing that might be going
> on.
>
> As a wild guess this proposal is making a bunch of unstated assumptions
> about device being in a very specific state where this *is* possible.
> For people to be able to implement devices and drivers these
> need to be spelled out.
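
The objection above can be illustrated with a small sketch, under loud
simplifications: rings hold bare head indices (the real used ring holds
id/len pairs), both counters start at 0, and no head is reused. QUEUE_SIZE and
inflight_heads are hypothetical names, not from the patch. The sketch shows
that two devices with identical avail and used counters can have different
sets of buffers still in flight.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_SIZE 8u /* illustrative queue size, not from the patch */

/*
 * Hypothetical helper: work out which descriptor heads are still in flight.
 * Assumes at most QUEUE_SIZE buffers were ever posted, so no ring slot
 * or head index has been reused. Returns the number of in-flight heads.
 */
static unsigned inflight_heads(const uint16_t *avail_ring, uint16_t avail_idx,
                               const uint16_t *used_ring, uint16_t used_idx,
                               bool inflight[QUEUE_SIZE])
{
    unsigned count = 0;

    for (unsigned i = 0; i < QUEUE_SIZE; i++)
        inflight[i] = false;

    /* Every head the driver made available starts out in flight ... */
    for (uint16_t i = 0; i < avail_idx; i++)
        inflight[avail_ring[i % QUEUE_SIZE] % QUEUE_SIZE] = true;

    /* ... until the device reports it in the used ring. */
    for (uint16_t i = 0; i < used_idx; i++)
        inflight[used_ring[i % QUEUE_SIZE] % QUEUE_SIZE] = false;

    for (unsigned i = 0; i < QUEUE_SIZE; i++)
        count += inflight[i];

    return count;
}
```

With heads 0..3 made available and two of them used, a used ring of {0, 1}
leaves heads 2 and 3 in flight, while a used ring of {3, 1} leaves heads 0
and 2 in flight. The counters are 4 and 2 in both cases.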
Thanks for your advice, I may need more hints to improve this patch.

If it is about the case where _F_IN_ORDER is not negotiated, I found this
section in the spec:

Some devices always use descriptors in the same order in which they have
been made available. These devices can offer the VIRTIO_F_IN_ORDER feature.
If negotiated, this knowledge allows devices to notify the use of a batch
of buffers to the driver by only writing out a single used ring entry with
the id corresponding to the head entry of the descriptor chain describing
the last buffer in the batch. The device then skips forward in the ring
according to the size of the batch. Accordingly, it increments the used idx
by the size of the batch.

This section implies that if _F_IN_ORDER is not negotiated, the device may
not be able to process the descriptors in order, and thus may not write only
one used_idx for a batch of buffers. This is about how buffers are marked
used, and used_idx is in guest memory. If the device selectively finishes
processing some descriptors, then maybe it just marks them done one by one
rather than batching.

Here I see there are two kinds of vq state: on the device or in guest
memory. So this series migrates the on-device state explicitly, and the rest
is migrated with guest memory.

Can you be more specific about which parameters of a vq I should address in
this patch?

Thanks
>
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 11:24                         ` Parav Pandit
@ 2023-11-08  7:11                           ` Jason Wang
  2023-11-08  7:16                             ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-08  7:11 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan, eperezma, cohuck, stefanha,
	virtio-comment

On Tue, Nov 7, 2023 at 7:24 PM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org> On Behalf Of Parav Pandit
> > Sent: Tuesday, November 7, 2023 4:43 PM
>
> > My humble input is:
> > 1. Accept the two use cases listed of vfio and vdpa being practical to support existing stacks
> > 2. Try to converge two cases; if there is common virtio spec framework it can use
> > 3. If they can, great lets use it.

That's my point as well.

> > 4. If not, both use cases need different infrastructure, so build two.
> >
> > Do you have any better suggestions to support both use cases?
>
> And both use cases can be serviced by proposed admin commands with different hypervisor layers to use them.

I think it still has some open questions that need to be answered.

> If something is missing, we can extend these admin commands further.
>

Thanks




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-08  7:11                           ` [virtio-comment] " Jason Wang
@ 2023-11-08  7:16                             ` Parav Pandit
  0 siblings, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-08  7:16 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Zhu, Lingshan, eperezma, cohuck, stefanha,
	virtio-comment

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 8, 2023 12:42 PM
> 
> On Tue, Nov 7, 2023 at 7:24 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis-open.org> On Behalf Of Parav Pandit
> > > Sent: Tuesday, November 7, 2023 4:43 PM
> >
> > > My humble input is:
> > > 1. Accept the two use cases listed of vfio and vdpa being practical
> > > to support existing stacks
> > > 2. Try to converge two cases; if there is common virtio spec framework it can use
> > > 3. If they can, great lets use it.
> 
> That's my point as well.
> 
> > > 4. If not, both use cases need different infrastructure, so build two.
> > >
> > > Do you have any better suggestions to support both use cases?
> >
> > > And both use cases can be serviced by proposed admin commands with different hypervisor layers to use them.
> 
> I think it still has some open questions that need to be answered.

Yes. I will get to it tomorrow on 11/9 or today.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-07  9:24                     ` Zhu, Lingshan
@ 2023-11-08  7:42                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08  7:42 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 05:24:44PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/7/2023 4:33 PM, Michael S. Tsirkin wrote:
> > On Tue, Nov 07, 2023 at 04:21:13PM +0800, Zhu, Lingshan wrote:
> > > > > This can work, right?
> > > > Unfortunately no, as non atomic bitmap cannot reside in the host memory,
> > > as explained before, PCI and CPU supports atomic read/write. Please refer to
> > > PCI spec and CPU ISA.
> > I don't see how atomic read or write does anything useful here but maybe.
> Because the device writes the bitmap and the driver does "read and clear" on
> the bitmap, so the ops need to be atomic, or they can run into a race.
> > You need to explain how you are using atomics in your proposal then.
> Not sure we should talk about much of how atomic works, as explained above
> the operations should be atomic to avoid race conditions or losing
> information.
> Like:
> 
> 1) Driver Read
> 2) Device Write
> 3) Driver Clear
> 
> Here we lost the bitmap information.

That's an unusual use of the term "race condition". But yes, you need
to spell out how driver and device interact.
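
As one possible way to spell that interaction out, here is a rough sketch of
the lost-update problem and of a read-and-clear done as a single atomic swap.
It is a CPU-side model using C11 atomics; all function names are hypothetical,
and the device side uses atomic_fetch_or only for brevity, since the thread
notes that PCIe AtomicOps has no OR and a device would need CAS or Swap
instead.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Device side (modelled on the CPU): mark pages dirty in one bitmap word. */
static void device_mark_dirty(_Atomic uint64_t *word, uint64_t bits)
{
    atomic_fetch_or(word, bits);
}

/*
 * Driver side: fetch and clear in one atomic step, so a device write
 * cannot land between the read and the clear.
 */
static uint64_t driver_read_and_clear(_Atomic uint64_t *word)
{
    return atomic_exchange(word, 0);
}

/*
 * Racy variant, for contrast. `injected` simulates a device write that
 * lands in the window between the driver's read and the driver's clear.
 * The injected bits are wiped without ever being reported.
 */
static uint64_t racy_read_then_clear(_Atomic uint64_t *word, uint64_t injected)
{
    uint64_t seen = atomic_load(word);  /* driver read  */

    device_mark_dirty(word, injected);  /* device write */
    atomic_store(word, 0);              /* driver clear */
    return seen;
}
```

In the racy variant the driver returns only the bits it saw before the device
write, and the word ends up zero, so the page marked in the window is never
migrated.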

> > 
> > 
> > > > And whatever is in the device gets reset on device reset and/or FLR. So the dirty map detail is lost.
> > > > Similarly the device context is also lost on these two events triggered by guest.
> > > we explained before, when reset, the device should clear everything.
> > then migration will corrupt memory. Not great.
> I think when reset, the device should clear everything, therefore the driver
> should clear the legacy data as well, don't know how corrupt

If you write data in memory CPU will observe it. If you then
migrate the CPU but not the memory then CPU and memory state are
inconsistent. I am surprised I need to say that, maybe I misunderstand
the question.

> > 
> > 
> > 
> > > > > > > As you can see, the dirty page tracking facility has a PASID for
> > > > > > > isolation. But still, the question is, we should better use platform
> > > > > > > dirty page tracking
> > > > > > > 
> > > > > > Nothing to do with PASID, as PASID is owned by the guest.
> > > > > It looks you don't know how PASID work.
> > > > > Host can setup PASID to isolate some facilities, right?
> > > > There are few limitations with PASID.
> > > > a. All platforms do not have PASID and
> > > As we have explained for many times, this is a basic facility,
> > > and the implementation is transport-specific.
> > > 
> > > We given an example of PCI implementation, and PCI support PASID, right?
> > Yes it's a limitation but maybe one we can live with
> > for this feature.  It does mean that we might need solutions
> > for systems without this support. virtio use is not limited
> > to servers or high end systems.
> PASID has been introduced years ago and I know some vendors implemented
> onboard IOMMU can also do isolating.

Introduced yes but when was it actually implemented? Do you know?

> 
> And this is a basic facility, the implementation is transport specific.

That's why if no one wants to support systems without PASID
this is, maybe, ok. But we know there are people who want this.

> > 
> > 
> > > > b. I explained above PASID do not work always as PASID only bifurcates DMA not the device _functionality_.
> > > With a PASID, a cap can be considered to be placed in another logical
> > > address space, which is not accessible to the guest.
> > > > c. PASID to be available to guest as_is what is present on the device
> > > host hypervisor sets the PASID, transparent to the guest.
> > Lingshan whenever people ask you a ton of questions in response to
> > your spec proposal then the response should not be to simply
> > answer on the mailing list and then repost without a lot of changes
> > since spec readers will likely have questions exactly like these
> > and we can not make them go and read this flame war.
> Well, I should say, I have repeated the same answers for too many times.

Don't. Amend the spec proposal instead so readers don't have these
questions.

> > And frankly, most of this TC stopped following this thread a while ago,
> > it seems to be going nowhere.
> I still believe we should release the best quality
> of spec as we can.
> > The response should be to add the explanation in the spec.
> > Look at Parav's live migration proposals with "theory of operation" chapters
> > for an example of how this can be done.
> I am not sure we should talk how PCI work in virtio spec.
> But I can add "pasid for isolation"
> 
> These facilities are not only used for live migration,
> can also work for debugging. Like suspend then read vq state.

Maybe. Then you need to document what this state is.


> I can add more explanation in the cover letter




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 11:13               ` [virtio-comment] " Michael S. Tsirkin
@ 2023-11-08  9:29                 ` Zhu, Lingshan
  2023-11-08 17:18                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-08  9:29 UTC (permalink / raw)
  To: Michael S. Tsirkin, Parav Pandit
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/7/2023 7:13 PM, Michael S. Tsirkin wrote:
> On Mon, Nov 06, 2023 at 04:03:42AM +0000, Parav Pandit wrote:
>>
>>> From: Michael S. Tsirkin <mst@redhat.com>
>>> Sent: Sunday, November 5, 2023 9:42 PM
>>>
>>> On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
>>>>>> [1]
>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>>>> you still need to explain why this does not work for pass-through.
>>>> It does not work for following reasons.
>>>> 1. Because all the fields that put on the member device are not in direct
>>> control of the hypervisor.
>>>> The device is directly controlled by the guest including the device status and
>>> when it resets the device all the things stored in the device are lost.
>>>
>>> I think the idea is that when this gateway is in the device then device reset has
>>> to trap. At a high level, ok. But then what?
>>> Is a full scan of all memory required until device reset is complete?
>>> Drivers currently tend to busy poll the reset register, if this takes very long we
>>> might start seeing soft lockup messages. What is the idea then? Maybe for this
>>> we need a separate weaker reset that does not touch this capability?
>>>
>> You meant the gateway is not in the device, right?
>>
>> I likely didn't understand. I don't see a relation to timing.
>>
>> When the device reset is not trapped by the hypervisor, most things do not work; it requires trapping other things too, like cvq, device registers and more.
>> It may be fine for those use cases, but it does not fulfill the requirement of passthrough mode of hw.
> I wish we'd just stop using the term, it just confuses everyone.
>
> I feel the point worth making is that currently, all this job is done
> by hypervisors. And they manage fine! vdpa really truly does not need
> the SUSPEND bit because it knows about devices and it
> can just use whatever it wants in any vendor specific way it wants.
So true, this is exactly what Intel implements in some products.
>
> where all this migration work comes handy is if we say that
> we want our device to all just do what the
> spec says. No vendor specific tricks. And I find it exciting that
> there are people who want to work on this instead of
> each vendor wasting man hours on their own almost the same but
> slightly different driver.
I agree
>
> I personally think this patch is not great for the trap use-case either.
> Why? For example if device is somewhat slow then it will take it
> hundreds of milliseconds to synchronize the whole guest memory, and
> blocking reset means blocking e.g. guest boot.  I was wrong about soft
> lockup btw - linux does msleep which I think means no soft lockups. But boot is
> blocked and modules are not loaded.
I am not sure SUSPEND can block RESET. I think reset can take effect
immediately, because once the device is reset, whether it was suspended
does not matter.
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 11:33                     ` Michael S. Tsirkin
@ 2023-11-08  9:30                       ` Zhu, Lingshan
  2023-11-08 17:19                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-08  9:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/7/2023 7:33 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 05:52:41PM +0800, Zhu, Lingshan wrote:
>>>>>>> 2. the PCI FLR is clearing all the registers you exposed here.
>>>>>> see above
>>>>>>> 3. Endless expansion of config registers of dirty tracking is not
>>>>>>> scalable, as they
>>>>>> are not init time registers not following the Appendix B guidelines.
>>>>>> endless expansion?? It is a complete set of dirty page tracking, right????
>>>>>> have you see this cap only controls? The device DMA writes the
>>>>>> bitmap, not by registers.
>>>>> Device dirty page tracking is start/stop command to be done by the
>>>> hypervisor.
>>>>> So when guest is resetting the device, it stopped the DMA initiated by the
>>>> hypervisor.
>>>>> This fundamentally breaks things.
>>>> Why? When device resets, do you want to keep tracking dirty pages????
>>> Yes, when the device resets, before that event occurred, all the pages which were dirtied, must be migrated.
>>> And after reset also new page tracking to continue.
>> That depends on whether there is an interrupt for the dirty pages.
>> If there is an interrupt, then the guest owns the pages
> Not in the virtio model, guest owns the memory once buffer has been used.
Yes, and even better, the interrupt happens after buffers are marked as used.
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-07 10:25                     ` Michael S. Tsirkin
  2023-11-07 11:12                       ` [virtio-comment] " Parav Pandit
@ 2023-11-08  9:36                       ` Zhu, Lingshan
  1 sibling, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-08  9:36 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/7/2023 6:25 PM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 06:01:27PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 7:13 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 6, 2023 3:04 PM
>>>> So, please no pass-through discussion anymore.
>>> If you comment like this, nothing can progress.
>>>
>>> What you are implying with above language is:
>>> "hey a virtio can do live migration ONLY by creating vdpa device on top of ALREADY virtio device, and you get another virtio device by running through 3 layers of stack you get virtio device on other side!".
>> I never said that, right? I keep explaining how pass-through and "trap and
>> emulate" work; I even explained how PASID works.
> Parav, Lingshan can we please stop the "what is pass through" arguments?
>
> I think that the term is vague, as
> (almost?) no hypervisor passes all accesses without exception through.
> And the fact you are speaking past each other on this subject
> for how long now? seems to demonstrate I'm right.
>
> Describing migration in the spec as opposed to leaving it up to
> hypervisors seems valuable at least to me since historically hypervisors
> did such a bad job of it. So I personally feel it's nice if it's there,
> and the SUSPEND bit only works after DRIVER_OK. So that's an example
> argument that makes sense to me.  But number of layers involved in control
> path seems completely irrelevant to most people. *That* is an nvidia
> thing, something very specific about vfio and vdpa and whatnot.
> Nothing to do with the spec, wrong list for this.
I agree and thanks Michael for your help.
It's OK for me to let SUSPEND only work after DRIVER_OK.
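
A minimal sketch of what "SUSPEND only works after DRIVER_OK" could look like
as a status check. DRIVER_OK = 4 is the value from the virtio spec; the
SUSPEND bit value 16 and both function names are assumptions made up for this
example, not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK 4u              /* value from the virtio spec */
#define VIRTIO_STATUS_SUSPEND_HYPOTHETICAL 16u  /* assumed value, illustration only */

/* SUSPEND is only meaningful once the driver has set DRIVER_OK. */
static bool suspend_allowed(uint8_t status)
{
    return (status & VIRTIO_STATUS_DRIVER_OK) != 0;
}

/*
 * Hypothetical device-side handling of a status write that requests
 * SUSPEND: the bit is accepted only after DRIVER_OK, otherwise the
 * status is left unchanged.
 */
static bool try_set_suspend(uint8_t *status)
{
    if (!suspend_allowed(*status))
        return false;
    *status |= VIRTIO_STATUS_SUSPEND_HYPOTHETICAL;
    return true;
}
```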
>
>
>>> Then for sure, I disagree to it for 100% for such a single-minded design.
>>>
>>> At least I am trying to propose if a solution can work for generic passthrough where least amount of hypervisor mediation is done.
>>>
>>> And an extension where hypervisor has choice to more mediation layers as it finds suitable.
>>> And if there are technical issues, maybe two different interfaces or more admin commands are needed for the two modes.
>>> The idea is to attempt to converge and discuss those details, not the opposite.
>>>
>>> Your above comment shows a clear sign of non-collaboration to make both modes work.
>> Well, I see you are emotional, please take a deep breath and calm down, to
>> be professional,
>> give yourself a break, and really not necessary to be mad at me.
>>
>> As you know I am just a Junior Engineer in Intel, not like you a Senior
>> Principal Engineer who has spent many years and
>> has developed knowledge in this area. So I am quite technically focused;
>> they are all technical discussions till now.
> It looks more like a passive-aggressive flamewar from the side.
> So maybe try to see other's point of view. I asked what's the advantage of
> admin vq thing for migration and you said "it's an nvidia thing".
> And when people try to point them out to you, you go well tough.
> Maybe but we are wasting time here.
yeah, sorry for that, lets focus on technical opens, get things done,
release better quality of work as we try our best.
>
>> We always welcome collaboration, remember Jason has proposed a solution to
>> build admin vq based on these basic facilities,
>> and I fully agree with his proposal.
> I didn't see anything specific frankly, I can easily see how Parav could
> get mad if he posts a reasonably fleshed out patchset (which admittedly,
> needs work with wording etc) and instead of review gets back
> "rework this on top of these basic facilities which we don't yet know
> how they will work but maybe will". We'll be stuck in this loop for
> how long?
OK, I did not copy-paste Jason's proposal there.
Then if needed, let's try to work out another approach.
>
>
>>> At one point I may probably stop responding to your comments that repeatedly says:
>>>
>>> "Go read QEMU code, Do you know what is PASID?, Do you know num_queues, Go read PCI spec"...
>> With all respect, you should do these because they are text book knowledge.
>>> Taking deep breath now to do some productive work in TC...
>



* Re: [virtio-comment] Re: [PATCH V2 0/6] introduce basic facilities for virito live migration
  2023-11-07  8:01 ` [virtio-comment] Re: [PATCH V2 0/6] introduce basic facilities for virito live migration Michael S. Tsirkin
@ 2023-11-08 10:19   ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-08 10:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/7/2023 4:01 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:31PM +0800, Zhu Lingshan wrote:
>> This series introduces basic facilities to support
>> virtio live migration, includes:
>>
>> 1)a new SUSPEND bit in the device status
>> Which is used to suspend the device, so that the device states
>> and virtqueue states are stabilized.
>>
>> 2)virtqueue state and its accessor, to get and set last_avail_idx
>> and last_used_idx of virtqueues.
>>
>> 3)dirty page tracking
>
> So looking at this from 100ft:
> - SUSPEND bit looks like something that might have value as a generic
>    component. For example, maybe for NUMA balancing we could suspend,
>    quickly copy ring to a different node and resume.  However current
>    restrictions make it very limited, e.g.  apparently you can't change
>    config space while suspended.
Maybe we don't need to change the source-side config space.
SUSPEND the device to stabilize the device config,
so that the hypervisor can fetch reliable device state;
then, if any changes are required, just make the modifications
at the destination side before setting SUSPEND there.
>    As another example, changing config while suspended might be
>    needed e.g. for net announcements.
I think the link announce should happen after the device is back alive, not
before.
>    Also, do we want to suspend individual
>    queues then? what exactly happens with config changes while suspended
>    that would happen otherwise is also unclear. Also as is, proposal is
>    very light on detail. Other patches in the series make it look like
>    there are more assumptions made about e.g. how vq enters the
>    suspended state.
Not sure we need an interface to suspend an individual vq; SUSPEND suspends
all vqs.

I am not sure I fully understand you. If you find anything I should add, or
have any suggestions, please let me know. I should provide more details in
the cover letter for sure; I will add the live migration process to the V3
cover letter.
>
> - virtqueue state proposal looks very vague. A couple of 16 bit indices
>    are insufficient to fully describe internal vq state at an arbitrary
>    time. Some assumptions seem to be made that make this possible and
>    yes, these would need to be stated and/or lifted.
>    Preferably lifted since another use-case proposed was debugging -
>    you do not, while debugging, want to depend on device following
>    a complex set of assumptions.
I see there are two kinds of vq state:
1) On the device: the device-internal state.
As I see it, this is the avail idx, the used idx and the in-flight descriptors.
2) State in guest memory. This part migrates with guest memory.

I may be missing something; please let me know what I should add to the vq
state, and I can improve it.
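To make the point about in-flight descriptors concrete, here is a minimal sketch in C, assuming a split ring where both indices are free-running 16-bit counters that start at 0. The struct and function names are hypothetical and are not taken from the series. It only shows that the pair of indices is a complete description when their difference is zero, i.e. when nothing is in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two indices discussed in this series. */
struct vq_state {
    uint16_t last_avail_idx; /* next avail ring entry the device will read */
    uint16_t last_used_idx;  /* next used ring entry the device will write */
};

/* Buffers fetched by the device but not yet marked used.
 * Unsigned 16-bit subtraction keeps the result correct across wraparound. */
static uint16_t vq_in_flight(const struct vq_state *s)
{
    return (uint16_t)(s->last_avail_idx - s->last_used_idx);
}

/* In this model the two indices are sufficient only when nothing is in flight. */
static bool vq_state_is_complete(const struct vq_state *s)
{
    return vq_in_flight(s) == 0;
}
```

If the device is captured at an arbitrary time, vq_in_flight() can be non-zero, and the identity of those buffers would have to be recorded somewhere else.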
>    
> - dirty page tracking as described does not seem practical for
>    many systems.  increasing page size x8 is just being nasty
>    towards other network users. CAS + retry could be a solution,
>    but this needs to be documented thoroughly then and it appears this is not what author expects to implement
>    anyway - instead, there's an assumption that platform itself
>    will support dirty tracking. By itself, this is not
>    an impossible assumption - will possibly result in a cheaper,
>    slower device. why not have an option like this?
>    I would probably just drop it from this proposal completely.
>    Also, tracking memory on the device means we'll lose state
>    around reset. Solving that could be tricky. Finally,
>    dependence on PASID can not be removed apparently.
>    So maybe, people who want to track memory changes on the
>    device itself should just bite the bullet and use
>    admin vq in the PF.
>
>
>
>



* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-08  9:29                 ` Zhu, Lingshan
@ 2023-11-08 17:18                   ` Michael S. Tsirkin
  2023-11-09 10:29                     ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:18 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 08, 2023 at 05:29:00PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/7/2023 7:13 PM, Michael S. Tsirkin wrote:
> > On Mon, Nov 06, 2023 at 04:03:42AM +0000, Parav Pandit wrote:
> > > 
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Sunday, November 5, 2023 9:42 PM
> > > > 
> > > > On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
> > > > > > > [1]
> > > > > > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg004
> > > > > > > 75.h
> > > > > > > tml
> > > > > > you still need to explain why this does not work for pass-through.
> > > > > It does not work for following reasons.
> > > > > 1. Because all the fields that put on the member device are not in direct
> > > > control of the hypervisor.
> > > > > The device is directly controlled by the guest including the device status and
> > > > when it resets the device all the things stored in the device are lost.
> > > > 
> > > > I think the idea is that when this gateway is in the device then device reset has
> > > > to trap. At a high level, ok. But then what?
> > > > Is a full scan of all memory required until device reset is complete?
> > > > Drivers currently tend to busy poll the reset register, if this takes very long we
> > > > might start seeing soft lockup messages. What is the idea then? Maybe for this
> > > > we need a separate weaker reset that does not touch this capability?
> > > > 
> > > You meant the gateway is not in the device, right?
> > > 
> > > I likely didn't understand. I don't see a relation to timing.
> > > 
> > > When the device reset is not trapped by the hypervisor, most things do not work; it requires trapping other things too, like cvq, device registers and more.
> > > It may be fine for those use cases, but it does not fulfill the requirement of passthrough mode of hw.
> > I wish we'd just stop using the term, it just confuses everyone.
> > 
> > I feel the point worth making is that currently, all this job is done
> > by hypervisors. And they manage fine! vdpa really truly does not need
> > the SUSPEND bit because it knows about devices and it
> > can just use whatever it wants in any vendor specific way it wants.
> So true, this is exactly what Intel implements in some products.
> > 
> > where all this migration work comes handy is if we say that
> > we want our device to all just do what the
> > spec says. No vendor specific tricks. And I find it exciting that
> > there are people who want to work on this instead of
> > each vendor wasting man hours on their own almost the same but
> > slightly different driver.
> I agree
> > 
> > I personally think this patch is not great for the trap use-case either.
> > Why? For example if device is somewhat slow then it will take it
> > hundreds of milliseconds to synchronize the whole guest memory, and
> > blocking reset means blocking e.g. guest boot.  I was wrong about soft
> > lockup btw - linux does msleep which I think means no soft lockups. But boot is
> > blocked and modules are not loaded.
> I am not sure SUSPEND can block RESET. I think reset can take effect
> immediately, because once the device is reset, whether it was suspended
> does not matter.

No, because if you don't suspend device will keep changing memory.
You need to
1. suspend
2. get all dirty memory synced
3. reset


Reset earlier will corrupt guest memory.
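A minimal sketch of that ordering in C, assuming a purely hypothetical software model of the device (struct model_dev, a fixed 64-page guest, and a per-page dirty flag are all made up for illustration; only the SUSPEND value 16 comes from patch 2/6). It is not how any real device or hypervisor implements this, it only shows why the dirty set has to be collected between suspend and reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define STATUS_SUSPEND 16u /* value proposed in patch 2/6 */
#define NUM_PAGES 64u      /* hypothetical guest size for this model */

/* Hypothetical software model of a device; not a real virtio API. */
struct model_dev {
    uint8_t status;
    bool dirty[NUM_PAGES]; /* pages written by the device, not yet reported */
};

/* The device writes guest memory only while it is not suspended. */
static bool model_dev_dma_write(struct model_dev *d, unsigned page)
{
    if ((d->status & STATUS_SUSPEND) || page >= NUM_PAGES)
        return false;
    d->dirty[page] = true;
    return true;
}

/* Hypervisor side: copy out the dirty set, clear it, return the page count. */
static unsigned sync_dirty(struct model_dev *d, bool out[NUM_PAGES])
{
    unsigned n = 0;
    for (unsigned i = 0; i < NUM_PAGES; i++) {
        if (d->dirty[i]) {
            out[i] = true;
            n++;
        }
        d->dirty[i] = false;
    }
    return n;
}

/* Trapped reset: 1. suspend, 2. sync dirty memory, 3. reset. */
static unsigned trapped_reset(struct model_dev *d, bool out[NUM_PAGES])
{
    d->status |= STATUS_SUSPEND;     /* 1. device stops changing memory */
    unsigned n = sync_dirty(d, out); /* 2. the dirty set is now stable */
    memset(d, 0, sizeof(*d));        /* 3. reset clears all device state */
    return n;
}
```

In this model, running step 3 before step 2 would zero the dirty flags and the pages written before the reset would never be reported.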


> > 



* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-08  9:30                       ` Zhu, Lingshan
@ 2023-11-08 17:19                         ` Michael S. Tsirkin
  2023-11-09 10:34                           ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:19 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 08, 2023 at 05:30:02PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/7/2023 7:33 PM, Michael S. Tsirkin wrote:
> > On Tue, Nov 07, 2023 at 05:52:41PM +0800, Zhu, Lingshan wrote:
> > > > > > > > 2. the PCI FLR is clearing all the registers you exposed here.
> > > > > > > see above
> > > > > > > > 3. Endless expansion of config registers of dirty tracking is not
> > > > > > > > scalable, as they
> > > > > > > are not init time registers not following the Appendix B guidelines.
> > > > > > > endless expansion?? It is a complete set of dirty page tracking, right????
> > > > > > > have you see this cap only controls? The device DMA writes the
> > > > > > > bitmap, not by registers.
> > > > > > Device dirty page tracking is start/stop command to be done by the
> > > > > hypervisor.
> > > > > > So when guest is resetting the device, it stopped the DMA initiated by the
> > > > > hypervisor.
> > > > > > This fundamentally breaks things.
> > > > > Why? When device resets, do you want to keep tracking dirty pages????
> > > > Yes, when the device resets, before that event occurred, all the pages which were dirtied, must be migrated.
> > > > And after reset also new page tracking to continue.
> > > That depends on whether there is an interrupt for the dirty pages.
> > > If there is an interrupt, then the guest owns the pages
> > Not in the virtio model, guest owns the memory once buffer has been used.
> Yes and even better, interrupt happens after buffers marked as used.

But guest owns memory earlier and you can not change it after this
point.



* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-07  9:31                 ` Zhu, Lingshan
@ 2023-11-08 17:44                   ` Michael S. Tsirkin
  2023-11-09 10:00                     ` Zhu, Lingshan
  2023-11-09  6:28                   ` Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:44 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Monday, November 6, 2023 2:57 PM
> > > 
> > > On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Monday, November 6, 2023 9:01 AM
> > > > > 
> > > > > On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
> > > > > > > Sent: Friday, November 3, 2023 8:27 PM
> > > > > > > 
> > > > > > > On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > > > > > > > From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > Sent: Friday, November 3, 2023 4:05 PM
> > > > > > > > > 
> > > > > > > > > This patch adds two new le16 fields to common configuration
> > > > > > > > > structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > ---
> > > > > > > > >      transport-pci.tex | 18 ++++++++++++++++++
> > > > > > > > >      1 file changed, 18 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/transport-pci.tex b/transport-pci.tex index
> > > > > > > > > a5c6719..3161519 100644
> > > > > > > > > --- a/transport-pci.tex
> > > > > > > > > +++ b/transport-pci.tex
> > > > > > > > > @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> > > > > structure
> > > > > > > > > layout}\label{sec:Virtio Transport
> > > > > > > > >              /* About the administration virtqueue. */
> > > > > > > > >              le16 admin_queue_index;         /* read-only for driver */
> > > > > > > > >              le16 admin_queue_num;         /* read-only for driver */
> > > > > > > > > +
> > > > > > > > > +	/* Virtqueue state */
> > > > > > > > > +        le16 queue_avail_state;         /* read-write */
> > > > > > > > > +        le16 queue_used_state;          /* read-write */
> > > > > > > > This tiny interface for 128 virtio net queues through register
> > > > > > > > read writes, does
> > > > > > > not work effectively.
> > > > > > > > There are inflight out of order descriptors for block also.
> > > > > > > > Hence toy registers like this do not work.
> > > > > > > Do you know there is a queue_select? Why this does not work? Do you
> > > > > > > know how other queue related fields work?
> > > > > > :)
> > > > > > Yes. If you notice queue_reset related critical spec bug fix was
> > > > > > done when it
> > > > > was introduced so that live migration can _actually_ work.
> > > > > > When queue_select is done for 128 queues serially, it take a lot of
> > > > > > time to
> > > > > read those slow register interface for this + inflight descriptors + more.
> > > > > interesting, virtio work in this pattern for many years, right?
> > > > All these years 400Gbps and 800Gbps virtio was not present, number of
> > > queues were not in hw.
> > > The registers are control path in config space, how 400G or 800G affect??
> > Because those are the one in practice requires large number of VQs.
> > 
> > You are asking per VQ register commands to modify things dynamically via this one vq at a time, serializing all the operations.
> > It does not scale well with high q count.
> This is not dynamic; it only happens on SUSPEND and RESUME.
> This is the same mechanism virtio uses to initialize a virtqueue, and it has
> worked for many years.

I wish we just had a transport vq already. That's the way to solve this,
not fighting individual bits.
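For reference, the access pattern being argued about can be sketched as follows. This is a hypothetical C model, not driver code: struct model_common_cfg and the helper names are invented, the two per-queue fields only borrow the names queue_avail_state and queue_used_state from patch 4/6, and reg_accesses simply counts simulated register reads and writes.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NUM_QUEUES 128u /* queue count used as the example in this thread */

/* Hypothetical model of the relevant common config fields. */
struct model_common_cfg {
    uint16_t queue_select;
    uint16_t queue_avail_state[MODEL_NUM_QUEUES]; /* windowed by queue_select */
    uint16_t queue_used_state[MODEL_NUM_QUEUES];
    unsigned reg_accesses; /* simulated register reads and writes */
};

struct saved_vq_state {
    uint16_t avail;
    uint16_t used;
};

static void cfg_select_queue(struct model_common_cfg *c, uint16_t q)
{
    c->queue_select = q;
    c->reg_accesses++;
}

static uint16_t cfg_read_avail_state(struct model_common_cfg *c)
{
    c->reg_accesses++;
    return c->queue_avail_state[c->queue_select];
}

static uint16_t cfg_read_used_state(struct model_common_cfg *c)
{
    c->reg_accesses++;
    return c->queue_used_state[c->queue_select];
}

/* Save the state of the first n queues, one queue at a time.
 * Returns the total number of simulated register accesses so far. */
static unsigned save_all_vq_state(struct model_common_cfg *c,
                                  struct saved_vq_state *out, uint16_t n)
{
    if (n > MODEL_NUM_QUEUES)
        n = MODEL_NUM_QUEUES;
    for (uint16_t q = 0; q < n; q++) {
        cfg_select_queue(c, q);
        out[q].avail = cfg_read_avail_state(c);
        out[q].used = cfg_read_used_state(c);
    }
    return c->reg_accesses;
}
```

The model costs three serialized register accesses per queue. Whether that is acceptable at suspend and resume time is exactly the disagreement in this subthread, and the sketch does not settle it.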

> > > See the virtio common cfg, you will find the max number of vqs is there,
> > > num_queues.
> > :)
> > Sure. those values at high q count affects.
> the driver needs to initialize them anyway.
> > 
> > > > Device didn’t support LM.
> > > > Many limitations existed all these years and TC is improving and expanding
> > > them.
> > > > So all these years do not matter.
> > > Not sure what you are talking about; haven't we initialized the device and vqs in
> > > config space for years?????? What's wrong with this mechanism?
> > > Are you questioning virtio-pci fundamentals???
> > Don’t point to in-efficient past to establish similar in-efficient future.
> interesting, you know this is a one-time thing, right?
> and you are aware that this has been there for years.
> > 
> > > > > > > Like how to set a queue size and enable it?
> > > > > > Those are meant to be used before DRIVER_OK stage as they are init
> > > > > > time
> > > > > registers.
> > > > > > Not to keep abusing them..
> > > > > don't you need to set queue_size at the destination side?
> > > > No.
> > > > But the src/dst does not matter.
> > > > Queue_size to be set before DRIVER_OK like rest of the registers, as all
> > > queues must be created before the driver_ok phase.
> > > > Queue_reset was last moment exception.
> > > create a queue? Nvidia specific?
> > > 
> > Huh. No.
> > Do git log and realize what happened with queue_reset.
> You didn't answer the question: does the spec even define "create a
> vq"?
> > 
> > > For standard virtio, you need to read the number of enabled vqs at the source
> > > side, then enable them at the dst, so queue_size matters, not to create.
> > All that happens in the pre-copy phase.
> Yes, and how is your answer related to this discussion?



* [virtio-comment] Re: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-07  9:27     ` Zhu, Lingshan
@ 2023-11-08 17:46       ` Michael S. Tsirkin
  2023-11-09  9:58         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:46 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > > When SUSPEND is set, device states and virtqueue states
> > > should be stabilized, therefore the driver should not
> > > reset vqs when SUSPEND is set in device status.
> > > 
> > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > ---
> > >   content.tex | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/content.tex b/content.tex
> > > index bcc9d4b..060b5c2 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic Facilities of a Virtio Device /
> > >   The device MUST reset any state of a virtqueue to the default state,
> > >   including the available state and the used state.
> > > +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in \field{device status},
> > > +the driver SHOULD NOT reset any virtqueues.
> > > +
> > >   \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > >   After the driver tells the device to reset a queue, the driver MUST verify that
> > Seems somewhat arbitrary and breaks the claim that the
> > feature is orthogonal and can have uses besides migration.
> when suspended, the device is frozen.
> The driver is aware of this process and so should not reset the vqs I think.

Again that is only true because you want to use it for migration.
But then you can't claim it's a generic facility.

> > 
> > 
> > 
> > > -- 
> > > 2.35.3



* Re: [virtio-comment] Re: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-07  9:09     ` Zhu, Lingshan
@ 2023-11-08 17:55       ` Michael S. Tsirkin
  2023-11-09  9:55         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:55 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Tue, Nov 07, 2023 at 05:09:06PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/6/2023 5:43 PM, Michael S. Tsirkin wrote:
> 
>     On Fri, Nov 03, 2023 at 06:34:33PM +0800, Zhu Lingshan wrote:
> 
>         This patch introduces a new status bit in the device status: SUSPEND.
> 
>         This SUSPEND bit can be used by the driver to suspend a device,
>         in order to stabilize the device states and virtqueue states.
> 
>         Its main use case is live migration.
> 
>         Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>         Signed-off-by: Jason Wang <jasowang@redhat.com>
>         Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>         ---
>          content.tex | 36 ++++++++++++++++++++++++++++++++++--
>          1 file changed, 34 insertions(+), 2 deletions(-)
> 
>         diff --git a/content.tex b/content.tex
>         index 76813b5..bcc9d4b 100644
>         --- a/content.tex
>         +++ b/content.tex
>         @@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
> 
>          \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>            an error from which it can't recover.
>         +
>         +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that the
>         +  device has been suspended by the driver.
>         +
> 
>     what does this mean?
> 
> When the driver sets SUSPEND and the device presents SUSPEND, it means
> the device has been suspended by the driver.
> 
> Do you suggest removing "When VIRTIO_F_SUSPEND is negotiated"?
> 


No, I suggest explaining what it means for a device to have been
suspended.

> 
>          \end{description}
> 
>          The \field{device status} field starts out as 0, and is reinitialized to 0 by
>         @@ -73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>          recover by issuing a reset.
>          \end{note}
> 
>         +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
>         +
>         +When setting SUSPEND, the driver MUST re-read \field{device status} to ensure the SUSPEND bit is set.
>         +
> 
>     and if it's not?
> 
> Then the device may run into errors or just need longer time to suspend.

and then what does driver do?

> This is how we handle features_OK: "Re-read device status to ensure the
> FEATURES_OK bit is still set"

That is designed for the case where features are inconsistent.
What kind of thing are you handling here?
Also, in that case we bail out and retry with another feature set if the bit is not set.
What do we do here?
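To make the question concrete, here is a minimal C sketch of the driver-side sequence as the patch text describes it: write SUSPEND, then re-read the device status. Everything about the device model is hypothetical (struct susp_dev, polls_left, and the idea that SUSPEND shows up after some number of reads); only FEATURES_OK (8) and the proposed SUSPEND (16) come from the spec and the patch. The polling bound max_polls is an assumption, because the patch does not say how long to wait or what to do on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_FEATURES_OK 8u /* existing virtio device status bit */
#define STATUS_SUSPEND 16u    /* value proposed in this patch */

/* Hypothetical device model: SUSPEND becomes visible after some reads. */
struct susp_dev {
    uint8_t status;
    unsigned polls_left; /* reads still needed before SUSPEND is presented */
    bool suspend_pending;
};

static void dev_write_status(struct susp_dev *d, uint8_t v)
{
    bool want = (v & STATUS_SUSPEND) != 0;

    /* The device ignores SUSPEND unless FEATURES_OK is set. */
    if (want && (v & STATUS_FEATURES_OK) && !(d->status & STATUS_SUSPEND))
        d->suspend_pending = true;
    /* SUSPEND itself is presented by the device, not latched by the write. */
    d->status = (uint8_t)((v & ~STATUS_SUSPEND) | (d->status & STATUS_SUSPEND));
}

static uint8_t dev_read_status(struct susp_dev *d)
{
    if (d->suspend_pending) {
        if (d->polls_left == 0) {
            d->status |= STATUS_SUSPEND;
            d->suspend_pending = false;
        } else {
            d->polls_left--;
        }
    }
    return d->status;
}

/* Driver side: set SUSPEND, then re-read until it is presented or we give up.
 * What the driver should do when this returns false is the open question. */
static bool driver_suspend(struct susp_dev *d, unsigned max_polls)
{
    dev_write_status(d, (uint8_t)(dev_read_status(d) | STATUS_SUSPEND));
    for (unsigned i = 0; i < max_polls; i++) {
        if (dev_read_status(d) & STATUS_SUSPEND)
            return true;
    }
    return false;
}
```

The false return path is left unhandled on purpose: unlike the FEATURES_OK case, there is no obvious "retry with something else" step for the driver to take here.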

> 
> 
>          \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
> 
>          The device MUST NOT consume buffers or send any used buffer
>         @@ -82,6 +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>          that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>          MUST send a device configuration change notification to the driver.
> 
>         +The device MUST ignore SUSPEND if FEATURES_OK is not set.
>         +
>         +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
>         +
>         +The device SHOULD allow settings to \field{device status} even when SUSPEND is set.
> 
>     which settings?
> 
> any legit writing to the device status, like DRIVER_OK
> 

that's not "settings" that's setting.

> 
>         +
>         +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD clear SUSPEND
>         +and resumes operation upon DRIVER_OK.
>         +
>         +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND,
>         +the device SHOULD perform the following actions before presenting SUSPEND bit in the \field{device status}:
>         +
>         +\begin{itemize}
>         +\item Stop consuming buffers of any virtqueues and mark all finished descritors as used.
> 
>     descritors? and what does finished mean?
> 
> Sorry my typo.
> 
> Finished means done processing it.
> 
> Like the spec words: When the device has finished a buffer, it writes the
> descriptor index into the used ring, and sends a used buffer notification.
>

only good for internal ring documentation - we did not want to say
used because it's not used until used entry written.
here just say used.

> 
> 
>         +\item Wait until all descriptors that being processed to finish and mark them as used.
> 
>     descriptors are not marked used. buffers are.
> 
>     that being -> that are being maybe?
> 
> Will fix
> 
> 
> 
>         +\item Flush all used buffer and send used buffer notifications to the driver.
> 
>     used buffers?
> 
> Here it means the buffer marked as used.
> shall I use finished buffer or any other suggestions?
> 
>     what does Flush mean?
> 
> Flush means send all of them out. Like 5.19.7.1 Device Requirements: Device
> Operation: Virtqueue flush
> 
> 
> 
> 
>         +\item Record Virtqueue State of each enabled virtqueue, see section \ref{sec:Virtqueues / Virtqueue State}
> 
>     except that one unfortunately does not bother to say what this means
>     :(
> 
> The virtqueue state has been defined in this series, in packed/split-ring.tex.
> And a PCI implementation of the interfaces is included.
> 
> Do you suggest any supplementary materials?
> 

I suggest something that documents what it means unlike what is
in this series.


> 
>         +\item Pause its operation except \field{device status} and preserve configurations in its Device Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
> 
>     How do you Pause? For example, consider a link state register. You set
> 
> The device pauses itself.
> 
>     SUSPEND, then link goes down. What is device supposed to do?
> 
> Once the device is suspended, the device should not respond to the link_down
> until it is alive again.

what is link_down?
link goes down just by virtue of disconnecting it physically.

> This is to preserve the device states, just record
> whatever it is when SUSPEND-ed. And process the signal when resume or
> alive at the destination side. At the destination it also needs a
> live announce which require an active link.
> 
>     Record this somewhere internal but do not show it to driver?
>     And how exactly will this hidden internal state be migrated
>     since it is not visible?
> 
> May I know what kind of internal states?
> This series migrates stateless devices, hard to define virtio-fs device
> context.
> 

all devices have some state, none are completely stateless.


> 
> 
>         +\end{itemize}
>         +
>          \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
> 
>          Each virtio device offers all the features it understands.  During
>         @@ -99,10 +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
>          \begin{description}
>          \item[0 to 23, and 50 to 127] Feature bits for the specific device type
> 
>         -\item[24 to 42] Feature bits reserved for extensions to the queue and
>         +\item[24 to 43] Feature bits reserved for extensions to the queue and
>            feature negotiation mechanisms
> 
>         -\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
>         +\item[44 to 49, and 128 and above] Feature bits reserved for future extensions.
>          \end{description}
> 
>          \begin{note}
>         @@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>            \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
>            to access its internal virtqueue state.
> 
>         +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
>         +   SUSPEND the device.
> 
>     why is SUSPEND upper-case here?
> 
> will be lower in V3.
> 
> Thanks
> 
> 
> 
>         +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
>         +
>          \end{description}
> 
>          \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>         --
>         2.35.3
> 
> 
> 
> 
> 


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-03 10:34 ` [virtio-comment] [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE Zhu Lingshan
  2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
@ 2023-11-08 17:56   ` Michael S. Tsirkin
  2023-11-13  9:29     ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-08 17:56 UTC (permalink / raw)
  To: Zhu Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Fri, Nov 03, 2023 at 06:34:35PM +0800, Zhu Lingshan wrote:
> This patch adds two new le16 fields to common configuration structure
> to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> 
> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> ---
>  transport-pci.tex | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/transport-pci.tex b/transport-pci.tex
> index a5c6719..3161519 100644
> --- a/transport-pci.tex
> +++ b/transport-pci.tex
> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>          /* About the administration virtqueue. */
>          le16 admin_queue_index;         /* read-only for driver */
>          le16 admin_queue_num;         /* read-only for driver */
> +
> +	/* Virtqueue state */
> +        le16 queue_avail_state;         /* read-write */
> +        le16 queue_used_state;          /* read-write */
>  };
>  \end{lstlisting}
>  
> @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>  	The value 0 indicates no supported administration virtqueues.
>  	This field is valid only if VIRTIO_F_ADMIN_VQ has been
>  	negotiated.
> +
> +\item[\field{queue_avail_state}]
> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> +        negotiated. The driver sets and gets the available state of
> +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> +
> +\item[\field{queue_used_state}]
> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> +        negotiated. The driver sets and gets the used state of the
> +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> +
>  \end{description}
>  
>  \devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}


Two fields are pointless in the general case.  Fix this to at least
support out of order buffer use, then there's something to talk about.
I suspect we'll be back to yet another bespoke mailbox and a bitmap for
this.
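
The out-of-order concern can be sketched with a toy model. This is only an illustration under stated assumptions, not part of the patch: it assumes a queue that never wraps, and the names `vq_model`, `fetch`, `complete` and `two_indices_suffice` are hypothetical. It checks whether the two saved indices alone identify the in-flight buffers, which holds only when the in-flight set is exactly the range [last_used_idx, last_avail_idx).

```c
#include <stdbool.h>
#include <stdint.h>

#define QSZ 8

/* Hypothetical model of one virtqueue's progress (no index wrap-around). */
struct vq_model {
    uint16_t last_avail_idx;  /* number of buffers the device has fetched */
    uint16_t last_used_idx;   /* number of buffers the device has marked used */
    bool     done[QSZ];       /* done[i]: the i-th fetched buffer is completed */
};

/* The device fetches the next available buffer. */
static void fetch(struct vq_model *q)
{
    q->done[q->last_avail_idx % QSZ] = false;
    q->last_avail_idx++;
}

/* The device completes the buffer fetched at position pos, in any order. */
static void complete(struct vq_model *q, uint16_t pos)
{
    q->done[pos % QSZ] = true;
    q->last_used_idx++;
}

/*
 * Returns true when the two indices fully describe the queue, i.e. every
 * position below last_used_idx is completed and every position from
 * last_used_idx up to last_avail_idx is still in flight.
 */
static bool two_indices_suffice(const struct vq_model *q)
{
    for (uint16_t pos = 0; pos < q->last_avail_idx; pos++) {
        bool expect_done = pos < q->last_used_idx;
        if (q->done[pos % QSZ] != expect_done)
            return false;
    }
    return true;
}
```

In this model, fetching three buffers and completing positions 0 and 1 leaves a state the two indices describe. Fetching three and completing only position 2 gives last_avail_idx = 3 and last_used_idx = 1, yet buffers 0 and 1 are the ones still pending, which the index pair cannot express.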


> @@ -488,6 +503,9 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>  present either a value of 0 or a power of 2 in
>  \field{queue_size}.
>  
> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
> +any accesses to \field{queue_avail_state} and \field{queue_used_state}.
> +
>  If VIRTIO_F_ADMIN_VQ has been negotiated, the value
>  \field{admin_queue_index} MUST be equal to, or bigger than
>  \field{num_queues}; also, \field{admin_queue_num} MUST be
> -- 
> 2.35.3



^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-07  9:31                 ` Zhu, Lingshan
  2023-11-08 17:44                   ` Michael S. Tsirkin
@ 2023-11-09  6:28                   ` Parav Pandit
  2023-11-09  8:41                     ` Michael S. Tsirkin
  2023-11-09 10:09                     ` Zhu, Lingshan
  1 sibling, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-09  6:28 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Tuesday, November 7, 2023 3:02 PM
> 
> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 6, 2023 2:57 PM
> >>
> >> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>
> >>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
> >>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>
> >>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>
> >>>>>>>> This patch adds two new le16 fields to common configuration
> >>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> ---
> >>>>>>>>      transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>      1 file changed, 18 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>>>>>> a5c6719..3161519 100644
> >>>>>>>> --- a/transport-pci.tex
> >>>>>>>> +++ b/transport-pci.tex
> >>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> >>>> structure
> >>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>              /* About the administration virtqueue. */
> >>>>>>>>              le16 admin_queue_index;         /* read-only for driver */
> >>>>>>>>              le16 admin_queue_num;         /* read-only for driver */
> >>>>>>>> +
> >>>>>>>> +	/* Virtqueue state */
> >>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>> This tiny interface for 128 virtio net queues through register
> >>>>>>> read writes, does
> >>>>>> not work effectively.
> >>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>> Hence toy registers like this do not work.
> >>>>>> Do you know there is a queue_select? Why this does not work? Do
> >>>>>> you know how other queue related fields work?
> >>>>> :)
> >>>>> Yes. If you notice queue_reset related critical spec bug fix was
> >>>>> done when it
> >>>> was introduced so that live migration can _actually_ work.
> >>>>> When queue_select is done for 128 queues serially, it take a lot
> >>>>> of time to
> >>>> read those slow register interface for this + inflight descriptors + more.
> >>>> interesting, virtio work in this pattern for many years, right?
> >>> All these years 400Gbps and 800Gbps virtio was not present, number
> >>> of
> >> queues were not in hw.
> >> The registers are control path in config space, how 400G or 800G affect??
> > Because those are the one in practice requires large number of VQs.
> >
> > You are asking per VQ register commands to modify things dynamically via
> this one vq at a time, serializing all the operations.
> > It does not scale well with high q count.
> This is not dynamically, it only happens when SUSPEND and RESUME.
> This is the same mechanism how virtio initialize a virtqueue, working for many
> years.
No. When the virtio driver initializes the queues for the first time, there is no active traffic to lose,
because the interface is not yet up and not yet part of the network.

The resume must be fast, because the remote node is already sending packets.
Hence it is different from enabling queues at driver init time.

> >> See the virtio common cfg, you will find the max number of vqs is
> >> there, num_queues.
> > :)
> > Sure. those values at high q count affects.
> the driver need to initialize them anyway.
That happens before traffic starts arriving from the remote end.

> >
> >>> Device didn’t support LM.
> >>> Many limitations existed all these years and TC is improving and
> >>> expanding
> >> them.
> >>> So all these years do not matter.
> >> Not sure what are you talking about, haven't we initialize the device
> >> and vqs in config space for years?????? What's wrong with this mechanism?
> >> Are you questioning virito-pci fundamentals???
> > Don’t point to in-efficient past to establish similar in-efficient future.
> interesting, you know this is a one-time thing, right?
> and you are aware of this has been there for years.
> >
> >>>>>> Like how to set a queue size and enable it?
> >>>>> Those are meant to be used before DRIVER_OK stage as they are init
> >>>>> time
> >>>> registers.
> >>>>> Not to keep abusing them..
> >>>> don't you need to set queue_size at the destination side?
> >>> No.
> >>> But the src/dst does not matter.
> >>> Queue_size to be set before DRIVER_OK like rest of the registers, as
> >>> all
> >> queues must be created before the driver_ok phase.
> >>> Queue_reset was last moment exception.
> >> create a queue? Nvidia specific?
> >>
> > Huh. No.
> > Do git log and realize what happened with queue_reset.
> You didn't answer the question, does the spec even has defined "create a vq"?

Enabled/created = tomato/tomato when discussing the spec in non-normative email conversation.
It's irrelevant.

All I am saying is: when we know the limitations of the transport, and when the industry is moving away from introducing more and more on-die registers for the once-in-a-lifetime work of device migration,
we should just use the optimal command and queue interface that is native to virtio.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09  6:28                   ` Parav Pandit
@ 2023-11-09  8:41                     ` Michael S. Tsirkin
  2023-11-09  9:10                       ` Parav Pandit
  2023-11-09 10:09                     ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-09  8:41 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Thu, Nov 09, 2023 at 06:28:17AM +0000, Parav Pandit wrote:
> > >> The registers are control path in config space, how 400G or 800G affect??
> > > Because those are the one in practice requires large number of VQs.
> > >
> > > You are asking per VQ register commands to modify things dynamically via
> > this one vq at a time, serializing all the operations.
> > > It does not scale well with high q count.
> > This is not dynamically, it only happens when SUSPEND and RESUME.
> > This is the same mechanism how virtio initialize a virtqueue, working for many
> > years.
> No. when virtio driver initializes it for the first time, there is no active traffic that gets lost.
> This is because the interface is not yet up and not part of the network yet.
> 
> The resume must be fast enough, because the remote node is sending packets.
> Hence it is different from driver init time queue enable.

Maybe but I think not qualitatively different.
If you care about these things please provide some estimates.
I just don't see how queue resume writes even with 64k queues will
saturate an express link.
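
One way to frame the estimate asked for here is to count the accesses. This is a rough sketch, not a measurement: it assumes each queue costs one `queue_select` write plus one access to each of `queue_avail_state` and `queue_used_state`, and the per-access latency is a placeholder parameter that must come from real hardware. The function names are hypothetical.

```c
#include <stdint.h>

/* One queue_select write plus one access to each of the two state fields. */
enum { ACCESSES_PER_QUEUE = 3 };

/* Total config-space accesses to get or set the state of nqueues queues. */
static uint64_t total_accesses(uint32_t nqueues)
{
    return (uint64_t)nqueues * ACCESSES_PER_QUEUE;
}

/* Bytes of state actually moved: two le16 fields per queue. */
static uint64_t payload_bytes(uint32_t nqueues)
{
    return (uint64_t)nqueues * 2 * sizeof(uint16_t);
}

/*
 * Time in nanoseconds if every access is serialized.
 * latency_ns is an ASSUMED per-access cost, not a measured value.
 */
static uint64_t serialized_time_ns(uint32_t nqueues, uint64_t latency_ns)
{
    return total_accesses(nqueues) * latency_ns;
}
```

For 65536 queues this gives 196608 accesses moving 262144 bytes of state, so link bandwidth is not the limit. The serialized time scales linearly with whatever per-access latency is plugged in, and that is the number the two sides would need to agree on.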


> > >> See the virtio common cfg, you will find the max number of vqs is
> > >> there, num_queues.
> > > :)
> > > Sure. those values at high q count affects.
> > the driver need to initialize them anyway.
> That is before the traffic starts from remote end.
> 
> > >
> > >>> Device didn’t support LM.
> > >>> Many limitations existed all these years and TC is improving and
> > >>> expanding
> > >> them.
> > >>> So all these years do not matter.
> > >> Not sure what are you talking about, haven't we initialize the device
> > >> and vqs in config space for years?????? What's wrong with this mechanism?
> > >> Are you questioning virito-pci fundamentals???
> > > Don’t point to in-efficient past to establish similar in-efficient future.
> > interesting, you know this is a one-time thing, right?
> > and you are aware of this has been there for years.
> > >
> > >>>>>> Like how to set a queue size and enable it?
> > >>>>> Those are meant to be used before DRIVER_OK stage as they are init
> > >>>>> time
> > >>>> registers.
> > >>>>> Not to keep abusing them..
> > >>>> don't you need to set queue_size at the destination side?
> > >>> No.
> > >>> But the src/dst does not matter.
> > >>> Queue_size to be set before DRIVER_OK like rest of the registers, as
> > >>> all
> > >> queues must be created before the driver_ok phase.
> > >>> Queue_reset was last moment exception.
> > >> create a queue? Nvidia specific?
> > >>
> > > Huh. No.
> > > Do git log and realize what happened with queue_reset.
> > You didn't answer the question, does the spec even has defined "create a vq"?
> 
> Enabled/created = tomato/tomato when discussing the spec in non-normative email conversation.
> It's irrelevant.
> 
> All I am saying is, when we know the limitations of the transport and when industry is forwarding to not introduced more and more on-die register for once in lifetime work of device migration,
> we just use the optimal command and queue interface that is native to virtio.

We really do not need to prematurely optimize everything.
The control path is the control path: it is going to be slow, because
virtio designed it to be slow and drivers don't optimize it.
Shaving off a microsecond here or there is going to do nothing
except increase maintenance costs.

-- 
MST



^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09  8:41                     ` Michael S. Tsirkin
@ 2023-11-09  9:10                       ` Parav Pandit
  2023-11-09  9:53                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-09  9:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 9, 2023 2:12 PM
> 
> On Thu, Nov 09, 2023 at 06:28:17AM +0000, Parav Pandit wrote:
> > > >> The registers are control path in config space, how 400G or 800G affect??
> > > > Because those are the one in practice requires large number of VQs.
> > > >
> > > > You are asking per VQ register commands to modify things
> > > > dynamically via
> > > this one vq at a time, serializing all the operations.
> > > > It does not scale well with high q count.
> > > This is not dynamically, it only happens when SUSPEND and RESUME.
> > > This is the same mechanism how virtio initialize a virtqueue,
> > > working for many years.
> > No. when virtio driver initializes it for the first time, there is no active traffic
> that gets lost.
> > This is because the interface is not yet up and not part of the network yet.
> >
> > The resume must be fast enough, because the remote node is sending
> packets.
> > Hence it is different from driver init time queue enable.
> 
> Maybe but I think not qualitatively different.
> If you care about these things please provide some estimates.
> I just don't see how queue resume writes even with 64k queues will saturate an
> express link.
> 
It is not the bw of the link.
It is how the need for the device to be always read to suspend those many queues through register interface.

> 
> > > >> See the virtio common cfg, you will find the max number of vqs is
> > > >> there, num_queues.
> > > > :)
> > > > Sure. those values at high q count affects.
> > > the driver need to initialize them anyway.
> > That is before the traffic starts from remote end.
> >
> > > >
> > > >>> Device didn’t support LM.
> > > >>> Many limitations existed all these years and TC is improving and
> > > >>> expanding
> > > >> them.
> > > >>> So all these years do not matter.
> > > >> Not sure what are you talking about, haven't we initialize the
> > > >> device and vqs in config space for years?????? What's wrong with this
> mechanism?
> > > >> Are you questioning virito-pci fundamentals???
> > > > Don’t point to in-efficient past to establish similar in-efficient future.
> > > interesting, you know this is a one-time thing, right?
> > > and you are aware of this has been there for years.
> > > >
> > > >>>>>> Like how to set a queue size and enable it?
> > > >>>>> Those are meant to be used before DRIVER_OK stage as they are
> > > >>>>> init time
> > > >>>> registers.
> > > >>>>> Not to keep abusing them..
> > > >>>> don't you need to set queue_size at the destination side?
> > > >>> No.
> > > >>> But the src/dst does not matter.
> > > >>> Queue_size to be set before DRIVER_OK like rest of the
> > > >>> registers, as all
> > > >> queues must be created before the driver_ok phase.
> > > >>> Queue_reset was last moment exception.
> > > >> create a queue? Nvidia specific?
> > > >>
> > > > Huh. No.
> > > > Do git log and realize what happened with queue_reset.
> > > You didn't answer the question, does the spec even has defined "create a
> vq"?
> >
> > Enabled/created = tomato/tomato when discussing the spec in non-normative
> email conversation.
> > It's irrelevant.
> >
> > All I am saying is, when we know the limitations of the transport and
> > when industry is forwarding to not introduced more and more on-die register
> for once in lifetime work of device migration, we just use the optimal command
> and queue interface that is native to virtio.
> 
> we really do not need to prematurely optimize all things.
> control path is control path it is going to be slow because virtio designed it to be
> slow and drivers don't optimize it.
> Shaving off a microsecond here or there is going to do nothing except increase
> maintainance costs.

After device initialization, the control path for the most part goes through the cvq.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09  9:10                       ` Parav Pandit
@ 2023-11-09  9:53                         ` Michael S. Tsirkin
  2023-11-09 10:11                           ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-09  9:53 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Thu, Nov 09, 2023 at 09:10:55AM +0000, Parav Pandit wrote:
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Thursday, November 9, 2023 2:12 PM
> > 
> > On Thu, Nov 09, 2023 at 06:28:17AM +0000, Parav Pandit wrote:
> > > > >> The registers are control path in config space, how 400G or 800G affect??
> > > > > Because those are the one in practice requires large number of VQs.
> > > > >
> > > > > You are asking per VQ register commands to modify things
> > > > > dynamically via
> > > > this one vq at a time, serializing all the operations.
> > > > > It does not scale well with high q count.
> > > > This is not dynamically, it only happens when SUSPEND and RESUME.
> > > > This is the same mechanism how virtio initialize a virtqueue,
> > > > working for many years.
> > > No. when virtio driver initializes it for the first time, there is no active traffic
> > that gets lost.
> > > This is because the interface is not yet up and not part of the network yet.
> > >
> > > The resume must be fast enough, because the remote node is sending
> > packets.
> > > Hence it is different from driver init time queue enable.
> > 
> > Maybe but I think not qualitatively different.
> > If you care about these things please provide some estimates.
> > I just don't see how queue resume writes even with 64k queues will saturate an
> > express link.
> > 
> It is not the bw of the link.
> It is how the need for the device to be always read to suspend those many queues through register interface.

read->ready?

If we want to we can solve it btw. Prohibit changing queue select
while suspend is in progress.

But more importantly once Zhu Lingshan stops arguing about suspend bit
he'll hopefully work on transport vq. Then we can have it in config
space and at the same time not use up a register.


> > 
> > > > >> See the virtio common cfg, you will find the max number of vqs is
> > > > >> there, num_queues.
> > > > > :)
> > > > > Sure. those values at high q count affects.
> > > > the driver need to initialize them anyway.
> > > That is before the traffic starts from remote end.
> > >
> > > > >
> > > > >>> Device didn’t support LM.
> > > > >>> Many limitations existed all these years and TC is improving and
> > > > >>> expanding
> > > > >> them.
> > > > >>> So all these years do not matter.
> > > > >> Not sure what are you talking about, haven't we initialize the
> > > > >> device and vqs in config space for years?????? What's wrong with this
> > mechanism?
> > > > >> Are you questioning virito-pci fundamentals???
> > > > > Don’t point to in-efficient past to establish similar in-efficient future.
> > > > interesting, you know this is a one-time thing, right?
> > > > and you are aware of this has been there for years.
> > > > >
> > > > >>>>>> Like how to set a queue size and enable it?
> > > > >>>>> Those are meant to be used before DRIVER_OK stage as they are
> > > > >>>>> init time
> > > > >>>> registers.
> > > > >>>>> Not to keep abusing them..
> > > > >>>> don't you need to set queue_size at the destination side?
> > > > >>> No.
> > > > >>> But the src/dst does not matter.
> > > > >>> Queue_size to be set before DRIVER_OK like rest of the
> > > > >>> registers, as all
> > > > >> queues must be created before the driver_ok phase.
> > > > >>> Queue_reset was last moment exception.
> > > > >> create a queue? Nvidia specific?
> > > > >>
> > > > > Huh. No.
> > > > > Do git log and realize what happened with queue_reset.
> > > > You didn't answer the question, does the spec even has defined "create a
> > vq"?
> > >
> > > Enabled/created = tomato/tomato when discussing the spec in non-normative
> > email conversation.
> > > It's irrelevant.
> > >
> > > All I am saying is, when we know the limitations of the transport and
> > > when industry is forwarding to not introduced more and more on-die register
> > for once in lifetime work of device migration, we just use the optimal command
> > and queue interface that is native to virtio.
> > 
> > we really do not need to prematurely optimize all things.
> > control path is control path it is going to be slow because virtio designed it to be
> > slow and drivers don't optimize it.
> > Shaving off a microsecond here or there is going to do nothing except increase
> > maintainance costs.
> 
> Control path post device initialization for large part is through the cvq.

Really depends on the device. For virtio net most of the time is
spent on filling up receive queues and sending network announcements
and such.

-- 
MST



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status
  2023-11-08 17:55       ` Michael S. Tsirkin
@ 2023-11-09  9:55         ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-09  9:55 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav




On 11/9/2023 1:55 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 05:09:06PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 5:43 PM, Michael S. Tsirkin wrote:
>>
>>      On Fri, Nov 03, 2023 at 06:34:33PM +0800, Zhu Lingshan wrote:
>>
>>          This patch introduces a new status bit in the device status: SUSPEND.
>>
>>          This SUSPEND bit can be used by the driver to suspend a device,
>>          in order to stabilize the device states and virtqueue states.
>>
>>          Its main use case is live migration.
>>
>>          Signed-off-by: Zhu Lingshan<lingshan.zhu@intel.com>
>>          Signed-off-by: Jason Wang<jasowang@redhat.com>
>>          Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>>          ---
>>           content.tex | 36 ++++++++++++++++++++++++++++++++++--
>>           1 file changed, 34 insertions(+), 2 deletions(-)
>>
>>          diff --git a/content.tex b/content.tex
>>          index 76813b5..bcc9d4b 100644
>>          --- a/content.tex
>>          +++ b/content.tex
>>          @@ -49,6 +49,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>
>>           \item[DEVICE_NEEDS_RESET (64)] Indicates that the device has experienced
>>             an error from which it can't recover.
>>          +
>>          +\item[SUSPEND (16)] When VIRTIO_F_SUSPEND is negotiated, indicates that the
>>          +  device has been suspended by the driver.
>>          +
>>
>>      what does this mean?
>>
>> When the driver sets SUSPEND and the device presents SUSPEND, means
>> the device has been suspended by the driver.
>>
>> Do you suggest to remove "When VIRTIO_F_SUSPEND is negotiated"
>>
>
> No I suggest explaining what does it mean that device has been
> suspended.
I will add "to stabilize the device states", is that OK?
>
>>           \end{description}
>>
>>           The \field{device status} field starts out as 0, and is reinitialized to 0 by
>>          @@ -73,6 +77,10 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>           recover by issuing a reset.
>>           \end{note}
>>
>>          +The driver SHOULD NOT set SUSPEND if FEATURES_OK is not set.
>>          +
>>          +When setting SUSPEND, the driver MUST re-read \field{device status} to ensure the SUSPEND bit is set.
>>          +
>>
>>      and if it's not?
>>
>> Then the device may run into errors or just need longer time to suspend.
> and then what does driver do?
I assume the driver should reset the device or give up the device.

Re-reading is how we handle other status bits, such as FEATURES_OK.
>
>> This is how we handle features_OK: "Re-read device status to ensure the
>> FEATURES_OK bit is still set"
> this is designed in case features are inconsistent.
I think this is designed to: 1) flush the device status, and 2) make sure
the features are set.
> what kind of thing are you handling here?
The same purpose, to flush the status and make sure suspend is set.
Once we make sure the device has been suspended, we can fetch
the device states.
> Also and then we bail out and retry with other feature set if not.
> what do we do here?
Same as above: if the device is not suspended (for example, vqs are still
running), we cannot fetch stable vq states. So we need to make sure the
bit is set.
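
The set-then-re-read sequence described above can be sketched as follows. This is a minimal model under assumptions, not the proposed normative behavior: `fake_dev` and its helpers are hypothetical stand-ins for a transport, VIRTIO_F_SUSPEND is assumed to be negotiated, SUSPEND uses the value 16 from this series, and what the driver does on failure is left to the caller because that point is still open in this thread.

```c
#include <stdint.h>

#define STATUS_FEATURES_OK        8u
#define STATUS_SUSPEND            16u  /* value proposed in this series */
#define STATUS_DEVICE_NEEDS_RESET 64u

/* Hypothetical stand-in for a device; not a real transport. */
struct fake_dev {
    uint8_t status;            /* device status as presented to the driver */
    int     suspend_requested; /* the driver has written SUSPEND */
    int     busy_reads;        /* reads that still lack SUSPEND; < 0 means never */
};

static void dev_write_status(struct fake_dev *d, uint8_t v)
{
    if (v == 0) {              /* writing 0 resets the device */
        d->status = 0;
        d->suspend_requested = 0;
        return;
    }
    /* SUSPEND is ignored unless FEATURES_OK is already set. */
    if ((v & STATUS_SUSPEND) && (d->status & STATUS_FEATURES_OK))
        d->suspend_requested = 1;
    /* The device presents SUSPEND itself, only once it has quiesced. */
    d->status = (uint8_t)((v & ~STATUS_SUSPEND) | (d->status & STATUS_SUSPEND));
}

static uint8_t dev_read_status(struct fake_dev *d)
{
    if (d->suspend_requested && d->busy_reads >= 0) {
        if (d->busy_reads == 0)
            d->status |= STATUS_SUSPEND;   /* device has finished suspending */
        else
            d->busy_reads--;               /* still draining in-flight work */
    }
    return d->status;
}

/*
 * Driver side: set SUSPEND, then re-read until the device presents it.
 * Returns 0 once SUSPEND is visible, -1 if it cannot be requested, the
 * device needs a reset, or max_polls re-reads pass without SUSPEND.
 */
static int driver_suspend(struct fake_dev *d, int max_polls)
{
    uint8_t s = dev_read_status(d);

    if (!(s & STATUS_FEATURES_OK))
        return -1;
    dev_write_status(d, (uint8_t)(s | STATUS_SUSPEND));

    for (int i = 0; i < max_polls; i++) {
        s = dev_read_status(d);
        if (s & STATUS_DEVICE_NEEDS_RESET)
            return -1;
        if (s & STATUS_SUSPEND)
            return 0;
    }
    return -1;
}
```

Only after `driver_suspend()` returns 0 would the driver read the virtqueue state; on -1 the caller decides whether to reset the device or give up on it.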
>
>>
>>           \devicenormative{\subsection}{Device Status Field}{Basic Facilities of a Virtio Device / Device Status Field}
>>
>>           The device MUST NOT consume buffers or send any used buffer
>>          @@ -82,6 +90,26 @@ \section{\field{Device Status} Field}\label{sec:Basic Facilities of a Virtio Dev
>>           that a reset is needed.  If DRIVER_OK is set, after it sets DEVICE_NEEDS_RESET, the device
>>           MUST send a device configuration change notification to the driver.
>>
>>          +The device MUST ignore SUSPEND if FEATURES_OK is not set.
>>          +
>>          +The device MUST ignore SUSPEND if VIRTIO_F_SUSPEND is not negotiated.
>>          +
>>          +The device SHOULD allow settings to \field{device status} even when SUSPEND is set.
>>
>>      which settings?
>>
>> any legit writing to the device status, like DRIVER_OK
>>
> that's not "settings" that's setting.
>
>>          +
>>          +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set, the device SHOULD clear SUSPEND
>>          +and resumes operation upon DRIVER_OK.
>>          +
>>          +If VIRTIO_F_SUSPEND is negotiated, when the driver sets SUSPEND,
>>          +the device SHOULD perform the following actions before presenting SUSPEND bit in the \field{device status}:
>>          +
>>          +\begin{itemize}
>>          +\item Stop consuming buffers of any virtqueues and mark all finished descritors as used.
>>
>>      descritors? and what does finished mean?
>>
>> Sorry my typo.
>>
>> Finished means done processing it.
>>
>> Like the spec words: When the device has finished a buffer, it writes the
>> descriptor index into the used ring, and sends a used buffer notification.
>>
> only good for internal ring documentation - we did not want to say
> used because it's not used until used entry written.
> here just say used.
I am confused. By "here just say used", do you suggest we say:
"Stop consuming buffers of any virtqueues and mark all used descriptors
as used"?

I am not a native speaker, and I am not sure this is better than "finished".
>
>>
>>          +\item Wait until all descriptors that being processed to finish and mark them as used.
>>
>>      descriptors are not marked used. buffers are.
>>
>>      that being -> that are being maybe?
>>
>> Will fix
>>
>>
>>
>>          +\item Flush all used buffer and send used buffer notifications to the driver.
>>
>>      used buffers?
>>
>> Here it means the buffer marked as used.
>> shall I use finished buffer or any other suggestions?
>>
>>      what does Flush mean?
>>
>> Flush means send all of them out. Like 5.19.7.1 Device Requirements: Device
>> Operation: Virtqueue flush
>>
>>
>>
>>
>>          +\item Record Virtqueue State of each enabled virtqueue, see section \ref{sec:Virtqueues / Virtqueue State}
>>
>>      execpt that one unfortunately does not bother to say what does this mean
>>      :(
>>
>> The virtqueue state has been defined in this series, in packed/split-ring.tex.
>> And an PCI implementation of the interfaces is included.
>>
>> Do you suggest any supplementary materials?
>>
> I suggest something that documents what it means unlike what is
> in this series.
How about: "Record Virtqueue State of each enabled virtqueue by the 
transport specific interfaces"?
>
>
>>          +\item Pause its operation except \field{device status} and preserve configurations in its Device Configuration Space, see \ref{sec:Basic Facilities of a Virtio Device / Device Configuration Space}
>>
>>      How do you Pause? For example, consider a link state register. You set
>>
>> The device pauses itself.
>>
>>      SUSPEND, then link goes down. What is device supposed to do?
>>
>> Once the device suspended, the device should not respond to the link_down
>> until alive again.
> what is link_down?
> link goes down just by virtue of disconnecting it physically.
To stabilize the device states, once suspended the device should not
respond to a link down event. This means:
1) if link down happens before suspend, the device detects it and sends
a config interrupt.
2) if link down happens after suspend, the device should not respond to it.
3) when the device resumes running, it detects the link down and sends
a config interrupt.
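
As a rough illustration of the three cases above, here is a small Python
model that defers the link-down notification while the device is suspended.
It is only a sketch: the class, method and attribute names are hypothetical
and are not defined anywhere in this series.

```python
from dataclasses import dataclass, field


@dataclass
class SuspendableNetDevice:
    """Toy model of a device that defers link events while suspended."""

    suspended: bool = False
    link_up: bool = True           # link state currently exposed to the driver
    pending_link_up: bool = True   # last physical link state the device observed
    config_interrupts: list = field(default_factory=list)

    def physical_link_change(self, up: bool) -> None:
        # Always remember the physical state; expose it only when running.
        self.pending_link_up = up
        if not self.suspended:
            self._expose_link_state()      # case 1: not suspended, notify now

    def suspend(self) -> None:
        self.suspended = True              # case 2: later events stay pending

    def resume(self) -> None:
        self.suspended = False
        self._expose_link_state()          # case 3: act on what was deferred

    def _expose_link_state(self) -> None:
        if self.link_up != self.pending_link_up:
            self.link_up = self.pending_link_up
            self.config_interrupts.append(
                "link_up" if self.link_up else "link_down")


dev = SuspendableNetDevice()
dev.suspend()
dev.physical_link_change(up=False)   # link drops while suspended
dev.resume()                         # notification is sent only now
```

Note that pending_link_up in this model is internal state that the driver
cannot see while the device is suspended, which is the kind of hidden state
the question above about migration is asking about.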
>
>> This is to preserve the device states, just record
>> whatever it is when SUSPEND-ed. And process the signal when resume or
>> alive at the destination side. At the destination it also needs a
>> live announce which require an active link.
>>
>>      Record this somewhere internal but do not show it to driver?
>>      And how exactly will this hidden internal state be migrated
>>      since it is not visible?
>>
>> May I know what kind of internal states?
>> This series migrates stateless devices, hard to define virtio-fs device
>> context.
>>
> all devices have some state, none are completely stateless.
Some of the states can be read from config space, and the hypervisor can
migrate them for sure.
Others, like those of virtio-fs, are hard to define for now, but we still
need to make progress on live migration.
>
>
>>
>>          +\end{itemize}
>>          +
>>           \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature Bits}
>>
>>           Each virtio device offers all the features it understands.  During
>>          @@ -99,10 +127,10 @@ \section{Feature Bits}\label{sec:Basic Facilities of a Virtio Device / Feature B
>>           \begin{description}
>>           \item[0 to 23, and 50 to 127] Feature bits for the specific device type
>>
>>          -\item[24 to 42] Feature bits reserved for extensions to the queue and
>>          +\item[24 to 43] Feature bits reserved for extensions to the queue and
>>             feature negotiation mechanisms
>>
>>          -\item[43 to 49, and 128 and above] Feature bits reserved for future extensions.
>>          +\item[44 to 49, and 128 and above] Feature bits reserved for future extensions.
>>           \end{description}
>>
>>           \begin{note}
>>          @@ -875,6 +903,10 @@ \chapter{Reserved Feature Bits}\label{sec:Reserved Feature Bits}
>>             \item[VIRTIO_F_QUEUE_STATE(42)] This feature indicates that the device allows the driver
>>             to access its internal virtqueue state.
>>
>>          +  \item[VIRTIO_F_SUSPEND(43)] This feature indicates that the driver can
>>          +   SUSPEND the device.
>>
>>      why is SUSPEND upper-case here?
>>
>> will be lower in V3.
>>
>> Thanks
>>
>>
>>
>>          +   See \ref{sec:Basic Facilities of a Virtio Device / Device Status Field}.
>>          +
>>           \end{description}
>>
>>           \drivernormative{\section}{Reserved Feature Bits}{Reserved Feature Bits}
>>          --
>>          2.35.3
>>
>>
>>
>>
>>

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-08 17:46       ` Michael S. Tsirkin
@ 2023-11-09  9:58         ` Zhu, Lingshan
  2023-11-09 10:15           ` [virtio-comment] " Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-09  9:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>>> When SUSPEND is set, device states and virtqueue states
>>>> should be stablized, therefore the driver should not
>>>> reset vqs when SUSPEND is set in device status.
>>>>
>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>> ---
>>>>    content.tex | 3 +++
>>>>    1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/content.tex b/content.tex
>>>> index bcc9d4b..060b5c2 100644
>>>> --- a/content.tex
>>>> +++ b/content.tex
>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic Facilities of a Virtio Device /
>>>>    The device MUST reset any state of a virtqueue to the default state,
>>>>    including the available state and the used state.
>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in \field{device status},
>>>> +the driver SHOULD NOT reset any virtqueues.
>>>> +
>>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>>>    After the driver tells the device to reset a queue, the driver MUST verify that
>>> Seems somewhat arbitrary and breaks the claim that the
>>> feature is orthogonal and can have uses besides migration.
>> when suspended, the device is frozen.
>> The driver is aware of this process and so should not reset the vqs I think.
> Again that is only true because you want to use it for migration.
> But then you can't claim it's a generic facility.
I don't get it. The device status is a basic facility.

We need to suspend the device by setting the SUSPEND bit, to stabilize the
device states for migration.
I think this can also be used for debugging.
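
A minimal sketch of the driver-side rule this patch proposes, written as a
Python model. VIRTIO_F_SUSPEND(43) is the feature bit number from this
series; the SUSPEND status bit value and the Driver class are hypothetical
and only serve the illustration.

```python
VIRTIO_F_SUSPEND = 43    # feature bit number proposed in this series
STATUS_SUSPEND = 0x10    # hypothetical bit value, not taken from the series


class Driver:
    """Toy driver that tracks negotiated features and the device status."""

    def __init__(self, negotiated_features):
        self.features = set(negotiated_features)
        self.device_status = 0
        self.reset_queues = []

    def may_reset_virtqueue(self):
        # The proposed rule: no virtqueue reset while SUSPEND is set,
        # provided VIRTIO_F_SUSPEND has been negotiated.
        suspended = bool(self.device_status & STATUS_SUSPEND)
        return not (VIRTIO_F_SUSPEND in self.features and suspended)

    def reset_virtqueue(self, index):
        if not self.may_reset_virtqueue():
            return False               # refuse: device state must stay stable
        self.reset_queues.append(index)
        return True


drv = Driver({VIRTIO_F_SUSPEND})
drv.reset_virtqueue(0)                 # allowed, device is running
drv.device_status |= STATUS_SUSPEND
drv.reset_virtqueue(1)                 # refused while suspended
```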
>
>>>
>>>
>>>> -- 
>>>> 2.35.3




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-08 17:44                   ` Michael S. Tsirkin
@ 2023-11-09 10:00                     ` Zhu, Lingshan
  2023-11-09 10:02                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-09 10:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>
>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>
>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>
>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>
>>>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>       transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>       1 file changed, 18 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>>>> structure
>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>               /* About the administration virtqueue. */
>>>>>>>>>>               le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>>>               le16 admin_queue_num;         /* read-only for driver */
>>>>>>>>>> +
>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>> This tiny interface for 128 virtio net queues through register
>>>>>>>>> read writes, does
>>>>>>>> not work effectively.
>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>> Do you know there is a queue_select? Why this does not work? Do you
>>>>>>>> know how other queue related fields work?
>>>>>>> :)
>>>>>>> Yes. If you notice queue_reset related critical spec bug fix was
>>>>>>> done when it
>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>> When queue_select is done for 128 queues serially, it take a lot of
>>>>>>> time to
>>>>>> read those slow register interface for this + inflight descriptors + more.
>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>> All these years 400Gbps and 800Gbps virtio was not present, number of
>>>> queues were not in hw.
>>>> The registers are control path in config space, how 400G or 800G affect??
>>> Because those are the one in practice requires large number of VQs.
>>>
>>> You are asking per VQ register commands to modify things dynamically via this one vq at a time, serializing all the operations.
>>> It does not scale well with high q count.
>> This is not dynamically, it only happens when SUSPEND and RESUME.
>> This is the same mechanism how virtio initialize a virtqueue, working for
>> many years.
> I wish we just had a transport vq already. That's the way to solve this
> not fighting individual bits.
Yeah, I agree. The transport vq is a queued task (I sent out V4 months
ago...). One by one... hard and tough work...
>
>>>> See the virtio common cfg, you will find the max number of vqs is there,
>>>> num_queues.
>>> :)
>>> Sure. those values at high q count affects.
>> the driver need to initialize them anyway.
>>>>> Device didn’t support LM.
>>>>> Many limitations existed all these years and TC is improving and expanding
>>>> them.
>>>>> So all these years do not matter.
>>>> Not sure what are you talking about, haven't we initialize the device and vqs in
>>>> config space for years?????? What's wrong with this mechanism?
>>>> Are you questioning virito-pci fundamentals???
>>> Don’t point to in-efficient past to establish similar in-efficient future.
>> interesting, you know this is a one-time thing, right?
>> and you are aware of this has been there for years.
>>>>>>>> Like how to set a queue size and enable it?
>>>>>>> Those are meant to be used before DRIVER_OK stage as they are init
>>>>>>> time
>>>>>> registers.
>>>>>>> Not to keep abusing them..
>>>>>> don't you need to set queue_size at the destination side?
>>>>> No.
>>>>> But the src/dst does not matter.
>>>>> Queue_size to be set before DRIVER_OK like rest of the registers, as all
>>>> queues must be created before the driver_ok phase.
>>>>> Queue_reset was last moment exception.
>>>> create a queue? Nvidia specific?
>>>>
>>> Huh. No.
>>> Do git log and realize what happened with queue_reset.
>> You didn't answer the question, does the spec even has defined "create a
>> vq"?
>>>> For standard virtio, you need to read the number of enabled vqs at the source
>>>> side, then enable them at the dst, so queue_size matters, not to create.
>>> All that happens in the pre-copy phase.
>> Yes and how your answer related to this discussion?




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09 10:00                     ` Zhu, Lingshan
@ 2023-11-09 10:02                       ` Michael S. Tsirkin
  2023-11-10  6:52                         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-09 10:02 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Thu, Nov 09, 2023 at 06:00:27PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Monday, November 6, 2023 2:57 PM
> > > > > 
> > > > > On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Monday, November 6, 2023 9:01 AM
> > > > > > > 
> > > > > > > On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
> > > > > > > > > Sent: Friday, November 3, 2023 8:27 PM
> > > > > > > > > 
> > > > > > > > > On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > > > > > > > > > From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > Sent: Friday, November 3, 2023 4:05 PM
> > > > > > > > > > > 
> > > > > > > > > > > This patch adds two new le16 fields to common configuration
> > > > > > > > > > > structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > >       transport-pci.tex | 18 ++++++++++++++++++
> > > > > > > > > > >       1 file changed, 18 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/transport-pci.tex b/transport-pci.tex index
> > > > > > > > > > > a5c6719..3161519 100644
> > > > > > > > > > > --- a/transport-pci.tex
> > > > > > > > > > > +++ b/transport-pci.tex
> > > > > > > > > > > @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> > > > > > > structure
> > > > > > > > > > > layout}\label{sec:Virtio Transport
> > > > > > > > > > >               /* About the administration virtqueue. */
> > > > > > > > > > >               le16 admin_queue_index;         /* read-only for driver */
> > > > > > > > > > >               le16 admin_queue_num;         /* read-only for driver */
> > > > > > > > > > > +
> > > > > > > > > > > +	/* Virtqueue state */
> > > > > > > > > > > +        le16 queue_avail_state;         /* read-write */
> > > > > > > > > > > +        le16 queue_used_state;          /* read-write */
> > > > > > > > > > This tiny interface for 128 virtio net queues through register
> > > > > > > > > > read writes, does
> > > > > > > > > not work effectively.
> > > > > > > > > > There are inflight out of order descriptors for block also.
> > > > > > > > > > Hence toy registers like this do not work.
> > > > > > > > > Do you know there is a queue_select? Why this does not work? Do you
> > > > > > > > > know how other queue related fields work?
> > > > > > > > :)
> > > > > > > > Yes. If you notice queue_reset related critical spec bug fix was
> > > > > > > > done when it
> > > > > > > was introduced so that live migration can _actually_ work.
> > > > > > > > When queue_select is done for 128 queues serially, it take a lot of
> > > > > > > > time to
> > > > > > > read those slow register interface for this + inflight descriptors + more.
> > > > > > > interesting, virtio work in this pattern for many years, right?
> > > > > > All these years 400Gbps and 800Gbps virtio was not present, number of
> > > > > queues were not in hw.
> > > > > The registers are control path in config space, how 400G or 800G affect??
> > > > Because those are the one in practice requires large number of VQs.
> > > > 
> > > > You are asking per VQ register commands to modify things dynamically via this one vq at a time, serializing all the operations.
> > > > It does not scale well with high q count.
> > > This is not dynamically, it only happens when SUSPEND and RESUME.
> > > This is the same mechanism how virtio initialize a virtqueue, working for
> > > many years.
> > I wish we just had a transport vq already. That's the way to solve this
> > not fighting individual bits.
> Yeah, I agree, transport is a queued task(sent out V4 months ago...), one by
> one... hard and tough work...

Frankly I think that should take precedence, then Parav will not get
annoyed each time you add a couple of registers.

> > > > > See the virtio common cfg, you will find the max number of vqs is there,
> > > > > num_queues.
> > > > :)
> > > > Sure. those values at high q count affects.
> > > the driver need to initialize them anyway.
> > > > > > Device didn’t support LM.
> > > > > > Many limitations existed all these years and TC is improving and expanding
> > > > > them.
> > > > > > So all these years do not matter.
> > > > > Not sure what are you talking about, haven't we initialize the device and vqs in
> > > > > config space for years?????? What's wrong with this mechanism?
> > > > > Are you questioning virito-pci fundamentals???
> > > > Don’t point to in-efficient past to establish similar in-efficient future.
> > > interesting, you know this is a one-time thing, right?
> > > and you are aware of this has been there for years.
> > > > > > > > > Like how to set a queue size and enable it?
> > > > > > > > Those are meant to be used before DRIVER_OK stage as they are init
> > > > > > > > time
> > > > > > > registers.
> > > > > > > > Not to keep abusing them..
> > > > > > > don't you need to set queue_size at the destination side?
> > > > > > No.
> > > > > > But the src/dst does not matter.
> > > > > > Queue_size to be set before DRIVER_OK like rest of the registers, as all
> > > > > queues must be created before the driver_ok phase.
> > > > > > Queue_reset was last moment exception.
> > > > > create a queue? Nvidia specific?
> > > > > 
> > > > Huh. No.
> > > > Do git log and realize what happened with queue_reset.
> > > You didn't answer the question, does the spec even has defined "create a
> > > vq"?
> > > > > For standard virtio, you need to read the number of enabled vqs at the source
> > > > > side, then enable them at the dst, so queue_size matters, not to create.
> > > > All that happens in the pre-copy phase.
> > > Yes and how your answer related to this discussion?




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09  6:28                   ` Parav Pandit
  2023-11-09  8:41                     ` Michael S. Tsirkin
@ 2023-11-09 10:09                     ` Zhu, Lingshan
  2023-11-09 10:25                       ` Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-09 10:09 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/9/2023 2:28 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Tuesday, November 7, 2023 3:02 PM
>>
>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>
>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>
>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu, Lingshan
>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>
>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>
>>>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>       transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>       1 file changed, 18 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>>>> structure
>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>               /* About the administration virtqueue. */
>>>>>>>>>>               le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>>>               le16 admin_queue_num;         /* read-only for driver */
>>>>>>>>>> +
>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>> This tiny interface for 128 virtio net queues through register
>>>>>>>>> read writes, does
>>>>>>>> not work effectively.
>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>> Do you know there is a queue_select? Why this does not work? Do
>>>>>>>> you know how other queue related fields work?
>>>>>>> :)
>>>>>>> Yes. If you notice queue_reset related critical spec bug fix was
>>>>>>> done when it
>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>> When queue_select is done for 128 queues serially, it take a lot
>>>>>>> of time to
>>>>>> read those slow register interface for this + inflight descriptors + more.
>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>> All these years 400Gbps and 800Gbps virtio was not present, number
>>>>> of
>>>> queues were not in hw.
>>>> The registers are control path in config space, how 400G or 800G affect??
>>> Because those are the one in practice requires large number of VQs.
>>>
>>> You are asking per VQ register commands to modify things dynamically via
>> this one vq at a time, serializing all the operations.
>>> It does not scale well with high q count.
>> This is not dynamically, it only happens when SUSPEND and RESUME.
>> This is the same mechanism how virtio initialize a virtqueue, working for many
>> years.
> No. when virtio driver initializes it for the first time, there is no active traffic that gets lost.
> This is because the interface is not yet up and not part of the network yet.
>
> The resume must be fast enough, because the remote node is sending packets.
> Hence it is different from driver init time queue enable.
I am not sure any packets arrive before a link announce at the 
destination side.
>
>>>> See the virtio common cfg, you will find the max number of vqs is
>>>> there, num_queues.
>>> :)
>>> Sure. those values at high q count affects.
>> the driver need to initialize them anyway.
> That is before the traffic starts from remote end.
see above, that needs a link announce and this is after re-initialization
>
>>>>> Device didn’t support LM.
>>>>> Many limitations existed all these years and TC is improving and
>>>>> expanding
>>>> them.
>>>>> So all these years do not matter.
>>>> Not sure what are you talking about, haven't we initialize the device
>>>> and vqs in config space for years?????? What's wrong with this mechanism?
>>>> Are you questioning virito-pci fundamentals???
>>> Don’t point to in-efficient past to establish similar in-efficient future.
>> interesting, you know this is a one-time thing, right?
>> and you are aware of this has been there for years.
>>>>>>>> Like how to set a queue size and enable it?
>>>>>>> Those are meant to be used before DRIVER_OK stage as they are init
>>>>>>> time
>>>>>> registers.
>>>>>>> Not to keep abusing them..
>>>>>> don't you need to set queue_size at the destination side?
>>>>> No.
>>>>> But the src/dst does not matter.
>>>>> Queue_size to be set before DRIVER_OK like rest of the registers, as
>>>>> all
>>>> queues must be created before the driver_ok phase.
>>>>> Queue_reset was last moment exception.
>>>> create a queue? Nvidia specific?
>>>>
>>> Huh. No.
>>> Do git log and realize what happened with queue_reset.
>> You didn't answer the question, does the spec even has defined "create a vq"?
> Enabled/created = tomato/tomato when discussing the spec in non-normative email conversation.
> It's irrelevant.
Then let's not debate "enable a vq" versus "create a vq" any more.
> All I am saying is, when we know the limitations of the transport and when industry is forwarding to not introduced more and more on-die register for once in lifetime work of device migration,
> we just use the optimal command and queue interface that is native to virtio.
PCI config space has its own limitations, and the admin vq has its
advantages, but those do not apply to all use cases.

I don't want to repeat why I don't think the admin vq is a good idea for
migration; we have already discussed that.
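
For reference, the register access pattern being debated can be sketched
with a small Python model of the common configuration structure. The field
names queue_select, queue_avail_state, queue_used_state and num_queues come
from the spec and this patch; the CommonCfg class, its access counter and
the two helper functions are hypothetical and only illustrate the pattern.

```python
class CommonCfg:
    """Toy model of the per-queue window in the PCI common config."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.queue_select = 0
        self.register_accesses = 0         # counts every read or write
        self._avail_state = [0] * num_queues
        self._used_state = [0] * num_queues

    def write_queue_select(self, index):
        self.register_accesses += 1
        self.queue_select = index

    def read_queue_avail_state(self):
        self.register_accesses += 1
        return self._avail_state[self.queue_select]

    def read_queue_used_state(self):
        self.register_accesses += 1
        return self._used_state[self.queue_select]

    def write_queue_avail_state(self, value):
        self.register_accesses += 1
        self._avail_state[self.queue_select] = value & 0xFFFF   # le16 field

    def write_queue_used_state(self, value):
        self.register_accesses += 1
        self._used_state[self.queue_select] = value & 0xFFFF    # le16 field


def save_vq_state(cfg):
    """Read (avail, used) state of every queue, one queue at a time."""
    state = []
    for q in range(cfg.num_queues):
        cfg.write_queue_select(q)
        state.append((cfg.read_queue_avail_state(),
                      cfg.read_queue_used_state()))
    return state


def restore_vq_state(cfg, state):
    """Write previously saved state back, again one queue at a time."""
    for q, (avail, used) in enumerate(state):
        cfg.write_queue_select(q)
        cfg.write_queue_avail_state(avail)
        cfg.write_queue_used_state(used)


source = CommonCfg(num_queues=128)
saved = save_vq_state(source)
destination = CommonCfg(num_queues=128)
restore_vq_state(destination, saved)
```

In this model each queue costs three serialized register accesses per save
or restore, so 128 queues need 3 * 128 = 384 accesses. Whether that cost
matters at suspend and resume time is the point on which this thread
disagrees.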
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09  9:53                         ` Michael S. Tsirkin
@ 2023-11-09 10:11                           ` Parav Pandit
  0 siblings, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-09 10:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Zhu, Lingshan, jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Thursday, November 9, 2023 3:23 PM
> To: Parav Pandit <parav@nvidia.com>
> Cc: Zhu, Lingshan <lingshan.zhu@intel.com>; jasowang@redhat.com;
> eperezma@redhat.com; cohuck@redhat.com; stefanha@redhat.com; virtio-
> comment@lists.oasis-open.org
> Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement
> VIRTIO_F_QUEUE_STATE
> 
> On Thu, Nov 09, 2023 at 09:10:55AM +0000, Parav Pandit wrote:
> >
> > > From: Michael S. Tsirkin <mst@redhat.com>
> > > Sent: Thursday, November 9, 2023 2:12 PM
> > >
> > > On Thu, Nov 09, 2023 at 06:28:17AM +0000, Parav Pandit wrote:
> > > > > >> The registers are control path in config space, how 400G or 800G
> affect??
> > > > > > Because those are the one in practice requires large number of VQs.
> > > > > >
> > > > > > You are asking per VQ register commands to modify things
> > > > > > dynamically via
> > > > > this one vq at a time, serializing all the operations.
> > > > > > It does not scale well with high q count.
> > > > > This is not dynamically, it only happens when SUSPEND and RESUME.
> > > > > This is the same mechanism how virtio initialize a virtqueue,
> > > > > working for many years.
> > > > No. when virtio driver initializes it for the first time, there is
> > > > no active traffic
> > > that gets lost.
> > > > This is because the interface is not yet up and not part of the network yet.
> > > >
> > > > The resume must be fast enough, because the remote node is sending
> > > packets.
> > > > Hence it is different from driver init time queue enable.
> > >
> > > Maybe but I think not qualitatively different.
> > > If you care about these things please provide some estimates.
> > > I just don't see how queue resume writes even with 64k queues will
> > > saturate an express link.
> > >
> > It is not the bw of the link.
> > It is how the need for the device to be always read to suspend those many
> queues through register interface.
> 
> read->ready?
> 
> If we want to we can solve it btw. Prohibit changing queue select while suspend
> is in progress.
> 
> But more importantly once Zhu Lingshan stops arguing about suspend bit he'll
> hopefully work on transport vq. Then we can have it in config space and at the
> same time not use up a register.
A transport vq on the owner device does not make sense in the long term, as the hypervisor should not be involved in viewing this content once TDISP is in use.
So a cvq, or whatever we want to call the vq, on the member device will be just perfect.
It will behave like any other queue.
It will be uniformly available for VF, SIOV and PF devices, with or without TDISP.
This solves both the scale issue and the security issue.

Hence 3 solutions,
1. transport vq for VF using owner device for purpose of device migration
2. not using it and inventing another VQ for TDISP in future and for PF, VF, SIOV and TDISP is overkill.

All the operations of the live migration driver, such as suspending the device, seem doable just fine using the owner device.
What is missing?

> 
> 
> > >
> > > > > >> See the virtio common cfg, you will find the max number of
> > > > > >> vqs is there, num_queues.
> > > > > > :)
> > > > > > Sure. those values at high q count affects.
> > > > > the driver need to initialize them anyway.
> > > > That is before the traffic starts from remote end.
> > > >
> > > > > >
> > > > > >>> Device didn’t support LM.
> > > > > >>> Many limitations existed all these years and TC is improving
> > > > > >>> and expanding
> > > > > >> them.
> > > > > >>> So all these years do not matter.
> > > > > >> Not sure what are you talking about, haven't we initialize
> > > > > >> the device and vqs in config space for years?????? What's
> > > > > >> wrong with this
> > > mechanism?
> > > > > >> Are you questioning virito-pci fundamentals???
> > > > > > Don’t point to in-efficient past to establish similar in-efficient future.
> > > > > interesting, you know this is a one-time thing, right?
> > > > > and you are aware of this has been there for years.
> > > > > >
> > > > > >>>>>> Like how to set a queue size and enable it?
> > > > > >>>>> Those are meant to be used before DRIVER_OK stage as they
> > > > > >>>>> are init time
> > > > > >>>> registers.
> > > > > >>>>> Not to keep abusing them..
> > > > > >>>> don't you need to set queue_size at the destination side?
> > > > > >>> No.
> > > > > >>> But the src/dst does not matter.
> > > > > >>> Queue_size to be set before DRIVER_OK like rest of the
> > > > > >>> registers, as all
> > > > > >> queues must be created before the driver_ok phase.
> > > > > >>> Queue_reset was last moment exception.
> > > > > >> create a queue? Nvidia specific?
> > > > > >>
> > > > > > Huh. No.
> > > > > > Do git log and realize what happened with queue_reset.
> > > > > You didn't answer the question, does the spec even has defined
> > > > > "create a
> > > vq"?
> > > >
> > > > Enabled/created = tomato/tomato when discussing the spec in
> > > > non-normative
> > > email conversation.
> > > > It's irrelevant.
> > > >
> > > > All I am saying is, when we know the limitations of the transport
> > > > and when industry is forwarding to not introduced more and more
> > > > on-die register
> > > for once in lifetime work of device migration, we just use the
> > > optimal command and queue interface that is native to virtio.
> > >
> > > we really do not need to prematurely optimize all things.
> > > control path is control path it is going to be slow because virtio
> > > designed it to be slow and drivers don't optimize it.
> > > Shaving off a microsecond here or there is going to do nothing
> > > except increase maintainance costs.
> >
> > Control path post device initialization for large part is through the cvq.
> 
> Really depends on the device. For virtio net most of the time is spent on filling
> up receive queues and sending network announcements and such.

Right about the time. But the issue is not the time spent.
The issue is the demand for those registers, which are of no use during the large part of the time spent filling receive queues.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-09  9:58         ` Zhu, Lingshan
@ 2023-11-09 10:15           ` Parav Pandit
  2023-11-10  6:22             ` [virtio-comment] " Zhu, Lingshan
  2023-11-13  3:34             ` [virtio-comment] " Jason Wang
  0 siblings, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-09 10:15 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, November 9, 2023 3:28 PM
> 
> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> >>
> >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> >>>> When SUSPEND is set, device states and virtqueue states should be
> >>>> stablized, therefore the driver should not reset vqs when SUSPEND
> >>>> is set in device status.
> >>>>
> >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>> ---
> >>>>    content.tex | 3 +++
> >>>>    1 file changed, 3 insertions(+)
> >>>>
> >>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> >>>> 100644
> >>>> --- a/content.tex
> >>>> +++ b/content.tex
> >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic
> Facilities of a Virtio Device /
> >>>>    The device MUST reset any state of a virtqueue to the default state,
> >>>>    including the available state and the used state.
> >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> >>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> >>>> +
> >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a
> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> >>>>    After the driver tells the device to reset a queue, the driver
> >>>> MUST verify that
> >>> Seems somewhat arbitrary and breaks the claim that the feature is
> >>> orthogonal and can have uses besides migration.
> >> when suspended, the device is frozen.
> >> The driver is aware of this process and so should not reset the vqs I think.
> > Again that is only true because you want to use it for migration.
> > But then you can't claim it's a generic facility.
> I don't get it. The device status is a basic facility.
> 
> We need to SUSPEND the device by setting SUSPEND bit, to stabilize the device
> states for migration.
Is the PCI PM time not enough to suspend the device?
For a large device I could imagine it falling short.

In that case, if a device suspend facility is available, it will be used by the guest driver itself; the hypervisor wouldn't know about it when those registers are not trapped.
So we need two ways to suspend.
One is guest visible and guest controlled.
The second is hypervisor controlled, to fulfill the device migration needs.

So please take a look at whether the proposed admin command for freeze/stop mode can be used in the emulated register case.
It helps to have the suspend bit under guest control as well, with or without emulation mode.

> This can also be used for debugging I think.

As Michael listed, a dedicated debug interface is usually more useful than an in-band one.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09 10:09                     ` Zhu, Lingshan
@ 2023-11-09 10:25                       ` Parav Pandit
  2023-11-10  7:52                         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-09 10:25 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, November 9, 2023 3:39 PM
> 
> 
> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Tuesday, November 7, 2023 3:02 PM
> >>
> >> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>
> >>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>
> >>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> >>>>>>>> Lingshan
> >>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>
> >>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>
> >>>>>>>>>> This patch adds two new le16 fields to common configuration
> >>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> ---
> >>>>>>>>>>       transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>       1 file changed, 18 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>>>>>>>> a5c6719..3161519 100644
> >>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> >>>>>> structure
> >>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>               /* About the administration virtqueue. */
> >>>>>>>>>>               le16 admin_queue_index;         /* read-only for driver */
> >>>>>>>>>>               le16 admin_queue_num;         /* read-only for driver */
> >>>>>>>>>> +
> >>>>>>>>>> +	/* Virtqueue state */
> >>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>>>> This tiny interface for 128 virtio net queues through register
> >>>>>>>>> read writes, does
> >>>>>>>> not work effectively.
> >>>>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>> Do you know there is a queue_select? Why this does not work? Do
> >>>>>>>> you know how other queue related fields work?
> >>>>>>> :)
> >>>>>>> Yes. If you notice queue_reset related critical spec bug fix was
> >>>>>>> done when it
> >>>>>> was introduced so that live migration can _actually_ work.
> >>>>>>> When queue_select is done for 128 queues serially, it take a lot
> >>>>>>> of time to
> >>>>>> read those slow register interface for this + inflight descriptors + more.
> >>>>>> interesting, virtio work in this pattern for many years, right?
> >>>>> All these years 400Gbps and 800Gbps virtio was not present, number
> >>>>> of
> >>>> queues were not in hw.
> >>>> The registers are control path in config space, how 400G or 800G affect??
> >>> Because those are the one in practice requires large number of VQs.
> >>>
> >>> You are asking per VQ register commands to modify things dynamically
> >>> via
> >> this one vq at a time, serializing all the operations.
> >>> It does not scale well with high q count.
> >> This is not dynamically, it only happens when SUSPEND and RESUME.
> >> This is the same mechanism how virtio initialize a virtqueue, working
> >> for many years.
> > No. when virtio driver initializes it for the first time, there is no active traffic
> that gets lost.
> > This is because the interface is not yet up and not part of the network yet.
> >
> > The resume must be fast enough, because the remote node is sending
> packets.
> > Hence it is different from driver init time queue enable.
> I am not sure any packets arrive before a link announce at the destination side.
I think they can.
There is no link-down notification for the member device sent to the remote side.
The L4 and L5 protocols have no knowledge that the node they are interacting with sits behind some layers of switches.

So keeping this time low is desired.
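
The serialization concern about per-queue registers can be made concrete with a small toy model. This is only a sketch, not code from the patch or from any driver: the class ToyCommonCfg and its methods are hypothetical, and only the field names queue_select, queue_avail_state and queue_used_state come from the proposal under discussion. It counts register accesses and says nothing about how long each access takes on real hardware.

```python
class ToyCommonCfg:
    """Toy model of per-queue state access through a queue_select window."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.queue_select = 0
        self.accesses = 0  # register reads and writes issued so far
        # Per-queue (avail_state, used_state); the values are placeholders.
        self._state = [(q, q) for q in range(num_queues)]

    def write_queue_select(self, index):
        self.accesses += 1
        self.queue_select = index

    def read_queue_avail_state(self):
        self.accesses += 1
        return self._state[self.queue_select][0]

    def read_queue_used_state(self):
        self.accesses += 1
        return self._state[self.queue_select][1]


def save_all_queue_state(cfg):
    """Read the state of every queue, one queue at a time."""
    saved = []
    for q in range(cfg.num_queues):
        cfg.write_queue_select(q)  # select the queue first
        saved.append((cfg.read_queue_avail_state(),
                      cfg.read_queue_used_state()))
    return saved


cfg = ToyCommonCfg(num_queues=128)
saved = save_all_queue_state(cfg)
print(len(saved), cfg.accesses)  # prints "128 384"
```

In this model the access count grows linearly with the queue count (three accesses per queue), and the accesses cannot overlap because they share one queue_select window. Whether that cost matters in practice is exactly what is being debated here.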

> >
> >>>> See the virtio common cfg, you will find the max number of vqs is
> >>>> there, num_queues.
> >>> :)
> >>> Sure. those values at high q count affects.
> >> the driver need to initialize them anyway.
> > That is before the traffic starts from remote end.
> see above, that needs a link announce and this is after re-initialization
> >
> >>>>> Device didn’t support LM.
> >>>>> Many limitations existed all these years and TC is improving and
> >>>>> expanding
> >>>> them.
> >>>>> So all these years do not matter.
> >>>> Not sure what are you talking about, haven't we initialize the
> >>>> device and vqs in config space for years?????? What's wrong with this
> mechanism?
> >>>> Are you questioning virito-pci fundamentals???
> >>> Don’t point to in-efficient past to establish similar in-efficient future.
> >> interesting, you know this is a one-time thing, right?
> >> and you are aware of this has been there for years.
> >>>>>>>> Like how to set a queue size and enable it?
> >>>>>>> Those are meant to be used before DRIVER_OK stage as they are
> >>>>>>> init time
> >>>>>> registers.
> >>>>>>> Not to keep abusing them..
> >>>>>> don't you need to set queue_size at the destination side?
> >>>>> No.
> >>>>> But the src/dst does not matter.
> >>>>> Queue_size to be set before DRIVER_OK like rest of the registers,
> >>>>> as all
> >>>> queues must be created before the driver_ok phase.
> >>>>> Queue_reset was last moment exception.
> >>>> create a queue? Nvidia specific?
> >>>>
> >>> Huh. No.
> >>> Do git log and realize what happened with queue_reset.
> >> You didn't answer the question, does the spec even has defined "create a
> vq"?
> > Enabled/created = tomato/tomato when discussing the spec in non-normative
> email conversation.
> > It's irrelevant.
> Then lets not debate on this enable a vq or create a vq anymore
> > All I am saying is, when we know the limitations of the transport and
> > when industry is forwarding to not introduced more and more on-die register
> for once in lifetime work of device migration, we just use the optimal command
> and queue interface that is native to virtio.
> PCI config space has its own limitations, and admin vq has its advantages, but
> that does not apply to all use cases.
> 
There was recent work on emulating the SR-IOV capability and allowing a VM to enable SR-IOV [1].
This is the option I mentioned a few weeks ago.

So with admin commands and admin virtqueues, even the nested model will work using [1].

[1] https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html

> I don't want to repeat why I don't think admin vq is a good idea for migration
> again, we have already discussed on that.
> >


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-08 17:18                   ` Michael S. Tsirkin
@ 2023-11-09 10:29                     ` Zhu, Lingshan
  2023-11-09 10:41                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-09 10:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/9/2023 1:18 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 08, 2023 at 05:29:00PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/7/2023 7:13 PM, Michael S. Tsirkin wrote:
>>> On Mon, Nov 06, 2023 at 04:03:42AM +0000, Parav Pandit wrote:
>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>> Sent: Sunday, November 5, 2023 9:42 PM
>>>>>
>>>>> On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
>>>>>>>> [1]
>>>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>>>>>> you still need to explain why this does not work for pass-through.
>>>>>> It does not work for following reasons.
>>>>>> 1. Because all the fields that put on the member device are not in direct
>>>>> control of the hypervisor.
>>>>>> The device is directly controlled by the guest including the device status and
>>>>> when it resets the device all the things stored in the device are lost.
>>>>>
>>>>> I think the idea is that when this gateway is in the device then device reset has
>>>>> to trap. At a high level, ok. But then what?
>>>>> Is a full scan of all memory required until device reset is complete?
>>>>> Drivers currently tend to busy poll the reset register, if this takes very long we
>>>>> might start seeing soft lockup messages. What is the idea then? Maybe for this
>>>>> we need a separate weaker reset that does not touch this capability?
>>>>>
>>>> You meant the gateway is not in the device, right?
>>>>
>>>> I likely didn't understand. I don't see a relation to timing.
>>>>
>>>> When the device reset is not trapped by the hypervisor, most things does not work, it requires trapping other things to like cvq, device registers and more.
>>>> It may be fine for those use case, but it does not fullfill the requirement of passthrough mode of hw.
>>> I wish we'd just stop using the term, it just confuses everyone.
>>>
>>> I feel the point worth making is that currently, all this job is done
>>> by hypervisors. And they manage fine! vdpa really truly does not need
>>> the SUSPEND bit because it knows about devices and it
>>> can just use whatever it wants in any vendor specific way it wants.
>> So true, this is exact what Intel implements in some productions.
>>> where all this migration work comes handy is if we say that
>>> we want our device to all just do what the
>>> spec says. No vendor specific tricks. And I find it exciting that
>>> there are people who want to work on this instead of
>>> each vendor wasting man hours on their own almost the same but
>>> slightly different driver.
>> I agree
>>> I personally think this patch is not great for the trap use-case either.
>>> Why? For example if device is somewhat slow then it will take it
>>> hundreds of milliseconds to synchronize the whole guest memory, and
>>> blocking reset means blocking e.g. guest boot.  I was wrong about soft
>>> lockup btw - linux does msleep which I think means no soft lockups. But boot is
>>> blocked and modules are not loaded.
>> I am not sure SUSPEND can block RESET, I think reset can take immediate
>> actions, because
>> once reset, whether suspended does not matter.
> No, because if you don't suspend device will keep changing memory.
> You need to
> 1. suspend
> 2. get all dirty memory synced
> 3. reset
>
>
> Reset earlier will corrupt guest memory.
IMHO, it may be fine to lose the dirty pages during reset:
without an interrupt, the driver won't process the
dirty pages, so the CPU still considers them unused (even if they are
not all-zero pages), and nothing is corrupted.

And if the driver resets the device, it will reinitialize the device
and reconfigure the virtqueues, including the ring buffers.
>
>


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-08 17:19                         ` Michael S. Tsirkin
@ 2023-11-09 10:34                           ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-09 10:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/9/2023 1:19 AM, Michael S. Tsirkin wrote:
> On Wed, Nov 08, 2023 at 05:30:02PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/7/2023 7:33 PM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 07, 2023 at 05:52:41PM +0800, Zhu, Lingshan wrote:
>>>>>>>>> 2. the PCI FLR is clearing all the registers you exposed here.
>>>>>>>> see above
>>>>>>>>> 3. Endless expansion of config registers of dirty tracking is not
>>>>>>>>> scalable, as they
>>>>>>>> are not init time registers not following the Appendix B guidelines.
>>>>>>>> endless expansion?? It is a complete set of dirty page tracking, right????
>>>>>>>> have you see this cap only controls? The device DMA writes the
>>>>>>>> bitmap, not by registers.
>>>>>>> Device dirty page tracking is start/stop command to be done by the
>>>>>> hypervisor.
>>>>>>> So when guest is resetting the device, it stopped the DMA initiated by the
>>>>>> hypervisor.
>>>>>>> This fundamentally breaks things.
>>>>>> Why? When device resets, do you want to keep tracking dirty pages????
>>>>> Yes, when the device resets, before that event occurred, all the pages which were dirtied, must be migrated.
>>>>> And after reset also new page tracking to continue.
>>>> That depends on whether there is an interrupt for the dirty pages.
>>>> If there is an interrupt, then the guest owns the pages
>>> Not in the virtio model, guest owns the memory once buffer has been used.
>> Yes and even better, interrupt happens after buffers marked as used.
> But guest owns memory earlier and you can not change it after this
> point.
If you mean the guest polls the used_idx, yes that can happen, if so
the guest owns the pages even earlier, and I am not sure whether
we need to change anything.
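
The polling case can be illustrated with a toy model. It is a sketch under stated assumptions, not the virtio ring layout: ToyQueue, device_use_buffer and guest_poll are hypothetical names, and the only thing taken from the discussion is that a buffer counts as used once the used index is published, with or without an interrupt.

```python
class ToyQueue:
    """Toy model: the device publishes used buffers, the guest may poll."""

    def __init__(self):
        self.used_idx = 0            # 16-bit index published by the device
        self.interrupts = 0          # interrupts raised so far
        self.device_written = set()  # pages the device has written

    def device_use_buffer(self, page, interrupt):
        self.device_written.add(page)                 # 1. write the buffer
        self.used_idx = (self.used_idx + 1) & 0xFFFF  # 2. publish it as used
        if interrupt:
            self.interrupts += 1                      # 3. optionally notify


def guest_poll(queue, last_seen):
    """Return how many buffers became used since last_seen (wraps at 16 bits)."""
    return (queue.used_idx - last_seen) & 0xFFFF


q = ToyQueue()
q.device_use_buffer(page=7, interrupt=False)
new = guest_poll(q, last_seen=0)
print(new, q.interrupts)  # prints "1 0"
```

The guest observes one used buffer although no interrupt was raised, so in this model page 7 already belongs to the guest at that point.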

Maybe we don't need to argue about this, because we may not need this
dirty page tracking facility. Platform PML or a shadow virtqueue can work.
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-09 10:29                     ` Zhu, Lingshan
@ 2023-11-09 10:41                       ` Michael S. Tsirkin
  2023-11-10  7:24                         ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-09 10:41 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Thu, Nov 09, 2023 at 06:29:59PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/9/2023 1:18 AM, Michael S. Tsirkin wrote:
> > On Wed, Nov 08, 2023 at 05:29:00PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/7/2023 7:13 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Nov 06, 2023 at 04:03:42AM +0000, Parav Pandit wrote:
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Sunday, November 5, 2023 9:42 PM
> > > > > > 
> > > > > > On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
> > > > > > > > > [1]
> > > > > > > > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
> > > > > > > > you still need to explain why this does not work for pass-through.
> > > > > > > It does not work for following reasons.
> > > > > > > 1. Because all the fields that put on the member device are not in direct
> > > > > > control of the hypervisor.
> > > > > > > The device is directly controlled by the guest including the device status and
> > > > > > when it resets the device all the things stored in the device are lost.
> > > > > > 
> > > > > > I think the idea is that when this gateway is in the device then device reset has
> > > > > > to trap. At a high level, ok. But then what?
> > > > > > Is a full scan of all memory required until device reset is complete?
> > > > > > Drivers currently tend to busy poll the reset register, if this takes very long we
> > > > > > might start seeing soft lockup messages. What is the idea then? Maybe for this
> > > > > > we need a separate weaker reset that does not touch this capability?
> > > > > > 
> > > > > You meant the gateway is not in the device, right?
> > > > > 
> > > > > I likely didn't understand. I don't see a relation to timing.
> > > > > 
> > > > > When the device reset is not trapped by the hypervisor, most things does not work, it requires trapping other things to like cvq, device registers and more.
> > > > > It may be fine for those use case, but it does not fullfill the requirement of passthrough mode of hw.
> > > > I wish we'd just stop using the term, it just confuses everyone.
> > > > 
> > > > I feel the point worth making is that currently, all this job is done
> > > > by hypervisors. And they manage fine! vdpa really truly does not need
> > > > the SUSPEND bit because it knows about devices and it
> > > > can just use whatever it wants in any vendor specific way it wants.
> > > So true, this is exact what Intel implements in some productions.
> > > > where all this migration work comes handy is if we say that
> > > > we want our device to all just do what the
> > > > spec says. No vendor specific tricks. And I find it exciting that
> > > > there are people who want to work on this instead of
> > > > each vendor wasting man hours on their own almost the same but
> > > > slightly different driver.
> > > I agree
> > > > I personally think this patch is not great for the trap use-case either.
> > > > Why? For example if device is somewhat slow then it will take it
> > > > hundreds of milliseconds to synchronize the whole guest memory, and
> > > > blocking reset means blocking e.g. guest boot.  I was wrong about soft
> > > > lockup btw - linux does msleep which I think means no soft lockups. But boot is
> > > > blocked and modules are not loaded.
> > > I am not sure SUSPEND can block RESET, I think reset can take immediate
> > > actions, because
> > > once reset, whether suspended does not matter.
> > No, because if you don't suspend device will keep changing memory.
> > You need to
> > 1. suspend
> > 2. get all dirty memory synced
> > 3. reset
> > 
> > 
> > Reset earlier will corrupt guest memory.
> IMHO, it may be fine to lose the dirty pages during reset,
> because without an interrupt, the driver won't process the
> dirty pages, they are still considered as unused(even not all zero pages)
> by CPU, so nothing corrupted.
> 
> And if the driver resets the device, it will reinitialize the device
> and re-config the virtqueue including the ring buffer.

It's too late to invent new consistency semantics for virtio.
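
For reference, the ordering quoted above (suspend, sync dirty memory, then reset) can be sketched as a toy model. Everything here is hypothetical: ToyDevice and both helper functions are illustration names, and the model assumes, as stated earlier in the thread, that a reset drops all tracking state stored in the device.

```python
class ToyDevice:
    """Toy device that tracks the pages it has written."""

    def __init__(self):
        self.suspended = False
        self.dirty = set()  # pages written by the device since the last sync

    def dma_write(self, page):
        if not self.suspended:  # a suspended device no longer writes memory
            self.dirty.add(page)

    def suspend(self):
        self.suspended = True

    def fetch_dirty(self):
        pages, self.dirty = self.dirty, set()
        return pages

    def reset(self):
        # Assumption: reset drops all tracking state held in the device.
        self.suspended = False
        self.dirty = set()


def reset_with_sync(dev, migrated):
    dev.suspend()                  # 1. suspend
    migrated |= dev.fetch_dirty()  # 2. get all dirty memory synced
    dev.reset()                    # 3. reset


def reset_without_sync(dev, migrated):
    dev.reset()                    # the dirty set is dropped unseen


a, b = ToyDevice(), ToyDevice()
for dev in (a, b):
    for page in (1, 2, 3):
        dev.dma_write(page)

synced, unsynced = set(), set()
reset_with_sync(a, synced)
reset_without_sync(b, unsynced)
print(sorted(synced), sorted(unsynced))  # prints "[1, 2, 3] []"
```

With the three-step ordering the hypervisor side learns about pages 1 to 3 before the reset. With an immediate reset that information is gone, and whether that loss is acceptable is the point under debate.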

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-09 10:15           ` [virtio-comment] " Parav Pandit
@ 2023-11-10  6:22             ` Zhu, Lingshan
  2023-11-10  6:31               ` [virtio-comment] " Parav Pandit
  2023-11-13  3:34             ` [virtio-comment] " Jason Wang
  1 sibling, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-10  6:22 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/9/2023 6:15 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, November 9, 2023 3:28 PM
>>
>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>>>>> When SUSPEND is set, device states and virtqueue states should be
>>>>>> stablized, therefore the driver should not reset vqs when SUSPEND
>>>>>> is set in device status.
>>>>>>
>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>> ---
>>>>>>     content.tex | 3 +++
>>>>>>     1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
>>>>>> 100644
>>>>>> --- a/content.tex
>>>>>> +++ b/content.tex
>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic
>> Facilities of a Virtio Device /
>>>>>>     The device MUST reset any state of a virtqueue to the default state,
>>>>>>     including the available state and the used state.
>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>>>>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
>>>>>> +
>>>>>>     \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a
>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>>>>>     After the driver tells the device to reset a queue, the driver
>>>>>> MUST verify that
>>>>> Seems somewhat arbitrary and breaks the claim that the feature is
>>>>> orthogonal and can have uses besides migration.
>>>> when suspended, the device is frozen.
>>>> The driver is aware of this process and so should not reset the vqs I think.
>>> Again that is only true because you want to use it for migration.
>>> But then you can't claim it's a generic facility.
>> I don't get it. The device status is a basic facility.
>>
>> We need to SUSPEND the device by setting SUSPEND bit, to stabilize the device
>> states for migration.
> Is the PCI's PM time not enough to suspend the device?
> For large device I could imagine it could be short.
As you say, that is PCI PM, so this is a layer violation. virtio should be
self-contained, and what about MMIO and CCW?
This should be a basic facility.
>
> In that case if there is suspend the device available, it will be used by the guest driver itself, hypervisor wouldn’t know about it when those registers are not trapped.
> So we need two ways to suspend.
> One is guest visible, and guest controlled.
> Second is hypervisor control to fulfill the device migration needs.
The guest can even reset the device.
>
> So if you can please take a look if the proposed admin command to freeze/stop mode can be used in the emulated register case or not.
> It helps to have the suspend bit in guest control as well with/without emulation mode.
Parav, please believe me that I have read your series. I didn't comment there
because I want to avoid further conflict and debate; we have had enough of both.

As explained before, freezing/stopping the device through PCI is a layer violation.

And the device status can be passed through to the guest (without emulation,
just mapped to the guest) or trapped (trapped and emulated by the hypervisor,
for example set_status in vDPA).
>
>> This can also be used for debugging I think.
> As Michael listed, a dedicated debug interface is usually more useful instead of in-band.
Reusing another facility without extra effort is not a bad thing anyway.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-10  6:22             ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-10  6:31               ` Parav Pandit
  2023-11-13  9:23                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-10  6:31 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 10, 2023 11:52 AM
> 
> On 11/9/2023 6:15 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, November 9, 2023 3:28 PM
> >>
> >> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> >>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> >>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> >>>>>> When SUSPEND is set, device states and virtqueue states should be
> >>>>>> stabilized, therefore the driver should not reset vqs when SUSPEND
> >>>>>> is set in device status.
> >>>>>>
> >>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>> ---
> >>>>>>     content.tex | 3 +++
> >>>>>>     1 file changed, 3 insertions(+)
> >>>>>>
> >>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> >>>>>> 100644
> >>>>>> --- a/content.tex
> >>>>>> +++ b/content.tex
> >>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> >>>>>> Reset}\label{sec:Basic
> >> Facilities of a Virtio Device /
> >>>>>>     The device MUST reset any state of a virtqueue to the default state,
> >>>>>>     including the available state and the used state.
> >>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> >>>>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> >>>>>> +
> >>>>>>     \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> >>>>>> Facilities of a
> >> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> >>>>>>     After the driver tells the device to reset a queue, the
> >>>>>> driver MUST verify that
> >>>>> Seems somewhat arbitrary and breaks the claim that the feature is
> >>>>> orthogonal and can have uses besides migration.
> >>>> when suspended, the device is frozen.
> >>>> The driver is aware of this process and so should not reset the vqs I think.
> >>> Again that is only true because you want to use it for migration.
> >>> But then you can't claim it's a generic facility.
> >> I don't get it. The device status is a basic facility.
> >>
> >> We need to SUSPEND the device by setting SUSPEND bit, to stabilize
> >> the device states for migration.
> > Is the PCI's PM time not enough to suspend the device?
> > For large device I could imagine it could be short.
> As you see, PCI PM, so this is a layer violation, virtio should be self contained,

If you think it is a layer violation, then the suspend bit is certainly not needed. The PCI PM interface should suspend/resume the device on D0<->D3 state transitions.

> and what about MMIO and CCW?

They have largely lacked the richness of the PCI transport, so those transports need to evolve.
Otherwise, PCI offers richer transport facilities than MMIO, so it will continue to see wider use.

> This should be a basic facility.
Other transports can also offer this, as PCI does.

> >
> > In that case if there is suspend the device available, it will be used by the
> guest driver itself, hypervisor wouldn’t know about it when those registers are
> not trapped.
> > So we need two ways to suspend.
> > One is guest visible, and guest controlled.
> > Second is hypervisor control to fulfill the device migration needs.
> The guest can even reset the device.
> >
> > So if you can please take a look if the proposed admin command to
> freeze/stop mode can be used in the emulated register case or not.
> > It helps to have the suspend bit in guest control as well with/without
> emulation mode.
> Parav, please believe I have read your series, I didn't comment there because I
> want to avoid further conflicts/debating, we have done these enough.
> 
I believe the series posted in v3 can support the vdpa use case as well.
So I will go ahead and post v4.

> As explained before, freeze/stop the device by PCI is a layer violation.
I am afraid we have different visions.
I don’t see any layer violation.
Suspend in PCI PM is enough.
Our vision is more aligned with the rest of the hypervisor knobs that own the migration framework.

> 
> And device status can be pass-through(without emulation, just map it to
> guest) to the guest or trapped(trap and emulate by the hypervisor, for example
> set_status in vDPA).
When it is passed through, it is controlled by the guest, so, for example, if the guest resets the device, the hypervisor loses control of the migration context etc.
Hence, the hypervisor needs a channel that is not guest-owned.

The same channel can work when trap+emulation is done.

> >
> >> This can also be used for debugging I think.
> > As Michael listed, a dedicated debug interface is usually more useful instead
> of in-band.
> re-using another facility without extra efforts is not a bad thing anyway.

I just don’t see how a suspend bit is a debug feature.
In that regard, almost everything is a debug feature to me.

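For readers following the thread, the rule proposed in patch 3/6 ("If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in device status, the driver SHOULD NOT reset any virtqueues") can be sketched as a driver-side check. This is only an illustrative sketch: the helper name and the numeric value chosen for the SUSPEND bit are assumptions and are not taken from the patch; only the DRIVER_OK value is the existing device status bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_DRIVER_OK 0x04 /* existing device status bit */
#define EXAMPLE_STATUS_SUSPEND    0x10 /* placeholder value, NOT from the patch */

/*
 * Hypothetical helper: returns false when the proposed rule says the
 * driver should not reset a virtqueue, i.e. the feature was negotiated
 * and the SUSPEND bit is currently set in device status.
 */
bool driver_may_reset_vq(bool suspend_negotiated, uint8_t device_status)
{
    if (suspend_negotiated && (device_status & EXAMPLE_STATUS_SUSPEND))
        return false;
    return true;
}
```

Since the patch text uses SHOULD NOT rather than MUST NOT, a check like this is guidance for a well-behaved driver; it does not address the separate question raised above of who owns the device status register in the pass-through case.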
^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09 10:02                       ` Michael S. Tsirkin
@ 2023-11-10  6:52                         ` Zhu, Lingshan
  2023-11-10 12:31                           ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-10  6:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/9/2023 6:02 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 09, 2023 at 06:00:27PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>
>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>
>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>
>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>
>>>>>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>        1 file changed, 18 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>>>>>> structure
>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>                /* About the administration virtqueue. */
>>>>>>>>>>>>                le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>>>>>                le16 admin_queue_num;         /* read-only for driver */
>>>>>>>>>>>> +
>>>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>> This tiny interface for 128 virtio net queues through register
>>>>>>>>>>> read writes, does
>>>>>>>>>> not work effectively.
>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>> Do you know there is a queue_select? Why this does not work? Do you
>>>>>>>>>> know how other queue related fields work?
>>>>>>>>> :)
>>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix was
>>>>>>>>> done when it
>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>> When queue_select is done for 128 queues serially, it take a lot of
>>>>>>>>> time to
>>>>>>>> read those slow register interface for this + inflight descriptors + more.
>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>> All these years 400Gbps and 800Gbps virtio was not present, number of
>>>>>> queues were not in hw.
>>>>>> The registers are control path in config space, how 400G or 800G affect??
>>>>> Because those are the one in practice requires large number of VQs.
>>>>>
>>>>> You are asking per VQ register commands to modify things dynamically via this one vq at a time, serializing all the operations.
>>>>> It does not scale well with high q count.
>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>> This is the same mechanism how virtio initialize a virtqueue, working for
>>>> many years.
>>> I wish we just had a transport vq already. That's the way to solve this
>>> not fighting individual bits.
>> Yeah, I agree, transport is a queued task(sent out V4 months ago...), one by
>> one... hard and tough work...
> Frankly I think that should take precedence, then Parav will not get
> annoyed each time add a couple of registers.
I agree. Things happen, and we are already here.
>
>>>>>> See the virtio common cfg, you will find the max number of vqs is there,
>>>>>> num_queues.
>>>>> :)
>>>>> Sure. those values at high q count affects.
>>>> the driver need to initialize them anyway.
>>>>>>> Device didn’t support LM.
>>>>>>> Many limitations existed all these years and TC is improving and expanding
>>>>>> them.
>>>>>>> So all these years do not matter.
>>>>>> Not sure what are you talking about, haven't we initialize the device and vqs in
>>>>>> config space for years?????? What's wrong with this mechanism?
>>>>>> Are you questioning virtio-pci fundamentals???
>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>> interesting, you know this is a one-time thing, right?
>>>> and you are aware of this has been there for years.
>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are init
>>>>>>>>> time
>>>>>>>> registers.
>>>>>>>>> Not to keep abusing them..
>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>> No.
>>>>>>> But the src/dst does not matter.
>>>>>>> Queue_size to be set before DRIVER_OK like rest of the registers, as all
>>>>>> queues must be created before the driver_ok phase.
>>>>>>> Queue_reset was last moment exception.
>>>>>> create a queue? Nvidia specific?
>>>>>>
>>>>> Huh. No.
>>>>> Do git log and realize what happened with queue_reset.
>>>> You didn't answer the question, does the spec even has defined "create a
>>>> vq"?
>>>>>> For standard virtio, you need to read the number of enabled vqs at the source
>>>>>> side, then enable them at the dst, so queue_size matters, not to create.
>>>>> All that happens in the pre-copy phase.
>>>> Yes and how your answer related to this discussion?
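For readers following the thread, the access pattern being debated (select one queue with queue_select, then read or write queue_avail_state and queue_used_state, one queue at a time) can be sketched as below. This is a minimal model under stated assumptions: the mock structure, the helper names and the access counter are hypothetical, the device is simulated in plain memory, and le16 byte-order handling is omitted. Only the field names queue_select, queue_avail_state and queue_used_state and the 128-queue figure come from the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_QUEUES 128 /* queue count used as the example in the thread */

/* Hypothetical in-memory stand-in for the device side of the registers. */
struct mock_common_cfg {
    uint16_t queue_select;
    uint16_t avail_state[NUM_QUEUES]; /* backs queue_avail_state */
    uint16_t used_state[NUM_QUEUES];  /* backs queue_used_state */
    unsigned long reg_accesses;       /* counts every register read/write */
};

struct vq_state {
    uint16_t avail;
    uint16_t used;
};

void cfg_select(struct mock_common_cfg *c, uint16_t q)
{
    assert(q < NUM_QUEUES);
    c->queue_select = q;
    c->reg_accesses++;
}

uint16_t cfg_read_avail(struct mock_common_cfg *c)
{
    c->reg_accesses++;
    return c->avail_state[c->queue_select];
}

uint16_t cfg_read_used(struct mock_common_cfg *c)
{
    c->reg_accesses++;
    return c->used_state[c->queue_select];
}

void cfg_write_avail(struct mock_common_cfg *c, uint16_t v)
{
    c->reg_accesses++;
    c->avail_state[c->queue_select] = v;
}

void cfg_write_used(struct mock_common_cfg *c, uint16_t v)
{
    c->reg_accesses++;
    c->used_state[c->queue_select] = v;
}

/* Source side: read every queue's state serially, three accesses per queue. */
void save_all_vq_state(struct mock_common_cfg *c, struct vq_state *out, size_t n)
{
    assert(n <= NUM_QUEUES);
    for (size_t q = 0; q < n; q++) {
        cfg_select(c, (uint16_t)q);
        out[q].avail = cfg_read_avail(c);
        out[q].used = cfg_read_used(c);
    }
}

/* Destination side: write the saved state back, again one queue at a time. */
void restore_all_vq_state(struct mock_common_cfg *c, const struct vq_state *in, size_t n)
{
    assert(n <= NUM_QUEUES);
    for (size_t q = 0; q < n; q++) {
        cfg_select(c, (uint16_t)q);
        cfg_write_avail(c, in[q].avail);
        cfg_write_used(c, in[q].used);
    }
}
```

In this model each side costs three register accesses per queue, so the work grows linearly with the queue count. Whether that cost matters for a one-time suspend/resume is exactly what the two sides disagree on; the sketch does not model register latency or inflight descriptors.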




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 6/6] virtio-pci: implement dirty page tracking
  2023-11-09 10:41                       ` Michael S. Tsirkin
@ 2023-11-10  7:24                         ` Zhu, Lingshan
  0 siblings, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-10  7:24 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/9/2023 6:41 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 09, 2023 at 06:29:59PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/9/2023 1:18 AM, Michael S. Tsirkin wrote:
>>> On Wed, Nov 08, 2023 at 05:29:00PM +0800, Zhu, Lingshan wrote:
>>>> On 11/7/2023 7:13 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Nov 06, 2023 at 04:03:42AM +0000, Parav Pandit wrote:
>>>>>>> From: Michael S. Tsirkin <mst@redhat.com>
>>>>>>> Sent: Sunday, November 5, 2023 9:42 PM
>>>>>>>
>>>>>>> On Fri, Nov 03, 2023 at 03:47:34PM +0000, Parav Pandit wrote:
>>>>>>>>>> [1]
>>>>>>>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00475.html
>>>>>>>>> you still need to explain why this does not work for pass-through.
>>>>>>>> It does not work for following reasons.
>>>>>>>> 1. Because all the fields that put on the member device are not in direct
>>>>>>> control of the hypervisor.
>>>>>>>> The device is directly controlled by the guest including the device status and
>>>>>>> when it resets the device all the things stored in the device are lost.
>>>>>>>
>>>>>>> I think the idea is that when this gateway is in the device then device reset has
>>>>>>> to trap. At a high level, ok. But then what?
>>>>>>> Is a full scan of all memory required until device reset is complete?
>>>>>>> Drivers currently tend to busy poll the reset register, if this takes very long we
>>>>>>> might start seeing soft lockup messages. What is the idea then? Maybe for this
>>>>>>> we need a separate weaker reset that does not touch this capability?
>>>>>>>
>>>>>> You meant the gateway is not in the device, right?
>>>>>>
>>>>>> I likely didn't understand. I don't see a relation to timing.
>>>>>>
>>>>>> When the device reset is not trapped by the hypervisor, most things do not work; it requires trapping other things too, like cvq, device registers and more.
>>>>>> It may be fine for those use cases, but it does not fulfill the requirement of passthrough mode of hw.
>>>>> I wish we'd just stop using the term, it just confuses everyone.
>>>>>
>>>>> I feel the point worth making is that currently, all this job is done
>>>>> by hypervisors. And they manage fine! vdpa really truly does not need
>>>>> the SUSPEND bit because it knows about devices and it
>>>>> can just use whatever it wants in any vendor specific way it wants.
>>>> So true, this is exact what Intel implements in some productions.
>>>>> where all this migration work comes handy is if we say that
>>>>> we want our device to all just do what the
>>>>> spec says. No vendor specific tricks. And I find it exciting that
>>>>> there are people who want to work on this instead of
>>>>> each vendor wasting man hours on their own almost the same but
>>>>> slightly different driver.
>>>> I agree
>>>>> I personally think this patch is not great for the trap use-case either.
>>>>> Why? For example if device is somewhat slow then it will take it
>>>>> hundreds of milliseconds to synchronize the whole guest memory, and
>>>>> blocking reset means blocking e.g. guest boot.  I was wrong about soft
>>>>> lockup btw - linux does msleep which I think means no soft lockups. But boot is
>>>>> blocked and modules are not loaded.
>>>> I am not sure SUSPEND can block RESET, I think reset can take immediate
>>>> actions, because
>>>> once reset, whether suspended does not matter.
>>> No, because if you don't suspend device will keep changing memory.
>>> You need to
>>> 1. suspend
>>> 2. get all dirty memory synced
>>> 3. reset
>>>
>>>
>>> Reset earlier will corrupt guest memory.
>> IMHO, it may be fine to lose the dirty pages during reset,
>> because without an interrupt, the driver won't process the
>> dirty pages, they are still considered as unused(even not all zero pages)
>> by CPU, so nothing corrupted.
>>
>> And if the driver resets the device, it will reinitialize the device
>> and re-config the virtqueue including the ring buffer.
> It's too late to invent new consistency semantics for virtio.
I think that upon reset, the legacy vring buffer can be considered invalid.
>
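For readers following the thread, the three-step ordering Michael lists (suspend, get all dirty memory synced, reset) can be sketched with a toy model. Everything here is hypothetical: the structure, the function names and the single 64-bit bitmap are illustration only and do not come from the patch. The sketch shows only that a reset issued before the sync discards the device-side dirty information; it takes no position on whether losing those pages is harmful, which is the point under debate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 64

/* Hypothetical device model with a device-side dirty page bitmap. */
struct mock_dev {
    bool suspended;
    uint64_t dirty; /* one bit per guest page written by the device */
};

/* A device DMA write only happens while the device is running. */
bool dev_dma_write(struct mock_dev *d, unsigned page)
{
    assert(page < NPAGES);
    if (d->suspended)
        return false;
    d->dirty |= UINT64_C(1) << page;
    return true;
}

void dev_suspend(struct mock_dev *d)
{
    d->suspended = true;
}

/* Merge device dirty bits into the migration bitmap, then clear them. */
void sync_dirty(struct mock_dev *d, uint64_t *migration_bitmap)
{
    *migration_bitmap |= d->dirty;
    d->dirty = 0;
}

/* Reset returns the device to its default state, dirty tracking included. */
void dev_reset(struct mock_dev *d)
{
    d->suspended = false;
    d->dirty = 0;
}

/* The ordering from the thread: 1. suspend, 2. sync dirty memory, 3. reset. */
void reset_in_order(struct mock_dev *d, uint64_t *migration_bitmap)
{
    dev_suspend(d);
    sync_dirty(d, migration_bitmap);
    dev_reset(d);
}
```

In this model, calling dev_reset() first and sync_dirty() afterwards leaves the migration bitmap empty even though pages were written, whereas reset_in_order() records them. The model also makes the cost visible: the reset cannot complete until the sync has finished, which is the blocking behaviour raised earlier in the thread.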




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-09 10:25                       ` Parav Pandit
@ 2023-11-10  7:52                         ` Zhu, Lingshan
  2023-11-10 12:31                           ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-10  7:52 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/9/2023 6:25 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, November 9, 2023 3:39 PM
>>
>>
>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Tuesday, November 7, 2023 3:02 PM
>>>>
>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>
>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>
>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
>>>>>>>>>> Lingshan
>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>
>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>
>>>>>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>        1 file changed, 18 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>>>>>> structure
>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>                /* About the administration virtqueue. */
>>>>>>>>>>>>                le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>>>>>                le16 admin_queue_num;         /* read-only for driver */
>>>>>>>>>>>> +
>>>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>> This tiny interface for 128 virtio net queues through register
>>>>>>>>>>> read writes, does
>>>>>>>>>> not work effectively.
>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>> Do you know there is a queue_select? Why this does not work? Do
>>>>>>>>>> you know how other queue related fields work?
>>>>>>>>> :)
>>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix was
>>>>>>>>> done when it
>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>> When queue_select is done for 128 queues serially, it take a lot
>>>>>>>>> of time to
>>>>>>>> read those slow register interface for this + inflight descriptors + more.
>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>> All these years 400Gbps and 800Gbps virtio was not present, number
>>>>>>> of
>>>>>> queues were not in hw.
>>>>>> The registers are control path in config space, how 400G or 800G affect??
>>>>> Because those are the one in practice requires large number of VQs.
>>>>>
>>>>> You are asking per VQ register commands to modify things dynamically
>>>>> via
>>>> this one vq at a time, serializing all the operations.
>>>>> It does not scale well with high q count.
>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>> This is the same mechanism how virtio initialize a virtqueue, working
>>>> for many years.
>>> No. when virtio driver initializes it for the first time, there is no active traffic
>> that gets lost.
>>> This is because the interface is not yet up and not part of the network yet.
>>>
>>> The resume must be fast enough, because the remote node is sending
>> packets.
>>> Hence it is different from driver init time queue enable.
>> I am not sure any packets arrive before a link announce at the destination side.
> I think it can.
> Because there is no notification of member device link down intimation to remote side.
> The L4 and L5 protocols have no knowledge that node which they are interacting is behind some layers of switches.
>
> So keeping this time low is desired.
The NIC should broadcast itself first, so that other peers in the
network know how to send a message to it (for example, its MAC, so that
traffic can be routed to it).

This is necessary; see for example VIRTIO_NET_F_GUEST_ANNOUNCE. Similar
mechanisms have worked in production products on the market
for years.

This is off-topic anyway.
>
>>>>>> See the virtio common cfg, you will find the max number of vqs is
>>>>>> there, num_queues.
>>>>> :)
>>>>> Sure. those values at high q count affects.
>>>> the driver need to initialize them anyway.
>>> That is before the traffic starts from remote end.
>> see above, that needs a link announce and this is after re-initialization
>>>>>>> Device didn’t support LM.
>>>>>>> Many limitations existed all these years and TC is improving and
>>>>>>> expanding
>>>>>> them.
>>>>>>> So all these years do not matter.
>>>>>> Not sure what are you talking about, haven't we initialize the
>>>>>> device and vqs in config space for years?????? What's wrong with this
>> mechanism?
>>>>>> Are you questioning virtio-pci fundamentals???
>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>> interesting, you know this is a one-time thing, right?
>>>> and you are aware of this has been there for years.
>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are
>>>>>>>>> init time
>>>>>>>> registers.
>>>>>>>>> Not to keep abusing them..
>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>> No.
>>>>>>> But the src/dst does not matter.
>>>>>>> Queue_size to be set before DRIVER_OK like rest of the registers,
>>>>>>> as all
>>>>>> queues must be created before the driver_ok phase.
>>>>>>> Queue_reset was last moment exception.
>>>>>> create a queue? Nvidia specific?
>>>>>>
>>>>> Huh. No.
>>>>> Do git log and realize what happened with queue_reset.
>>>> You didn't answer the question, does the spec even has defined "create a
>> vq"?
>>> Enabled/created = tomato/tomato when discussing the spec in non-normative
>> email conversation.
>>> It's irrelevant.
>> Then lets not debate on this enable a vq or create a vq anymore
>>> All I am saying is, when we know the limitations of the transport and
>>> when industry is forwarding to not introduced more and more on-die register
>> for once in lifetime work of device migration, we just use the optimal command
>> and queue interface that is native to virtio.
>> PCI config space has its own limitations, and admin vq has its advantages, but
>> that does not apply to all use cases.
>>
> There was a recent work done emulating the SR-IOV cap and allowing VM to enable SR-IOV in [1].
> This is the option I mentioned few weeks ago.
>
> So with admin commands and admin virtqueues, even nested model will work using [1].
>
> [1] https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
We should take this into consideration once it is standardized in the
spec, but maybe not now. There can always be many workarounds to solve
one problem.
>
>> I don't want to repeat why I don't think admin vq is a good idea for migration
>> again, we have already discussed on that.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-10  7:52                         ` Zhu, Lingshan
@ 2023-11-10 12:31                           ` Parav Pandit
  2023-11-13  9:25                             ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-10 12:31 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 10, 2023 1:22 PM
> 
> 
> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, November 9, 2023 3:39 PM
> >>
> >>
> >> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Tuesday, November 7, 2023 3:02 PM
> >>>>
> >>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>>>
> >>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>>>
> >>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
> >>>>>>>>>> Lingshan
> >>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>>>
> >>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> This patch adds two new le16 fields to common configuration
> >>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport
> layer.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>        transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>>>        1 file changed, 18 insertions(+)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>>>>>>>>>> a5c6719..3161519 100644
> >>>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> >>>>>>>> structure
> >>>>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>>>                /* About the administration virtqueue. */
> >>>>>>>>>>>>                le16 admin_queue_index;         /* read-only for driver */
> >>>>>>>>>>>>                le16 admin_queue_num;         /* read-only for driver */
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	/* Virtqueue state */
> >>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>>>>>> This tiny interface for 128 virtio net queues through
> >>>>>>>>>>> register read writes, does
> >>>>>>>>>> not work effectively.
> >>>>>>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>>>> Do you know there is a queue_select? Why this does not work?
> >>>>>>>>>> Do you know how other queue related fields work?
> >>>>>>>>> :)
> >>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
> >>>>>>>>> was done when it
> >>>>>>>> was introduced so that live migration can _actually_ work.
> >>>>>>>>> When queue_select is done for 128 queues serially, it take a
> >>>>>>>>> lot of time to
> >>>>>>>> read those slow register interface for this + inflight descriptors +
> more.
> >>>>>>>> interesting, virtio work in this pattern for many years, right?
> >>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
> >>>>>>> number of
> >>>>>> queues were not in hw.
> >>>>>> The registers are control path in config space, how 400G or 800G
> affect??
> >>>>> Because those are the one in practice requires large number of VQs.
> >>>>>
> >>>>> You are asking per VQ register commands to modify things
> >>>>> dynamically via
> >>>> this one vq at a time, serializing all the operations.
> >>>>> It does not scale well with high q count.
> >>>> This is not dynamically, it only happens when SUSPEND and RESUME.
> >>>> This is the same mechanism how virtio initialize a virtqueue,
> >>>> working for many years.
> >>> No. when virtio driver initializes it for the first time, there is
> >>> no active traffic
> >> that gets lost.
> >>> This is because the interface is not yet up and not part of the network yet.
> >>>
> >>> The resume must be fast enough, because the remote node is sending
> >> packets.
> >>> Hence it is different from driver init time queue enable.
> >> I am not sure any packets arrive before a link announce at the destination
> side.
> > I think it can.
> > Because there is no notification of member device link down intimation to
> remote side.
> > The L4 and L5 protocols have no knowledge that node which they are
> interacting is behind some layers of switches.
> >
> > So keeping this time low is desired.
> The NIC should broad cast itself first, so that other peers in the network
> know(for example its mac to route it) how to send a message to it.
> 
> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
> mechanism work for in-marketing productions for years.
> 
> This is out of the topic anyway.
> >
> >>>>>> See the virtio common cfg, you will find the max number of vqs is
> >>>>>> there, num_queues.
> >>>>> :)
> >>>>> Sure. those values at high q count affects.
> >>>> the driver need to initialize them anyway.
> >>> That is before the traffic starts from remote end.
> >> see above, that needs a link announce and this is after
> >> re-initialization
> >>>>>>> Device didn’t support LM.
> >>>>>>> Many limitations existed all these years and TC is improving and
> >>>>>>> expanding
> >>>>>> them.
> >>>>>>> So all these years do not matter.
> >>>>>> Not sure what are you talking about, haven't we initialize the
> >>>>>> device and vqs in config space for years?????? What's wrong with
> >>>>>> this
> >> mechanism?
> >>>>>> Are you questioning virito-pci fundamentals???
> >>>>> Don’t point to in-efficient past to establish similar in-efficient future.
> >>>> interesting, you know this is a one-time thing, right?
> >>>> and you are aware of this has been there for years.
> >>>>>>>>>> Like how to set a queue size and enable it?
> >>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are
> >>>>>>>>> init time
> >>>>>>>> registers.
> >>>>>>>>> Not to keep abusing them..
> >>>>>>>> don't you need to set queue_size at the destination side?
> >>>>>>> No.
> >>>>>>> But the src/dst does not matter.
> >>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> >>>>>>> registers, as all
> >>>>>> queues must be created before the driver_ok phase.
> >>>>>>> Queue_reset was last moment exception.
> >>>>>> create a queue? Nvidia specific?
> >>>>>>
> >>>>> Huh. No.
> >>>>> Do git log and realize what happened with queue_reset.
> >>>> You didn't answer the question, does the spec even has defined
> >>>> "create a
> >> vq"?
> >>> Enabled/created = tomato/tomato when discussing the spec in
> >>> non-normative
> >> email conversation.
> >>> It's irrelevant.
> >> Then lets not debate on this enable a vq or create a vq anymore
> >>> All I am saying is, when we know the limitations of the transport
> >>> and when industry is forwarding to not introduced more and more
> >>> on-die register
> >> for once in lifetime work of device migration, we just use the
> >> optimal command and queue interface that is native to virtio.
> >> PCI config space has its own limitations, and admin vq has its
> >> advantages, but that does not apply to all use cases.
> >>
> > There was a recent work done emulating the SR-IOV cap and allowing VM to
> enable SR-IOV in [1].
> > This is the option I mentioned few weeks ago.
> >
> > So with admin commands and admin virtqueues, even nested model will work
> using [1].
> >
> > [1]
> > https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-o
> > n-virtual-machines.html
> We should take this into consideration once it is standardized in the spec,
> maybe not now, there can always be many workarounds to solve one problem.
Sure. Until then, admin commands meet the need well.
And if the transport spec does change (if needed), the current admin commands and admin vq will still fit well, following [1] above.

Thanks.
> >
> >> I don't want to repeat why I don't think admin vq is a good idea for
> >> migration again, we have already discussed on that.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-10  6:52                         ` Zhu, Lingshan
@ 2023-11-10 12:31                           ` Parav Pandit
  2023-11-13  3:46                             ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-10 12:31 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 10, 2023 12:23 PM
> 
> 
> On 11/9/2023 6:02 PM, Michael S. Tsirkin wrote:
> > On Thu, Nov 09, 2023 at 06:00:27PM +0800, Zhu, Lingshan wrote:
> >>
> >> On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
> >>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>>>
> >>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>>>
> >>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> >>>>>>>>>> Lingshan
> >>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>>>
> >>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> This patch adds two new le16 fields to common configuration
> >>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport
> layer.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>        transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>>>        1 file changed, 18 insertions(+)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>>>>>>>>>> a5c6719..3161519 100644
> >>>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> >>>>>>>> structure
> >>>>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>>>                /* About the administration virtqueue. */
> >>>>>>>>>>>>                le16 admin_queue_index;         /* read-only for driver
> */
> >>>>>>>>>>>>                le16 admin_queue_num;         /* read-only for driver
> */
> >>>>>>>>>>>> +
> >>>>>>>>>>>> +	/* Virtqueue state */
> >>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>>>>>> This tiny interface for 128 virtio net queues through
> >>>>>>>>>>> register read writes, does
> >>>>>>>>>> not work effectively.
> >>>>>>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>>>> Do you know there is a queue_select? Why this does not work?
> >>>>>>>>>> Do you know how other queue related fields work?
> >>>>>>>>> :)
> >>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
> >>>>>>>>> was done when it
> >>>>>>>> was introduced so that live migration can _actually_ work.
> >>>>>>>>> When queue_select is done for 128 queues serially, it take a
> >>>>>>>>> lot of time to
> >>>>>>>> read those slow register interface for this + inflight descriptors +
> more.
> >>>>>>>> interesting, virtio work in this pattern for many years, right?
> >>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
> >>>>>>> number of
> >>>>>> queues were not in hw.
> >>>>>> The registers are control path in config space, how 400G or 800G
> affect??
> >>>>> Because those are the one in practice requires large number of VQs.
> >>>>>
> >>>>> You are asking per VQ register commands to modify things dynamically
> via this one vq at a time, serializing all the operations.
> >>>>> It does not scale well with high q count.
> >>>> This is not dynamically, it only happens when SUSPEND and RESUME.
> >>>> This is the same mechanism how virtio initialize a virtqueue,
> >>>> working for many years.
> >>> I wish we just had a transport vq already. That's the way to solve
> >>> this not fighting individual bits.
> >> Yeah, I agree, transport is a queued task(sent out V4 months ago...),
> >> one by one... hard and tough work...
> > Frankly I think that should take precedence, then Parav will not get
> > annoyed each time add a couple of registers.
> I agree, things can happen and we are already here..
Unfortunately, a transport vq is not much help, for the fundamental reasons below.

1.1 It involves many VM exits for accessing runtime config space on slow registers.
1.2 Alternatively, the hypervisor ends up polling many thousands of registers, wasting CPU resources.
2. It does not help the future CC use case, where the hypervisor must not be involved in dynamic config.
3. Complex devices like vnet have already stopped using the ever-growing config space and use the cvq instead; a large part of the work in 1.3 has shown that already.
4. PFs also cannot grow registers infinitely; they also need fewer on-die registers.

And who knows, backward-compatible SIOV devices may offer the same BAR as VFs.

The virtio spec has already outlined this efficient concept, and TC members are already following it.

SIOV in non-backward-compatible mode needs a new interface anyway, and the vq is inherently already there to fulfill that need.
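
For illustration only, the serialized per-queue register pattern argued over in this thread can be sketched in C against a simulated device. The field names queue_select, queue_avail_state and queue_used_state come from the quoted patch; everything prefixed demo_ is hypothetical, and the assumption that selecting a queue latches its state into the two fields is a modeling choice, not spec text. The sketch only counts register accesses; how much each access costs (for example, whether it traps to the hypervisor) depends on the deployment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated slice of the common configuration structure (field names from the patch). */
struct demo_common_cfg {
    uint16_t queue_select;
    uint16_t queue_avail_state;
    uint16_t queue_used_state;
};

/* Hypothetical simulated device: backing per-queue state plus an access counter. */
struct demo_counted_dev {
    struct demo_common_cfg cfg;
    const uint16_t *avail;   /* per-queue available state held by the device */
    const uint16_t *used;    /* per-queue used state held by the device */
    uint16_t num_queues;
    size_t accesses;         /* number of register reads and writes issued */
};

/* One register write: select a queue. Assumption: this latches that queue's state. */
static void demo_write_select(struct demo_counted_dev *d, uint16_t q)
{
    assert(q < d->num_queues);
    d->accesses++;
    d->cfg.queue_select = q;
    d->cfg.queue_avail_state = d->avail[q];
    d->cfg.queue_used_state = d->used[q];
}

/* One register read each. */
static uint16_t demo_read_avail(struct demo_counted_dev *d)
{
    d->accesses++;
    return d->cfg.queue_avail_state;
}

static uint16_t demo_read_used(struct demo_counted_dev *d)
{
    d->accesses++;
    return d->cfg.queue_used_state;
}

/* Save the state of every queue, one queue at a time; returns total accesses so far. */
static size_t demo_save_all(struct demo_counted_dev *d,
                            uint16_t *avail_out, uint16_t *used_out)
{
    for (uint16_t q = 0; q < d->num_queues; q++) {
        demo_write_select(d, q);
        avail_out[q] = demo_read_avail(d);
        used_out[q] = demo_read_used(d);
    }
    return d->accesses;
}
```

In this model the access count grows linearly, three accesses per queue, so the 128-queue case mentioned earlier in the thread costs 384 accesses per save.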

> >
> >>>>>> See the virtio common cfg, you will find the max number of vqs is
> >>>>>> there, num_queues.
> >>>>> :)
> >>>>> Sure. those values at high q count affects.
> >>>> the driver need to initialize them anyway.
> >>>>>>> Device didn’t support LM.
> >>>>>>> Many limitations existed all these years and TC is improving and
> >>>>>>> expanding
> >>>>>> them.
> >>>>>>> So all these years do not matter.
> >>>>>> Not sure what are you talking about, haven't we initialize the
> >>>>>> device and vqs in config space for years?????? What's wrong with this
> mechanism?
> >>>>>> Are you questioning virito-pci fundamentals???
> >>>>> Don’t point to in-efficient past to establish similar in-efficient future.
> >>>> interesting, you know this is a one-time thing, right?
> >>>> and you are aware of this has been there for years.
> >>>>>>>>>> Like how to set a queue size and enable it?
> >>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are
> >>>>>>>>> init time
> >>>>>>>> registers.
> >>>>>>>>> Not to keep abusing them..
> >>>>>>>> don't you need to set queue_size at the destination side?
> >>>>>>> No.
> >>>>>>> But the src/dst does not matter.
> >>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> >>>>>>> registers, as all
> >>>>>> queues must be created before the driver_ok phase.
> >>>>>>> Queue_reset was last moment exception.
> >>>>>> create a queue? Nvidia specific?
> >>>>>>
> >>>>> Huh. No.
> >>>>> Do git log and realize what happened with queue_reset.
> >>>> You didn't answer the question, does the spec even has defined
> >>>> "create a vq"?
> >>>>>> For standard virtio, you need to read the number of enabled vqs
> >>>>>> at the source side, then enable them at the dst, so queue_size matters,
> not to create.
> >>>>> All that happens in the pre-copy phase.
> >>>> Yes and how your answer related to this discussion?


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-09 10:15           ` [virtio-comment] " Parav Pandit
  2023-11-10  6:22             ` [virtio-comment] " Zhu, Lingshan
@ 2023-11-13  3:34             ` Jason Wang
  2023-11-15 17:39               ` [virtio-comment] " Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-13  3:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Thursday, November 9, 2023 3:28 PM
> >
> > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> > >>
> > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > >>>> When SUSPEND is set, device states and virtqueue states should be
> > >>>> stablized, therefore the driver should not reset vqs when SUSPEND
> > >>>> is set in device status.
> > >>>>
> > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>> ---
> > >>>>    content.tex | 3 +++
> > >>>>    1 file changed, 3 insertions(+)
> > >>>>
> > >>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> > >>>> 100644
> > >>>> --- a/content.tex
> > >>>> +++ b/content.tex
> > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic
> > Facilities of a Virtio Device /
> > >>>>    The device MUST reset any state of a virtqueue to the default state,
> > >>>>    including the available state and the used state.
> > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> > >>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> > >>>> +
> > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a
> > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > >>>>    After the driver tells the device to reset a queue, the driver
> > >>>> MUST verify that
> > >>> Seems somewhat arbitrary and breaks the claim that the feature is
> > >>> orthogonal and can have uses besides migration.
> > >> when suspended, the device is frozen.
> > >> The driver is aware of this process and so should not reset the vqs I think.
> > > Again that is only true because you want to use it for migration.
> > > But then you can't claim it's a generic facility.
> > I don't get it. The device status is a basic facility.
> >
> > We need to SUSPEND the device by setting SUSPEND bit, to stabilize the device
> > states for migration.
> Is the PCI's PM time not enough to suspend the device?

Are you saying we don't need virtio reset assuming we had FLR?

Suspending at different layers is like resetting at different layers.

We have both FLR and virtio reset. The virtio-level function can be
reset without FLR. The same goes for suspend.

That's it.

And if you want to rule P2P behaviours, PCI PM is really the correct
way to go instead of trying to do it at the virtio layer.
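
As a rough illustration of the layering point, here is a minimal C sketch in which virtio-level reset and suspend are written once against a small status interface, while each transport only supplies status accessors. All demo_ names are hypothetical, the two mock transports are simplifications, and the SUSPEND bit value is a placeholder, since the thread does not give its encoding. Writing 0 to the device status to reset is the existing virtio behaviour mentioned in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder encoding: the real value of the proposed SUSPEND bit is not given here. */
#define DEMO_STATUS_SUSPEND 0x10u

/* What a transport has to provide: access to the device status field. */
struct demo_transport_ops {
    uint8_t (*get_status)(void *priv);
    void (*set_status)(void *priv, uint8_t status);
};

struct demo_virtio_dev {
    const struct demo_transport_ops *ops;
    void *priv;   /* transport-private state */
};

/* virtio-level reset: write 0 to the device status, whatever the transport is. */
static void demo_virtio_reset(struct demo_virtio_dev *vdev)
{
    assert(vdev && vdev->ops);
    vdev->ops->set_status(vdev->priv, 0);
}

/* virtio-level suspend: set the (placeholder) SUSPEND bit, keeping the other bits. */
static void demo_virtio_suspend(struct demo_virtio_dev *vdev)
{
    assert(vdev && vdev->ops);
    uint8_t s = vdev->ops->get_status(vdev->priv);
    vdev->ops->set_status(vdev->priv, (uint8_t)(s | DEMO_STATUS_SUSPEND));
}

/* Two mock transports that store the status differently. */
struct demo_pci_like  { uint8_t status_reg; };
struct demo_mmio_like { uint32_t status_word; };

static uint8_t demo_pci_get(void *p)
{
    return ((struct demo_pci_like *)p)->status_reg;
}

static void demo_pci_set(void *p, uint8_t s)
{
    ((struct demo_pci_like *)p)->status_reg = s;
}

static uint8_t demo_mmio_get(void *p)
{
    return (uint8_t)(((struct demo_mmio_like *)p)->status_word & 0xffu);
}

static void demo_mmio_set(void *p, uint8_t s)
{
    ((struct demo_mmio_like *)p)->status_word = s;
}

static const struct demo_transport_ops demo_pci_ops  = { demo_pci_get, demo_pci_set };
static const struct demo_transport_ops demo_mmio_ops = { demo_mmio_get, demo_mmio_set };
```

Transport-specific mechanisms such as FLR or PCI PM would sit below this interface and are deliberately not modeled.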

> For large device I could imagine it could be short.
>
> In that case if there is suspend the device available, it will be used by the guest driver itself, hypervisor wouldn’t know about it when those registers are not trapped.
> So we need two ways to suspend.
> One is guest visible, and guest controlled.
> Second is hypervisor control to fulfill the device migration needs.

Can you explain why suspend is special but not reset or why reset can
work but not suspend? If reset can work, so does suspend. If reset
can't, neither does suspend.

For example, can you explain how a system_reset in Qemu can work with
your proposal?

>
> So if you can please take a look if the proposed admin command to freeze/stop mode can be used in the emulated register case or not.

Again, if you design those for PCI, it's a layer violation. You have
answered yourself that PM is the right way to go.

> It helps to have the suspend bit in guest control as well with/without emulation mode.

I won't repeat it again. You will find you need a full transport to
satisfy all the requirements.

>
> > This can also be used for debugging I think.
>
> As Michael listed, a dedicated debug interface is usually more useful instead of in-band.

Well, I've shown you in-band facilities like debugging via ethtool,
and the kernel has a lot of others. If you have ever tried to debug in
a real production environment, you will know how useful such handy
information is, whereas out-of-band facilities are often dangerous and
usually prohibited or even unsupported.

Thanks


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-10 12:31                           ` Parav Pandit
@ 2023-11-13  3:46                             ` Jason Wang
  2023-11-13  9:23                               ` Zhu, Lingshan
  2023-11-15 17:36                               ` Parav Pandit
  0 siblings, 2 replies; 186+ messages in thread
From: Jason Wang @ 2023-11-13  3:46 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

On Fri, Nov 10, 2023 at 8:31 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Friday, November 10, 2023 12:23 PM
> >
> >
> > On 11/9/2023 6:02 PM, Michael S. Tsirkin wrote:
> > > On Thu, Nov 09, 2023 at 06:00:27PM +0800, Zhu, Lingshan wrote:
> > >>
> > >> On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
> > >>> On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
> > >>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>> Sent: Monday, November 6, 2023 2:57 PM
> > >>>>>>
> > >>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> > >>>>>>>>
> > >>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > >>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> > >>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> > >>>>>>>>>> Lingshan
> > >>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> > >>>>>>>>>>
> > >>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > >>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This patch adds two new le16 fields to common configuration
> > >>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport
> > layer.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>> ---
> > >>>>>>>>>>>>        transport-pci.tex | 18 ++++++++++++++++++
> > >>>>>>>>>>>>        1 file changed, 18 insertions(+)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> > >>>>>>>>>>>> a5c6719..3161519 100644
> > >>>>>>>>>>>> --- a/transport-pci.tex
> > >>>>>>>>>>>> +++ b/transport-pci.tex
> > >>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
> > >>>>>>>> structure
> > >>>>>>>>>>>> layout}\label{sec:Virtio Transport
> > >>>>>>>>>>>>                /* About the administration virtqueue. */
> > >>>>>>>>>>>>                le16 admin_queue_index;         /* read-only for driver
> > */
> > >>>>>>>>>>>>                le16 admin_queue_num;         /* read-only for driver
> > */
> > >>>>>>>>>>>> +
> > >>>>>>>>>>>> +        /* Virtqueue state */
> > >>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> > >>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> > >>>>>>>>>>> This tiny interface for 128 virtio net queues through
> > >>>>>>>>>>> register read writes, does
> > >>>>>>>>>> not work effectively.
> > >>>>>>>>>>> There are inflight out of order descriptors for block also.
> > >>>>>>>>>>> Hence toy registers like this do not work.
> > >>>>>>>>>> Do you know there is a queue_select? Why this does not work?
> > >>>>>>>>>> Do you know how other queue related fields work?
> > >>>>>>>>> :)
> > >>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
> > >>>>>>>>> was done when it
> > >>>>>>>> was introduced so that live migration can _actually_ work.
> > >>>>>>>>> When queue_select is done for 128 queues serially, it take a
> > >>>>>>>>> lot of time to
> > >>>>>>>> read those slow register interface for this + inflight descriptors +
> > more.
> > >>>>>>>> interesting, virtio work in this pattern for many years, right?
> > >>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
> > >>>>>>> number of
> > >>>>>> queues were not in hw.
> > >>>>>> The registers are control path in config space, how 400G or 800G
> > affect??
> > >>>>> Because those are the one in practice requires large number of VQs.
> > >>>>>
> > >>>>> You are asking per VQ register commands to modify things dynamically
> > via this one vq at a time, serializing all the operations.
> > >>>>> It does not scale well with high q count.
> > >>>> This is not dynamically, it only happens when SUSPEND and RESUME.
> > >>>> This is the same mechanism how virtio initialize a virtqueue,
> > >>>> working for many years.
> > >>> I wish we just had a transport vq already. That's the way to solve
> > >>> this not fighting individual bits.
> > >> Yeah, I agree, transport is a queued task(sent out V4 months ago...),
> > >> one by one... hard and tough work...
> > > Frankly I think that should take precedence, then Parav will not get
> > > annoyed each time add a couple of registers.
> > I agree, things can happen and we are already here..
> Unfortunately transport vq is of not much help for below fundamental reasons.
>
> 1.1 as it involves many VMEXITS of accessing runtime config spaced on slow registers.
> 1.2 or alternatively hypervisor end up polling may thousands of registers wasting cpu resources.
> 2. It does not help of future CC use case where hypervisor must not be involved in dynamic config
> 3. Complex device like vnet has already stopped using every growing config space and using cvq, large part of work in 1.3 has shown that already
> 4. PFs also cannot infinitely grow registers, they also need less on-die registers.

You miss the point here. Nothing makes transport vq different from
what you proposed here (especially the device context part).

>
> And who knows the backward compatible SIOV devices may offer same bar as VFs.

Another self-contradiction, isn't it? You claim the BAR is not
scalable, but you still want to offer a BAR for SIOV?

>
> virto spec has already outlined this efficient concept in the spec and TC members are already following it.

Another shifting concept. That admin commands/virtqueues make sense
doesn't mean you can simply layer everything on top.

What's more, the spec can't be 100% correct; that's why there are fixes
or even reverts.

>
> SIOV for non-backward compatible mode, anyway, need new interface and vq is inherently already there which fulfilling the needs.

What interface did you mean here? How much does it differ from the
transport virtqueue?

Again, if you keep raising unrelated topics like CC or TDISP, the
discussion won't converge. And you are contradicting yourself: you
still haven't explained why your proposal can work in those cases.

Thanks



>
> > >
> > >>>>>> See the virtio common cfg, you will find the max number of vqs is
> > >>>>>> there, num_queues.
> > >>>>> :)
> > >>>>> Sure. those values at high q count affects.
> > >>>> the driver need to initialize them anyway.
> > >>>>>>> Device didn’t support LM.
> > >>>>>>> Many limitations existed all these years and TC is improving and
> > >>>>>>> expanding
> > >>>>>> them.
> > >>>>>>> So all these years do not matter.
> > >>>>>> Not sure what are you talking about, haven't we initialize the
> > >>>>>> device and vqs in config space for years?????? What's wrong with this
> > mechanism?
> > >>>>>> Are you questioning virito-pci fundamentals???
> > >>>>> Don’t point to in-efficient past to establish similar in-efficient future.
> > >>>> interesting, you know this is a one-time thing, right?
> > >>>> and you are aware of this has been there for years.
> > >>>>>>>>>> Like how to set a queue size and enable it?
> > >>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are
> > >>>>>>>>> init time
> > >>>>>>>> registers.
> > >>>>>>>>> Not to keep abusing them..
> > >>>>>>>> don't you need to set queue_size at the destination side?
> > >>>>>>> No.
> > >>>>>>> But the src/dst does not matter.
> > >>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> > >>>>>>> registers, as all
> > >>>>>> queues must be created before the driver_ok phase.
> > >>>>>>> Queue_reset was last moment exception.
> > >>>>>> create a queue? Nvidia specific?
> > >>>>>>
> > >>>>> Huh. No.
> > >>>>> Do git log and realize what happened with queue_reset.
> > >>>> You didn't answer the question, does the spec even has defined
> > >>>> "create a vq"?
> > >>>>>> For standard virtio, you need to read the number of enabled vqs
> > >>>>>> at the source side, then enable them at the dst, so queue_size matters,
> > not to create.
> > >>>>> All that happens in the pre-copy phase.
> > >>>> Yes and how your answer related to this discussion?
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-10  6:31               ` [virtio-comment] " Parav Pandit
@ 2023-11-13  9:23                 ` Zhu, Lingshan
  2023-11-15 17:35                   ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-13  9:23 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/10/2023 2:31 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 10, 2023 11:52 AM
>>
>> On 11/9/2023 6:15 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, November 9, 2023 3:28 PM
>>>>
>>>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
>>>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>>>>>>> When SUSPEND is set, device states and virtqueue states should be
>>>>>>>> stablized, therefore the driver should not reset vqs when SUSPEND
>>>>>>>> is set in device status.
>>>>>>>>
>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>> ---
>>>>>>>>      content.tex | 3 +++
>>>>>>>>      1 file changed, 3 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
>>>>>>>> 100644
>>>>>>>> --- a/content.tex
>>>>>>>> +++ b/content.tex
>>>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
>>>>>>>> Reset}\label{sec:Basic
>>>> Facilities of a Virtio Device /
>>>>>>>>      The device MUST reset any state of a virtqueue to the default state,
>>>>>>>>      including the available state and the used state.
>>>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>>>>>>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
>>>>>>>> +
>>>>>>>>      \drivernormative{\paragraph}{Virtqueue Reset}{Basic
>>>>>>>> Facilities of a
>>>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>>>>>>>      After the driver tells the device to reset a queue, the
>>>>>>>> driver MUST verify that
>>>>>>> Seems somewhat arbitrary and breaks the claim that the feature is
>>>>>>> orthogonal and can have uses besides migration.
>>>>>> when suspended, the device is frozen.
>>>>>> The driver is aware of this process and so should not reset the vqs I think.
>>>>> Again that is only true because you want to use it for migration.
>>>>> But then you can't claim it's a generic facility.
>>>> I don't get it. The device status is a basic facility.
>>>>
>>>> We need to SUSPEND the device by setting SUSPEND bit, to stabilize
>>>> the device states for migration.
>>> Is the PCI's PM time not enough to suspend the device?
>>> For large device I could imagine it could be short.
>> As you see, PCI PM, so this is a layer violation, virtio should be self contained,
> If you think it is layer violation, than suspend bit for sure is not needed. PCI PM interface should suspend/resume the device on D0<->D3 state transitions.
That doesn't follow logically: because it is a layer violation, you want
to make it worse? For example, virtio resets a device by writing 0 to
the device status, not through PCI.
>
>> and what about MMIO and CCW?
> They have largely lacked the richness of PCI transport. So those transport needs to evolve.
I am not sure CCW and MMIO maintainers want to hear this.
> Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will continue wider use.
You know this SUSPEND bit works fine on all transports, right? Because
device_status is transport independent.
>
>> This should be a basic facility.
> Other transport can also offer like PCI.
Do you want to do that work for these transports, implementing the new
features the way PCI does?
>
>>> In that case if there is suspend the device available, it will be used by the
>> guest driver itself, hypervisor wouldn’t know about it when those registers are
>> not trapped.
>>> So we need two ways to suspend.
>>> One is guest visible, and guest controlled.
>>> Second is hypervisor control to fulfill the device migration needs.
>> The guest can eve reset the device.
>>> So if you can please take a look if the proposed admin command to
>> freeze/stop mode can be used in the emulated register case or not.
>>> It helps to have the suspend bit in guest control as well with/without
>> emulation mode.
>> Parav, please believe I have read your series, I didn't comment there because I
>> want to avoid further conflicts/debating, we have done these enough.
>>
> I believe the series posted in v3 can support vdpa use case as well.
> So I will progress to post v4.
>
>> As explained before, freeze/stop the device by PCI is a layer violation.
> I am afraid, we have different vision.
> I don’t see any layer violation.
> Suspend is enough in the PCI PM.
> Our vision is more aligned with rest of the hypervisor knobs that owns the migration framework.
I think I have explained: virtio builds on top of the transports and it
should be self-contained, so far so good.
>
>> And device status can be pass-through(without emulation, just map it to
>> guest) to the guest or trapped(trap and emulate by the hypervisor, for example
>> set_status in vDPA).
> When it is pass-through, it is controlled by the guest, so for example, if the guest resets the device, hypervisor has lost the control of migration context etc.
> Hence, hypervisor needs a channel which is not guest owned.
>
> Same channel can work when trap+emulation is done.
It is the guest that owns the device. It can reset the device, and once
reset, the device context is cleared.
>
>>>> This can also be used for debugging I think.
>>> As Michael listed, a dedicated debug interface is usually more useful instead
>> of in-band.
>> re-using another facility without extra efforts is not a bad thing anyway.
> I just don’t see how a suspend bit some debug feature.
> Almost everything with that regard is a debug feature to me.
Suspend, then check the device states?


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-13  3:46                             ` Jason Wang
@ 2023-11-13  9:23                               ` Zhu, Lingshan
  2023-11-15 17:36                               ` Parav Pandit
  1 sibling, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-13  9:23 UTC (permalink / raw)
  To: Jason Wang, Parav Pandit
  Cc: Michael S. Tsirkin, eperezma, cohuck, stefanha, virtio-comment



On 11/13/2023 11:46 AM, Jason Wang wrote:
> On Fri, Nov 10, 2023 at 8:31 PM Parav Pandit <parav@nvidia.com> wrote:
>>
>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>> Sent: Friday, November 10, 2023 12:23 PM
>>>
>>>
>>> On 11/9/2023 6:02 PM, Michael S. Tsirkin wrote:
>>>> On Thu, Nov 09, 2023 at 06:00:27PM +0800, Zhu, Lingshan wrote:
>>>>> On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
>>>>>> On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>>>>
>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>>>>
>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
>>>>>>>>>>>>> Lingshan
>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport
>>> layer.
>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>         transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>>>>         1 file changed, 18 insertions(+)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>>>>>>>>> structure
>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>>>                 /* About the administration virtqueue. */
>>>>>>>>>>>>>>>                 le16 admin_queue_index;         /* read-only for driver
>>> */
>>>>>>>>>>>>>>>                 le16 admin_queue_num;         /* read-only for driver
>>> */
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +        /* Virtqueue state */
>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
>>>>>>>>>>>>>> register read writes, does
>>>>>>>>>>>>> not work effectively.
>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not work?
>>>>>>>>>>>>> Do you know how other queue related fields work?
>>>>>>>>>>>> :)
>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
>>>>>>>>>>>> was done when it
>>>>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>>>>> When queue_select is done for 128 queues serially, it take a
>>>>>>>>>>>> lot of time to
>>>>>>>>>>> read those slow register interface for this + inflight descriptors +
>>> more.
>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
>>>>>>>>>> number of
>>>>>>>>> queues were not in hw.
>>>>>>>>> The registers are control path in config space, how 400G or 800G
>>> affect??
>>>>>>>> Because those are the one in practice requires large number of VQs.
>>>>>>>>
>>>>>>>> You are asking per VQ register commands to modify things dynamically
>>> via this one vq at a time, serializing all the operations.
>>>>>>>> It does not scale well with high q count.
>>>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
>>>>>>> working for many years.
>>>>>> I wish we just had a transport vq already. That's the way to solve
>>>>>> this not fighting individual bits.
>>>>> Yeah, I agree, transport is a queued task(sent out V4 months ago...),
>>>>> one by one... hard and tough work...
>>>> Frankly I think that should take precedence, then Parav will not get
>>>> annoyed each time add a couple of registers.
>>> I agree, things can happen and we are already here..
>> Unfortunately transport vq is of not much help for below fundamental reasons.
>>
>> 1.1 as it involves many VMEXITS of accessing runtime config spaced on slow registers.
>> 1.2 or alternatively hypervisor end up polling may thousands of registers wasting cpu resources.
>> 2. It does not help of future CC use case where hypervisor must not be involved in dynamic config
>> 3. Complex device like vnet has already stopped using every growing config space and using cvq, large part of work in 1.3 has shown that already
>> 4. PFs also cannot infinitely grow registers, they also need less on-die registers.
> You miss the point here. Nothing makes transport vq different from
> what you proposed here (especially the device context part).
>
>> And who knows the backward compatible SIOV devices may offer same bar as VFs.
> Another self-contradictory, isn't it? You claim the bar is not
> scalable, but you still want to offer a bar for SIOV?
>
>> virtio spec has already outlined this efficient concept in the spec and TC members are already following it.
> Another shifting concept. Admin commands/virtqueues makes sense
> doesn't mean you can simply layer everything on top.
>
> What's more, Spec can't be 100% correct, that's why there are fixes or
> even revert.
>
>> SIOV for non-backward compatible mode, anyway, need new interface and vq is inherently already there which fulfilling the needs.
> What interface did you mean here? How much does it differ from the
> transport virtqueue?
>
> Again, if you keep raising unrelated topics like CC or TDISP, the
> discussion won't converge. And you are self-contradicting that you
> still haven't explained why your proposal can work in those cases.
I agree with Jason on his replies.
>
> Thanks
>
>
>
>>>>>>>>> See the virtio common cfg, you will find the max number of vqs is
>>>>>>>>> there, num_queues.
>>>>>>>> :)
>>>>>>>> Sure. those values at high q count affects.
>>>>>>> the driver need to initialize them anyway.
>>>>>>>>>> Device didn’t support LM.
>>>>>>>>>> Many limitations existed all these years and TC is improving and
>>>>>>>>>> expanding
>>>>>>>>> them.
>>>>>>>>>> So all these years do not matter.
>>>>>>>>> Not sure what are you talking about, haven't we initialize the
>>>>>>>>> device and vqs in config space for years?????? What's wrong with this
>>> mechanism?
>>>>>>>>> Are you questioning virtio-pci fundamentals???
>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>>>>> interesting, you know this is a one-time thing, right?
>>>>>>> and you are aware of this has been there for years.
>>>>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are
>>>>>>>>>>>> init time
>>>>>>>>>>> registers.
>>>>>>>>>>>> Not to keep abusing them..
>>>>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>>>>> No.
>>>>>>>>>> But the src/dst does not matter.
>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
>>>>>>>>>> registers, as all
>>>>>>>>> queues must be created before the driver_ok phase.
>>>>>>>>>> Queue_reset was last moment exception.
>>>>>>>>> create a queue? Nvidia specific?
>>>>>>>>>
>>>>>>>> Huh. No.
>>>>>>>> Do git log and realize what happened with queue_reset.
>>>>>>> You didn't answer the question, does the spec even has defined
>>>>>>> "create a vq"?
>>>>>>>>> For standard virtio, you need to read the number of enabled vqs
>>>>>>>>> at the source side, then enable them at the dst, so queue_size matters,
>>> not to create.
>>>>>>>> All that happens in the pre-copy phase.
>>>>>>> Yes and how your answer related to this discussion?
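
The per-queue register access debated in the quoted thread can be sketched as follows: a minimal simulation, assuming the two proposed fields are windowed by `queue_select` like the existing per-queue fields. `struct sim_common_cfg`, the accessor helpers and `save_vq_states()`/`restore_vq_states()` are hypothetical names for illustration, not the layout or API from the patch.

```c
#include <stdint.h>

#define MAX_QUEUES 128 /* queue count used as the example in the thread */

/* Simulated device: per-queue state reachable only through one selector. */
struct sim_common_cfg {
    uint16_t queue_select;
    uint16_t avail_state[MAX_QUEUES];
    uint16_t used_state[MAX_QUEUES];
};

struct vq_state {
    uint16_t avail;
    uint16_t used;
};

/* Hypothetical accessors modelling queue_avail_state / queue_used_state. */
static uint16_t read_avail_state(const struct sim_common_cfg *c)
{
    return c->avail_state[c->queue_select];
}

static uint16_t read_used_state(const struct sim_common_cfg *c)
{
    return c->used_state[c->queue_select];
}

static void write_avail_state(struct sim_common_cfg *c, uint16_t v)
{
    c->avail_state[c->queue_select] = v;
}

static void write_used_state(struct sim_common_cfg *c, uint16_t v)
{
    c->used_state[c->queue_select] = v;
}

/* Source side: one select write plus two reads per queue, done serially. */
static void save_vq_states(struct sim_common_cfg *cfg, uint16_t n,
                           struct vq_state *out)
{
    for (uint16_t q = 0; q < n; q++) {
        cfg->queue_select = q;
        out[q].avail = read_avail_state(cfg);
        out[q].used = read_used_state(cfg);
    }
}

/* Destination side: one select write plus two writes per queue. */
static void restore_vq_states(struct sim_common_cfg *cfg, uint16_t n,
                              const struct vq_state *in)
{
    for (uint16_t q = 0; q < n; q++) {
        cfg->queue_select = q;
        write_avail_state(cfg, in[q].avail);
        write_used_state(cfg, in[q].used);
    }
}
```

In this model each direction costs three register accesses per queue, so 128 queues take 3 x 128 = 384 serialized accesses; whether that cost matters at suspend/resume time is the disagreement in the thread.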




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-10 12:31                           ` Parav Pandit
@ 2023-11-13  9:25                             ` Zhu, Lingshan
  2023-11-15 17:35                               ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-13  9:25 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/10/2023 8:31 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Friday, November 10, 2023 1:22 PM
>>
>>
>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, November 9, 2023 3:39 PM
>>>>
>>>>
>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
>>>>>>
>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>>>
>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>>>
>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
>>>>>>>>>>>> Lingshan
>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This patch adds two new le16 fields to common configuration
>>>>>>>>>>>>>> structure to support VIRTIO_F_QUEUE_STATE in PCI transport
>> layer.
>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>         transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>>>         1 file changed, 18 insertions(+)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration
>>>>>>>>>> structure
>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>>                 /* About the administration virtqueue. */
>>>>>>>>>>>>>>                 le16 admin_queue_index;         /* read-only for driver
>> */
>>>>>>>>>>>>>>                 le16 admin_queue_num;         /* read-only for driver
>> */
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
>>>>>>>>>>>>> register read writes, does
>>>>>>>>>>>> not work effectively.
>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>>>> Do you know there is a queue_select? Why this does not work?
>>>>>>>>>>>> Do you know how other queue related fields work?
>>>>>>>>>>> :)
>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
>>>>>>>>>>> was done when it
>>>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>>>> When queue_select is done for 128 queues serially, it take a
>>>>>>>>>>> lot of time to
>>>>>>>>>> read those slow register interface for this + inflight descriptors +
>> more.
>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
>>>>>>>>> number of
>>>>>>>> queues were not in hw.
>>>>>>>> The registers are control path in config space, how 400G or 800G
>> affect??
>>>>>>> Because those are the one in practice requires large number of VQs.
>>>>>>>
>>>>>>> You are asking per VQ register commands to modify things
>>>>>>> dynamically via
>>>>>> this one vq at a time, serializing all the operations.
>>>>>>> It does not scale well with high q count.
>>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>>>> This is the same mechanism how virtio initialize a virtqueue,
>>>>>> working for many years.
>>>>> No. when virtio driver initializes it for the first time, there is
>>>>> no active traffic
>>>> that gets lost.
>>>>> This is because the interface is not yet up and not part of the network yet.
>>>>>
>>>>> The resume must be fast enough, because the remote node is sending
>>>> packets.
>>>>> Hence it is different from driver init time queue enable.
>>>> I am not sure any packets arrive before a link announce at the destination
>> side.
>>> I think it can.
>>> Because there is no notification of member device link down intimation to
>> remote side.
>>> The L4 and L5 protocols have no knowledge that node which they are
>> interacting is behind some layers of switches.
>>> So keeping this time low is desired.
>> The NIC should broad cast itself first, so that other peers in the network
>> know(for example its mac to route it) how to send a message to it.
>>
>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
>> mechanism work for in-marketing productions for years.
>>
>> This is out of the topic anyway.
>>>>>>>> See the virtio common cfg, you will find the max number of vqs is
>>>>>>>> there, num_queues.
>>>>>>> :)
>>>>>>> Sure. those values at high q count affects.
>>>>>> the driver need to initialize them anyway.
>>>>> That is before the traffic starts from remote end.
>>>> see above, that needs a link announce and this is after
>>>> re-initialization
>>>>>>>>> Device didn’t support LM.
>>>>>>>>> Many limitations existed all these years and TC is improving and
>>>>>>>>> expanding
>>>>>>>> them.
>>>>>>>>> So all these years do not matter.
>>>>>>>> Not sure what are you talking about, haven't we initialize the
>>>>>>>> device and vqs in config space for years?????? What's wrong with
>>>>>>>> this
>>>> mechanism?
>>>>>>>> Are you questioning virtio-pci fundamentals???
>>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>>>> interesting, you know this is a one-time thing, right?
>>>>>> and you are aware of this has been there for years.
>>>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they are
>>>>>>>>>>> init time
>>>>>>>>>> registers.
>>>>>>>>>>> Not to keep abusing them..
>>>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>>>> No.
>>>>>>>>> But the src/dst does not matter.
>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
>>>>>>>>> registers, as all
>>>>>>>> queues must be created before the driver_ok phase.
>>>>>>>>> Queue_reset was last moment exception.
>>>>>>>> create a queue? Nvidia specific?
>>>>>>>>
>>>>>>> Huh. No.
>>>>>>> Do git log and realize what happened with queue_reset.
>>>>>> You didn't answer the question, does the spec even has defined
>>>>>> "create a
>>>> vq"?
>>>>> Enabled/created = tomato/tomato when discussing the spec in
>>>>> non-normative
>>>> email conversation.
>>>>> It's irrelevant.
>>>> Then lets not debate on this enable a vq or create a vq anymore
>>>>> All I am saying is, when we know the limitations of the transport
>>>>> and when industry is forwarding to not introduced more and more
>>>>> on-die register
>>>> for once in lifetime work of device migration, we just use the
>>>> optimal command and queue interface that is native to virtio.
>>>> PCI config space has its own limitations, and admin vq has its
>>>> advantages, but that does not apply to all use cases.
>>>>
>>> There was a recent work done emulating the SR-IOV cap and allowing VM to
>> enable SR-IOV in [1].
>>> This is the option I mentioned few weeks ago.
>>>
>>> So with admin commands and admin virtqueues, even nested model will work
>> using [1].
>>> [1]
>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-o
>>> n-virtual-machines.html
>> We should take this into consideration once it is standardized in the spec,
>> maybe not now, there can always be many workarounds to solve one problem.
> Sure, until that point the admin commands are able to suffice the need well.
> And when the spec changes in transport occurs (if needed), current admin command and admin vq also fits very well that will follow above [1].
We have pointed out lots of problems with the admin vq based live
migration proposal; I won't repeat them here.
>
> Thanks.
>>>> I don't want to repeat why I don't think admin vq is a good idea for
>>>> migration again, we have already discussed on that.




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-08 17:56   ` Michael S. Tsirkin
@ 2023-11-13  9:29     ` Zhu, Lingshan
  2023-11-13 10:10       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-13  9:29 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav



On 11/9/2023 1:56 AM, Michael S. Tsirkin wrote:
> On Fri, Nov 03, 2023 at 06:34:35PM +0800, Zhu Lingshan wrote:
>> This patch adds two new le16 fields to common configuration structure
>> to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
>>
>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>> ---
>>   transport-pci.tex | 18 ++++++++++++++++++
>>   1 file changed, 18 insertions(+)
>>
>> diff --git a/transport-pci.tex b/transport-pci.tex
>> index a5c6719..3161519 100644
>> --- a/transport-pci.tex
>> +++ b/transport-pci.tex
>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>>           /* About the administration virtqueue. */
>>           le16 admin_queue_index;         /* read-only for driver */
>>           le16 admin_queue_num;         /* read-only for driver */
>> +
>> +	/* Virtqueue state */
>> +        le16 queue_avail_state;         /* read-write */
>> +        le16 queue_used_state;          /* read-write */
>>   };
>>   \end{lstlisting}
>>   
>> @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>>   	The value 0 indicates no supported administration virtqueues.
>>   	This field is valid only if VIRTIO_F_ADMIN_VQ has been
>>   	negotiated.
>> +
>> +\item[\field{queue_avail_state}]
>> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
>> +        negotiated. The driver sets and gets the available state of
>> +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
>> +
>> +\item[\field{queue_used_state}]
>> +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
>> +        negotiated. The driver sets and gets the used state of the
>> +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
>> +
>>   \end{description}
>>   
>>   \devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}
>
> Two fields are pointless in the general case.  Fix this to at least
> support out of order buffer use, then there's something to talk about.
> I suspect we'll be back to yet another bespoke mailbox and a bitmap for
> this.
For a split virtqueue, there are the available ring, the used ring and
the descriptor table, which means the device can always tell which
descriptors/buffers are in-flight or not yet processed, even when
buffers are used out of order.
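
As a small illustration of the two indices this series exposes, the sketch below computes how many buffers a split virtqueue has in flight, assuming last_avail_idx counts buffers the device has fetched and last_used_idx counts buffers it has returned, both as free-running 16-bit values. The function name is hypothetical. It yields only the count; identifying which buffers are in flight under out-of-order use is the part that needs a concrete algorithm.

```c
#include <stdint.h>

/* Buffers fetched from the available ring but not yet placed in the used
 * ring. Both indices wrap at 2^16, so the subtraction is done modulo 2^16
 * by truncating the result to 16 bits. */
static uint16_t split_vq_inflight_count(uint16_t last_avail_idx,
                                        uint16_t last_used_idx)
{
    return (uint16_t)(last_avail_idx - last_used_idx);
}
```

For example, last_avail_idx = 2 and last_used_idx = 65534 gives 4, since the available index has wrapped past 65535.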

For a packed virtqueue, because the descriptors may be overwritten, when
buffers are used out of order the descriptors behind last_avail_idx
should be considered in-flight and should be processed at the
destination side.

Does this work for you?
>
>
>> @@ -488,6 +503,9 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
>>   present either a value of 0 or a power of 2 in
>>   \field{queue_size}.
>>   
>> +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
>> +any accesses to \field{queue_avail_state} and \field{queue_used_state}.
>> +
>>   If VIRTIO_F_ADMIN_VQ has been negotiated, the value
>>   \field{admin_queue_index} MUST be equal to, or bigger than
>>   \field{num_queues}; also, \field{admin_queue_num} MUST be
>> -- 
>> 2.35.3




^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-13  9:29     ` Zhu, Lingshan
@ 2023-11-13 10:10       ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-13 10:10 UTC (permalink / raw)
  To: Zhu, Lingshan; +Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment, parav

On Mon, Nov 13, 2023 at 05:29:51PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/9/2023 1:56 AM, Michael S. Tsirkin wrote:
> > On Fri, Nov 03, 2023 at 06:34:35PM +0800, Zhu Lingshan wrote:
> > > This patch adds two new le16 fields to common configuration structure
> > > to support VIRTIO_F_QUEUE_STATE in PCI transport layer.
> > > 
> > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > ---
> > >   transport-pci.tex | 18 ++++++++++++++++++
> > >   1 file changed, 18 insertions(+)
> > > 
> > > diff --git a/transport-pci.tex b/transport-pci.tex
> > > index a5c6719..3161519 100644
> > > --- a/transport-pci.tex
> > > +++ b/transport-pci.tex
> > > @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
> > >           /* About the administration virtqueue. */
> > >           le16 admin_queue_index;         /* read-only for driver */
> > >           le16 admin_queue_num;         /* read-only for driver */
> > > +
> > > +	/* Virtqueue state */
> > > +        le16 queue_avail_state;         /* read-write */
> > > +        le16 queue_used_state;          /* read-write */
> > >   };
> > >   \end{lstlisting}
> > > @@ -428,6 +432,17 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
> > >   	The value 0 indicates no supported administration virtqueues.
> > >   	This field is valid only if VIRTIO_F_ADMIN_VQ has been
> > >   	negotiated.
> > > +
> > > +\item[\field{queue_avail_state}]
> > > +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> > > +        negotiated. The driver sets and gets the available state of
> > > +        the virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> > > +
> > > +\item[\field{queue_used_state}]
> > > +        This field is valid only if VIRTIO_F_QUEUE_STATE has been
> > > +        negotiated. The driver sets and gets the used state of the
> > > +        virtqueue here (see \ref{sec:Virtqueues / Virtqueue State}).
> > > +
> > >   \end{description}
> > >   \devicenormative{\paragraph}{Common configuration structure layout}{Virtio Transport Options / Virtio Over PCI Bus / PCI Device Layout / Common configuration structure layout}
> > 
> > Two fields are pointless in the general case.  Fix this to at least
> > support out of order buffer use, then there's something to talk about.
> > I suspect we'll be back to yet another bespoke mailbox and a bitmap for
> > this.
> For a split virtqueue, there are the available ring, the used ring and
> the descriptor table, which means the device can always tell which
> descriptors/buffers are in-flight or not yet processed, even when
> buffers are used out of order.

Unfortunately, I don't believe so. Without a specific algorithm from
you, it is hard to say exactly what the bug in it is. Describe an
algorithm and I should be able to point out the issues to you.


> For a packed virtqueue, because the descriptors may be overwritten, when
> buffers are used out of order the descriptors behind last_avail_idx
> should be considered in-flight and should be processed at the
> destination side.
> 
> Does this work for you?

Not without a lot more detail.


> > 
> > 
> > > @@ -488,6 +503,9 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
> > >   present either a value of 0 or a power of 2 in
> > >   \field{queue_size}.
> > > +If VIRTIO_F_QUEUE_STATE has not been negotiated, the device MUST ignore
> > > +any accesses to \field{queue_avail_state} and \field{queue_used_state}.
> > > +
> > >   If VIRTIO_F_ADMIN_VQ has been negotiated, the value
> > >   \field{admin_queue_index} MUST be equal to, or bigger than
> > >   \field{num_queues}; also, \field{admin_queue_num} MUST be
> > > -- 
> > > 2.35.3




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-13  9:25                             ` Zhu, Lingshan
@ 2023-11-15 17:35                               ` Parav Pandit
  2023-11-16 10:14                                 ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-15 17:35 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 13, 2023 2:56 PM
> 
> 
> 
> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, November 10, 2023 1:22 PM
> >>
> >>
> >> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Thursday, November 9, 2023 3:39 PM
> >>>>
> >>>>
> >>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> >>>>>>
> >>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>>>>>
> >>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>>>>>
> >>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
> >>>>>>>>>>>> Lingshan
> >>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This patch adds two new le16 fields to common
> >>>>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
> >>>>>>>>>>>>>> in PCI transport
> >> layer.
> >>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>         transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>>>>>         1 file changed, 18 insertions(+)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
> >>>>>>>>>>>>>> a5c6719..3161519 100644
> >>>>>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
> configuration
> >>>>>>>>>> structure
> >>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>>>>>                 /* About the administration virtqueue. */
> >>>>>>>>>>>>>>                 le16 admin_queue_index;         /* read-only for driver
> >> */
> >>>>>>>>>>>>>>                 le16 admin_queue_num;         /* read-only for driver
> >> */
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> +	/* Virtqueue state */
> >>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
> >>>>>>>>>>>>> register read writes, does
> >>>>>>>>>>>> not work effectively.
> >>>>>>>>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>>>>>> Do you know there is a queue_select? Why this does not work?
> >>>>>>>>>>>> Do you know how other queue related fields work?
> >>>>>>>>>>> :)
> >>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
> >>>>>>>>>>> was done when it
> >>>>>>>>>> was introduced so that live migration can _actually_ work.
> >>>>>>>>>>> When queue_select is done for 128 queues serially, it take a
> >>>>>>>>>>> lot of time to
> >>>>>>>>>> read those slow register interface for this + inflight
> >>>>>>>>>> descriptors +
> >> more.
> >>>>>>>>>> interesting, virtio work in this pattern for many years, right?
> >>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
> >>>>>>>>> number of
> >>>>>>>> queues were not in hw.
> >>>>>>>> The registers are control path in config space, how 400G or
> >>>>>>>> 800G
> >> affect??
> >>>>>>> Because those are the one in practice requires large number of VQs.
> >>>>>>>
> >>>>>>> You are asking per VQ register commands to modify things
> >>>>>>> dynamically via
> >>>>>> this one vq at a time, serializing all the operations.
> >>>>>>> It does not scale well with high q count.
> >>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
> >>>>>> This is the same mechanism how virtio initialize a virtqueue,
> >>>>>> working for many years.
> >>>>> No. when virtio driver initializes it for the first time, there is
> >>>>> no active traffic
> >>>> that gets lost.
> >>>>> This is because the interface is not yet up and not part of the network
> yet.
> >>>>>
> >>>>> The resume must be fast enough, because the remote node is sending
> >>>> packets.
> >>>>> Hence it is different from driver init time queue enable.
> >>>> I am not sure any packets arrive before a link announce at the
> >>>> destination
> >> side.
> >>> I think it can.
> >>> Because there is no notification of member device link down
> >>> intimation to
> >> remote side.
> >>> The L4 and L5 protocols have no knowledge that node which they are
> >> interacting is behind some layers of switches.
> >>> So keeping this time low is desired.
> >> The NIC should broad cast itself first, so that other peers in the
> >> network know(for example its mac to route it) how to send a message to it.
> >>
> >> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
> >> mechanism work for in-marketing productions for years.
> >>
> >> This is out of the topic anyway.
> >>>>>>>> See the virtio common cfg, you will find the max number of vqs
> >>>>>>>> is there, num_queues.
> >>>>>>> :)
> >>>>>>> Sure. those values at high q count affects.
> >>>>>> the driver need to initialize them anyway.
> >>>>> That is before the traffic starts from remote end.
> >>>> see above, that needs a link announce and this is after
> >>>> re-initialization
> >>>>>>>>> Device didn’t support LM.
> >>>>>>>>> Many limitations existed all these years and TC is improving
> >>>>>>>>> and expanding
> >>>>>>>> them.
> >>>>>>>>> So all these years do not matter.
> >>>>>>>> Not sure what are you talking about, haven't we initialize the
> >>>>>>>> device and vqs in config space for years?????? What's wrong
> >>>>>>>> with this
> >>>> mechanism?
> >>>>>>>> Are you questioning virito-pci fundamentals???
> >>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
> >>>>>> interesting, you know this is a one-time thing, right?
> >>>>>> and you are aware of this has been there for years.
> >>>>>>>>>>>> Like how to set a queue size and enable it?
> >>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they
> >>>>>>>>>>> are init time
> >>>>>>>>>> registers.
> >>>>>>>>>>> Not to keep abusing them..
> >>>>>>>>>> don't you need to set queue_size at the destination side?
> >>>>>>>>> No.
> >>>>>>>>> But the src/dst does not matter.
> >>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> >>>>>>>>> registers, as all
> >>>>>>>> queues must be created before the driver_ok phase.
> >>>>>>>>> Queue_reset was last moment exception.
> >>>>>>>> create a queue? Nvidia specific?
> >>>>>>>>
> >>>>>>> Huh. No.
> >>>>>>> Do git log and realize what happened with queue_reset.
> >>>>>> You didn't answer the question, does the spec even has defined
> >>>>>> "create a
> >>>> vq"?
> >>>>> Enabled/created = tomato/tomato when discussing the spec in
> >>>>> non-normative
> >>>> email conversation.
> >>>>> It's irrelevant.
> >>>> Then lets not debate on this enable a vq or create a vq anymore
> >>>>> All I am saying is, when we know the limitations of the transport
> >>>>> and when industry is forwarding to not introduced more and more
> >>>>> on-die register
> >>>> for once in lifetime work of device migration, we just use the
> >>>> optimal command and queue interface that is native to virtio.
> >>>> PCI config space has its own limitations, and admin vq has its
> >>>> advantages, but that does not apply to all use cases.
> >>>>
> >>> There was a recent work done emulating the SR-IOV cap and allowing
> >>> VM to
> >> enable SR-IOV in [1].
> >>> This is the option I mentioned few weeks ago.
> >>>
> >>> So with admin commands and admin virtqueues, even nested model will
> >>> work
> >> using [1].
> >>> [1]
> >>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload
> >>> -o
> >>> n-virtual-machines.html
> >> We should take this into consideration once it is standardized in the
> >> spec, maybe not now, there can always be many workarounds to solve one
> problem.
> > Sure, until that point the admin commands are able to suffice the need well.
> > And when the spec changes in transport occurs (if needed), current admin
> command and admin vq also fits very well that will follow above [1].
> we have pointed lots of problems for admin vq based live migration proposal, I
> won't repeat them here
I don’t see any.
The nested case is already solved using the above [1].
A long time ago, you mentioned a QoS issue, which exists in the device register method too anyway.
Can you please list them, if there are any other than QoS and nesting?
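
A minimal sketch of the per-queue register pattern debated in this message, assuming the queue_select, queue_avail_state and queue_used_state fields from the quoted patch. The mock_common_cfg structure, the helper names and the access counter are hypothetical and exist only for illustration; the sketch models the access count only, not real MMIO latency or VM exits.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_QUEUES 128u  /* queue count used as the example in this thread */

/* Hypothetical mock of the common configuration fields involved. */
struct mock_common_cfg {
    uint16_t queue_select;
    uint16_t avail_state[NUM_QUEUES];  /* backs queue_avail_state */
    uint16_t used_state[NUM_QUEUES];   /* backs queue_used_state */
    unsigned long accesses;            /* emulated register reads + writes */
};

struct vq_state {
    uint16_t avail;
    uint16_t used;
};

static void write_queue_select(struct mock_common_cfg *cfg, uint16_t q)
{
    assert(q < NUM_QUEUES);
    cfg->queue_select = q;
    cfg->accesses++;
}

static uint16_t read_queue_avail_state(struct mock_common_cfg *cfg)
{
    cfg->accesses++;
    return cfg->avail_state[cfg->queue_select];
}

static uint16_t read_queue_used_state(struct mock_common_cfg *cfg)
{
    cfg->accesses++;
    return cfg->used_state[cfg->queue_select];
}

/*
 * Save the state of every queue, strictly one queue at a time:
 * select the queue, then read its two state registers.
 * Returns the number of register accesses this took.
 */
static unsigned long save_all_vq_state(struct mock_common_cfg *cfg,
                                       struct vq_state *out, uint16_t nqueues)
{
    unsigned long before = cfg->accesses;

    for (uint16_t q = 0; q < nqueues; q++) {
        write_queue_select(cfg, q);
        out[q].avail = read_queue_avail_state(cfg);
        out[q].used = read_queue_used_state(cfg);
    }
    return cfg->accesses - before;
}
```

Under these assumptions each queue costs three serialized register accesses, so 128 queues cost 384. Whether that is acceptable during suspend/resume is exactly what the two sides disagree on above.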

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-13  9:23                 ` Zhu, Lingshan
@ 2023-11-15 17:35                   ` Parav Pandit
  2023-11-16 10:09                     ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-15 17:35 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, November 13, 2023 2:53 PM
> 
> On 11/10/2023 2:31 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Friday, November 10, 2023 11:52 AM
> >>
> >> On 11/9/2023 6:15 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Thursday, November 9, 2023 3:28 PM
> >>>>
> >>>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> >>>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> >>>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> >>>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> >>>>>>>> When SUSPEND is set, device states and virtqueue states should
> >>>>>>>> be stablized, therefore the driver should not reset vqs when
> >>>>>>>> SUSPEND is set in device status.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> ---
> >>>>>>>>      content.tex | 3 +++
> >>>>>>>>      1 file changed, 3 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> >>>>>>>> 100644
> >>>>>>>> --- a/content.tex
> >>>>>>>> +++ b/content.tex
> >>>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> >>>>>>>> Reset}\label{sec:Basic
> >>>> Facilities of a Virtio Device /
> >>>>>>>>      The device MUST reset any state of a virtqueue to the default
> state,
> >>>>>>>>      including the available state and the used state.
> >>>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> >>>>>>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> >>>>>>>> +
> >>>>>>>>      \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> >>>>>>>> Facilities of a
> >>>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> >>>>>>>>      After the driver tells the device to reset a queue, the
> >>>>>>>> driver MUST verify that
> >>>>>>> Seems somewhat arbitrary and breaks the claim that the feature
> >>>>>>> is orthogonal and can have uses besides migration.
> >>>>>> when suspended, the device is frozen.
> >>>>>> The driver is aware of this process and so should not reset the vqs I
> think.
> >>>>> Again that is only true because you want to use it for migration.
> >>>>> But then you can't claim it's a generic facility.
> >>>> I don't get it. The device status is a basic facility.
> >>>>
> >>>> We need to SUSPEND the device by setting SUSPEND bit, to stabilize
> >>>> the device states for migration.
> >>> Is the PCI's PM time not enough to suspend the device?
> >>> For large device I could imagine it could be short.
> >> As you see, PCI PM, so this is a layer violation, virtio should be
> >> self contained,
> > If you think it is layer violation, than suspend bit for sure is not needed. PCI
> PM interface should suspend/resume the device on D0<->D3 state transitions.
> Doesn't make sense logically, because it is layer violation, so you want it to be
> worse? For example, virito writes 0 to device status to reset a device, not by PCI.
This whole layer violation argument is just abstract to me.
Your argument contradicts both your fellow author and yourself.

I don’t want to make it worse.
If you think it is a layer violation, just depend on PCI PM; there is no need to add a new suspend bit.

> >
> >> and what about MMIO and CCW?
> > They have largely lacked the richness of PCI transport. So those transport
> needs to evolve.
> I am not sure CCW and MMIO maintainers want to hear this.
> > Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will
> continue wider use.
> you know this SUSPEND bit work fine on all transport, right? Because
> device_status is transport independent.

I want to emphasize that I am not against the suspend bit, as long as it is controlled by the guest driver and does not interfere with the device migration flow (like the rest of the state).

The practical reason for keeping the suspend functionality under guest control is that suspending and resuming a large device can take time.
So let it be under guest driver control. There is no need to mix it up with the device migration flow.

> >
> >> This should be a basic facility.
> > Other transport can also offer like PCI.
> Do you want to work for these transport? Implementing the new features as
> PCI?
Not presently, as PCI has more features than the other two.
What I read about ccw is: "S/390 based virtual machines support neither PCI nor MMIO".

And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".

So I don’t know who needs to extend ccw.
And if someone needs it, those maintainers will extend it to match the PCI standard.

> >
> >>> In that case if there is suspend the device available, it will be
> >>> used by the
> >> guest driver itself, hypervisor wouldn’t know about it when those
> >> registers are not trapped.
> >>> So we need two ways to suspend.
> >>> One is guest visible, and guest controlled.
> >>> Second is hypervisor control to fulfill the device migration needs.
> >> The guest can eve reset the device.
> >>> So if you can please take a look if the proposed admin command to
> >> freeze/stop mode can be used in the emulated register case or not.
> >>> It helps to have the suspend bit in guest control as well
> >>> with/without
> >> emulation mode.
> >> Parav, please believe I have read your series, I didn't comment there
> >> because I want to avoid further conflicts/debating, we have done these
> enough.
> >>
> > I believe the series posted in v3 can support vdpa use case as well.
> > So I will progress to post v4.
> >
> >> As explained before, freeze/stop the device by PCI is a layer violation.
> > I am afraid, we have different vision.
> > I don’t see any layer violation.
> > Suspend is enough in the PCI PM.
> > Our vision is more aligned with rest of the hypervisor knobs that owns the
> migration framework.
> I think I have explained, virito builds on other transport and it should be self-
> contained, so far so good.
Virtio without any transport binding is just a blank-paper discussion.

> >
> >> And device status can be pass-through(without emulation, just map it
> >> to
> >> guest) to the guest or trapped(trap and emulate by the hypervisor,
> >> for example set_status in vDPA).
> > When it is pass-through, it is controlled by the guest, so for example, if the
> guest resets the device, hypervisor has lost the control of migration context etc.
> > Hence, hypervisor needs a channel which is not guest owned.
> >
> > Same channel can work when trap+emulation is done.
> It is the guest owns the device, it can reset the device, once reset, the device
> context are cleared.
The hypervisor does not have the ability to read/write the device context. It has lost the channel, as the hypervisor is not involved in trap+emulation.
So it is not helpful in one use case.

Admin commands can work even with trap+emulation mode.

What is missing that should be added?

> >
> >>>> This can also be used for debugging I think.
> >>> As Michael listed, a dedicated debug interface is usually more
> >>> useful instead
> >> of in-band.
> >> re-using another facility without extra efforts is not a bad thing anyway.
> > I just don’t see how a suspend bit some debug feature.
> > Almost everything with that regard is a debug feature to me.
> suspend then check the device states?
You have already suspended the device, so the device state has already changed.
All the debug information has changed, so it is no longer useful.
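
A minimal sketch of the driver-side rule in the quoted patch ("If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in device status, the driver SHOULD NOT reset any virtqueues"), together with the device reset by writing 0 to device status mentioned above. The numeric value of the SUSPEND bit is not given in this thread, so the value below is an assumption, and the driver_view structure and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device status bits defined by the virtio specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 1u
#define VIRTIO_STATUS_DRIVER      2u
#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u

/* ASSUMPTION: bit value of the proposed SUSPEND bit; not stated in this thread. */
#define VIRTIO_STATUS_SUSPEND_HYPOTHETICAL 16u

/* Hypothetical driver-side view of one device. */
struct driver_view {
    bool suspend_negotiated;  /* VIRTIO_F_SUSPEND was negotiated */
    uint8_t device_status;    /* last value of the device status field */
};

/* Set SUSPEND in device status; only meaningful if the feature was negotiated. */
static void driver_set_suspend(struct driver_view *d)
{
    assert(d->suspend_negotiated);
    d->device_status =
        (uint8_t)(d->device_status | VIRTIO_STATUS_SUSPEND_HYPOTHETICAL);
}

/* Proposed rule: no virtqueue reset while the device is suspended. */
static bool driver_may_reset_vq(const struct driver_view *d)
{
    bool suspended = d->suspend_negotiated &&
        (d->device_status & VIRTIO_STATUS_SUSPEND_HYPOTHETICAL) != 0;

    return !suspended;
}

/* Device reset: the driver writes 0 to device status, clearing every bit. */
static void driver_reset_device(struct driver_view *d)
{
    d->device_status = 0;
}
```

The check lives entirely on the driver side, which matches the V2 change from "the device should ignore resetting vqs when suspended" to "the driver should not reset vqs when suspended". It does not address the separate question raised above of whether the hypervisor needs its own channel.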

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-13  3:46                             ` Jason Wang
  2023-11-13  9:23                               ` Zhu, Lingshan
@ 2023-11-15 17:36                               ` Parav Pandit
  1 sibling, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-15 17:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 13, 2023 9:17 AM
> 
> On Fri, Nov 10, 2023 at 8:31 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Friday, November 10, 2023 12:23 PM
> > >
> > >
> > > On 11/9/2023 6:02 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Nov 09, 2023 at 06:00:27PM +0800, Zhu, Lingshan wrote:
> > > >>
> > > >> On 11/9/2023 1:44 AM, Michael S. Tsirkin wrote:
> > > >>> On Tue, Nov 07, 2023 at 05:31:38PM +0800, Zhu, Lingshan wrote:
> > > >>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>> Sent: Monday, November 6, 2023 2:57 PM
> > > >>>>>>
> > > >>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> > > >>>>>>>>
> > > >>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > >>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> > > >>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> > > >>>>>>>>>> Lingshan
> > > >>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > >>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This patch adds two new le16 fields to common
> > > >>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
> > > >>>>>>>>>>>> in PCI transport
> > > layer.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>>>>>> ---
> > > >>>>>>>>>>>>        transport-pci.tex | 18 ++++++++++++++++++
> > > >>>>>>>>>>>>        1 file changed, 18 insertions(+)
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> > > >>>>>>>>>>>> index
> > > >>>>>>>>>>>> a5c6719..3161519 100644
> > > >>>>>>>>>>>> --- a/transport-pci.tex
> > > >>>>>>>>>>>> +++ b/transport-pci.tex
> > > >>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
> > > >>>>>>>>>>>> configuration
> > > >>>>>>>> structure
> > > >>>>>>>>>>>> layout}\label{sec:Virtio Transport
> > > >>>>>>>>>>>>                /* About the administration virtqueue. */
> > > >>>>>>>>>>>>                le16 admin_queue_index;         /* read-only for driver
> > > */
> > > >>>>>>>>>>>>                le16 admin_queue_num;         /* read-only for driver
> > > */
> > > >>>>>>>>>>>> +
> > > >>>>>>>>>>>> +        /* Virtqueue state */
> > > >>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> > > >>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> > > >>>>>>>>>>> This tiny interface for 128 virtio net queues through
> > > >>>>>>>>>>> register read writes, does
> > > >>>>>>>>>> not work effectively.
> > > >>>>>>>>>>> There are inflight out of order descriptors for block also.
> > > >>>>>>>>>>> Hence toy registers like this do not work.
> > > >>>>>>>>>> Do you know there is a queue_select? Why this does not work?
> > > >>>>>>>>>> Do you know how other queue related fields work?
> > > >>>>>>>>> :)
> > > >>>>>>>>> Yes. If you notice queue_reset related critical spec bug
> > > >>>>>>>>> fix was done when it
> > > >>>>>>>> was introduced so that live migration can _actually_ work.
> > > >>>>>>>>> When queue_select is done for 128 queues serially, it take
> > > >>>>>>>>> a lot of time to
> > > >>>>>>>> read those slow register interface for this + inflight
> > > >>>>>>>> descriptors +
> > > more.
> > > >>>>>>>> interesting, virtio work in this pattern for many years, right?
> > > >>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
> > > >>>>>>> number of
> > > >>>>>> queues were not in hw.
> > > >>>>>> The registers are control path in config space, how 400G or
> > > >>>>>> 800G
> > > affect??
> > > >>>>> Because those are the one in practice requires large number of VQs.
> > > >>>>>
> > > >>>>> You are asking per VQ register commands to modify things
> > > >>>>> dynamically
> > > via this one vq at a time, serializing all the operations.
> > > >>>>> It does not scale well with high q count.
> > > >>>> This is not dynamically, it only happens when SUSPEND and RESUME.
> > > >>>> This is the same mechanism how virtio initialize a virtqueue,
> > > >>>> working for many years.
> > > >>> I wish we just had a transport vq already. That's the way to
> > > >>> solve this not fighting individual bits.
> > > >> Yeah, I agree, transport is a queued task(sent out V4 months
> > > >> ago...), one by one... hard and tough work...
> > > > Frankly I think that should take precedence, then Parav will not
> > > > get annoyed each time add a couple of registers.
> > > I agree, things can happen and we are already here..
> > Unfortunately transport vq is of not much help for below fundamental
> reasons.
> >
> > 1.1 as it involves many VMEXITS of accessing runtime config spaced on slow
> registers.
> > 1.2 or alternatively hypervisor end up polling may thousands of registers
> wasting cpu resources.
> > 2. It does not help of future CC use case where hypervisor must not be
> > involved in dynamic config 3. Complex device like vnet has already
> > stopped using every growing config space and using cvq, large part of work in
> 1.3 has shown that already 4. PFs also cannot infinitely grow registers, they also
> need less on-die registers.
> 
> You miss the point here. Nothing makes transport vq different from what you
> proposed here (especially the device context part).
> 
It does.
Device context, on read, reports only the delta. A transport via MMIO or VQ reads/writes exactly what is asked.

The second functional difference is:
When the transport VQ is present on the member device itself, it further secures the communication from the hypervisor.
The third is: it eliminates issues 1 to 4 above.

> >
> > And who knows the backward compatible SIOV devices may offer same bar as
> VFs.
> 
> Another self-contradictory, isn't it? You claim the bar is not scalable, but you still
> want to offer a bar for SIOV?
> 
> >
> > virto spec has already outlined this efficient concept in the spec and TC
> members are already following it.
> 
> Another shifting concept. Admin commands/virtqueues makes sense doesn't
> mean you can simply layer everything on top.
>
Not at all. Using the CVQ for non-init work has been an existing concept since 2014.
There is no need to prove it again.

> What's more, Spec can't be 100% correct, that's why there are fixes or even
> revert.
In this aspect it is correct and all the recent work is rightly following it too.

> 
> >
> > SIOV for non-backward compatible mode, anyway, need new interface and vq
> is inherently already there which fulfilling the needs.
> 
> What interface did you mean here? How much does it differ from the transport
> virtqueue?
There is no point in my derailing the discussion here. It is meaningless to invent SIOV while we are still making VF migration work.
What I know is that the non-backward-compatible mode does not need the giant PCI register set that we have today.

> 
> Again, if you keep raising unrelated topics like CC or TDISP, the discussion won't
> converge. And you are self-contradicting that you still haven't explained why
> your proposal can work in those cases.
>
The fundamental points are:
1. It does not build on the volatile concept of involving the hypervisor, as listed in TDISP, for driver to member device communication.
2. It provides a unified interface for PF, VF, and SIOV.

 
> Thanks
> 
> 
> 
> >
> > > >
> > > >>>>>> See the virtio common cfg, you will find the max number of
> > > >>>>>> vqs is there, num_queues.
> > > >>>>> :)
> > > >>>>> Sure. those values at high q count affects.
> > > >>>> the driver need to initialize them anyway.
> > > >>>>>>> Device didn’t support LM.
> > > >>>>>>> Many limitations existed all these years and TC is improving
> > > >>>>>>> and expanding
> > > >>>>>> them.
> > > >>>>>>> So all these years do not matter.
> > > >>>>>> Not sure what are you talking about, haven't we initialize
> > > >>>>>> the device and vqs in config space for years?????? What's
> > > >>>>>> wrong with this
> > > mechanism?
> > > >>>>>> Are you questioning virito-pci fundamentals???
> > > >>>>> Don’t point to in-efficient past to establish similar in-efficient future.
> > > >>>> interesting, you know this is a one-time thing, right?
> > > >>>> and you are aware of this has been there for years.
> > > >>>>>>>>>> Like how to set a queue size and enable it?
> > > >>>>>>>>> Those are meant to be used before DRIVER_OK stage as they
> > > >>>>>>>>> are init time
> > > >>>>>>>> registers.
> > > >>>>>>>>> Not to keep abusing them..
> > > >>>>>>>> don't you need to set queue_size at the destination side?
> > > >>>>>>> No.
> > > >>>>>>> But the src/dst does not matter.
> > > >>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> > > >>>>>>> registers, as all
> > > >>>>>> queues must be created before the driver_ok phase.
> > > >>>>>>> Queue_reset was last moment exception.
> > > >>>>>> create a queue? Nvidia specific?
> > > >>>>>>
> > > >>>>> Huh. No.
> > > >>>>> Do git log and realize what happened with queue_reset.
> > > >>>> You didn't answer the question, does the spec even has defined
> > > >>>> "create a vq"?
> > > >>>>>> For standard virtio, you need to read the number of enabled
> > > >>>>>> vqs at the source side, then enable them at the dst, so
> > > >>>>>> queue_size matters,
> > > not to create.
> > > >>>>> All that happens in the pre-copy phase.
> > > >>>> Yes and how your answer related to this discussion?
> >


^ permalink raw reply	[flat|nested] 186+ messages in thread

* [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-13  3:34             ` [virtio-comment] " Jason Wang
@ 2023-11-15 17:39               ` Parav Pandit
  2023-11-16  4:19                 ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-15 17:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, November 13, 2023 9:05 AM
> 
> On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Thursday, November 9, 2023 3:28 PM
> > >
> > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> > > >>
> > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > > >>>> When SUSPEND is set, device states and virtqueue states should
> > > >>>> be stablized, therefore the driver should not reset vqs when
> > > >>>> SUSPEND is set in device status.
> > > >>>>
> > > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > >>>> ---
> > > >>>>    content.tex | 3 +++
> > > >>>>    1 file changed, 3 insertions(+)
> > > >>>>
> > > >>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> > > >>>> 100644
> > > >>>> --- a/content.tex
> > > >>>> +++ b/content.tex
> > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > >>>> Reset}\label{sec:Basic
> > > Facilities of a Virtio Device /
> > > >>>>    The device MUST reset any state of a virtqueue to the default state,
> > > >>>>    including the available state and the used state.
> > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> > > >>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> > > >>>> +
> > > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> > > >>>> Facilities of a
> > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > > >>>>    After the driver tells the device to reset a queue, the
> > > >>>> driver MUST verify that
> > > >>> Seems somewhat arbitrary and breaks the claim that the feature
> > > >>> is orthogonal and can have uses besides migration.
> > > >> when suspended, the device is frozen.
> > > >> The driver is aware of this process and so should not reset the vqs I think.
> > > > Again that is only true because you want to use it for migration.
> > > > But then you can't claim it's a generic facility.
> > > I don't get it. The device status is a basic facility.
> > >
> > > We need to SUSPEND the device by setting SUSPEND bit, to stabilize
> > > the device states for migration.
> > Is the PCI's PM time not enough to suspend the device?
> 
> Are you saying we don't need virtio reset assuming we had FLR?
>
No. Often the FLR timing is not enough. Hence every PCI-level device has some sort of reset mechanism of its own.
 
> Suspending at different layers like rest at different layers.
> 
> We have both FLR and virtio reset. The Virtio level function could be reset
> without FLR. So did suspend.
> 
> That's it.
Sure, but wrapping it under some "basic facility" just does not make sense.

> 
> And if you want to rule P2P behaviours, PCI PM is really the correct way to go
> instead of trying to do it at the virtio layer.
>
PCI PM is supposed to be controlled by the guest, and so is the suspend.

The hypervisor needs its own channel to suspend the device, as the guest is fundamentally unaware of the device migration flow.
 
> > For large device I could imagine it could be short.
> >
> > In that case if there is suspend the device available, it will be used by the guest
> driver itself, hypervisor wouldn’t know about it when those registers are not
> trapped.
> > So we need two ways to suspend.
> > One is guest visible, and guest controlled.
> > Second is hypervisor control to fulfill the device migration needs.
> 
> Can you explain why suspend is special but not reset or why reset can work but
> not suspend? If reset can work, so does suspend. If reset can't, neither does
> suspend.
> 
As long as reset and suspend both are under guest control, I am fine.

> For example, can you explain how a system_reset in Qemu can work with your
> proposal?
> 
> >
> > So if you can please take a look if the proposed admin command to
> freeze/stop mode can be used in the emulated register case or not.
> 
> Again, if you design those for PCI, it's a layer violation. You have answered
They are used by the PCI layer, just like your suspend bit.
Any other transport can also use it.

> yourself that PM is the right way to go.
> 
> > It helps to have the suspend bit in guest control as well with/without
> emulation mode.
> 
> I won't repeat it again. You will find you need a full transport to satisfy all the
> requirements.
I disagree on the need for a full transport.
If you want to discuss the transport, that certainly belongs in some other thread, and I would want to see "driver notifications via such transport VQ" for it to fully qualify as a transport.
And that would be just sub-optimal in actual operation.
Hence, I wouldn’t call it a transport anymore.

> 
> >
> > > This can also be used for debugging I think.
> >
> > As Michael listed, a dedicated debug interface is usually more useful instead
> of in-band.
> 
> Well, I've shown you the in-band facilities like debugging via ethtool and kernel
> has a lot of other ones. If you have ever tried to debug in a real production
> environment, you will find how useful such handy information is where out-of-
> band facilities are often dangerous and usually prohibited or even unsupported.
Guest driver can always read and write the device status without adding a suspend bit.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-15 17:39               ` [virtio-comment] " Parav Pandit
@ 2023-11-16  4:19                 ` Jason Wang
  2023-11-16  5:27                   ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-16  4:19 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:05 AM
> >
> > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > >
> > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> > > > >>
> > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > > > >>>> When SUSPEND is set, device states and virtqueue states should
> > > > >>>> be stablized, therefore the driver should not reset vqs when
> > > > >>>> SUSPEND is set in device status.
> > > > >>>>
> > > > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > >>>> ---
> > > > >>>>    content.tex | 3 +++
> > > > >>>>    1 file changed, 3 insertions(+)
> > > > >>>>
> > > > >>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> > > > >>>> 100644
> > > > >>>> --- a/content.tex
> > > > >>>> +++ b/content.tex
> > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > >>>> Reset}\label{sec:Basic
> > > > Facilities of a Virtio Device /
> > > > >>>>    The device MUST reset any state of a virtqueue to the default state,
> > > > >>>>    including the available state and the used state.
> > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> > > > >>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> > > > >>>> +
> > > > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> > > > >>>> Facilities of a
> > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > > > >>>>    After the driver tells the device to reset a queue, the
> > > > >>>> driver MUST verify that
> > > > >>> Seems somewhat arbitrary and breaks the claim that the feature
> > > > >>> is orthogonal and can have uses besides migration.
> > > > >> when suspended, the device is frozen.
> > > > >> The driver is aware of this process and so should not reset the vqs I think.
> > > > > Again that is only true because you want to use it for migration.
> > > > > But then you can't claim it's a generic facility.
> > > > I don't get it. The device status is a basic facility.
> > > >
> > > > We need to SUSPEND the device by setting SUSPEND bit, to stabilize
> > > > the device states for migration.
> > > Is the PCI's PM time not enough to suspend the device?
> >
> > Are you saying we don't need virtio reset assuming we had FLR?
> >
> No. often FLR timing is not enough. Hence every PCI level device has some sort of its own reset mechanism.
>
> > Suspending at different layers like rest at different layers.
> >
> > We have both FLR and virtio reset. The Virtio level function could be reset
> > without FLR. So did suspend.
> >
> > That's it.
> Sure, but wrapping it under some "basic facility" is just does not make sense.

Why? Device status (e.g. reset) belongs to that part.

>
> >
> > And if you want to rule P2P behaviours, PCI PM is really the correct way to go
> > instead of trying to do it at the virtio layer.
> >
> PCI PM is supposed to be controlled by the guest and so the suspend.

I've listed issues about D3cold and others, I can't believe it can't
be controlled totally by guests.

>
> Hypervisor needs its channel to suspend the device, as fundamentally guest is unaware of device migration flow.

That's pretty fine, the hypervisor also needs its channel to reset the
device. If you think there's a conflict with suspend, there should be
one for reset as well.

>
> > > For large device I could imagine it could be short.
> > >
> > > In that case if there is suspend the device available, it will be used by the guest
> > driver itself, hypervisor wouldn’t know about it when those registers are not
> > trapped.
> > > So we need two ways to suspend.
> > > One is guest visible, and guest controlled.
> > > Second is hypervisor control to fulfill the device migration needs.
> >
> > Can you explain why suspend is special but not reset or why reset can work but
> > not suspend? If reset can work, so does suspend. If reset can't, neither does
> > suspend.
> >
> As long as reset and suspend both are under guest control, I am fine.

Well, you seem to ignore my question below. Hypervisor needs to reset
the device as well.

>
> > For example, can you explain how a system_reset in Qemu can work with your
> > proposal?
> >
> > >
> > > So if you can please take a look if the proposed admin command to
> > freeze/stop mode can be used in the emulated register case or not.
> >
> > Again, if you design those for PCI, it's a layer violation. You have answered
> They are used by the PCI layer, just like your suspend bit.
> Andy other transport can also use it.
>
> > yourself that PM is the right way to go.
> >
> > > It helps to have the suspend bit in guest control as well with/without
> > emulation mode.
> >
> > I won't repeat it again. You will find you need a full transport to satisfy all the
> > requirements.
> I disagree for full transport.

See above and the discussion in another thread.

> If you want to get discuss transport for sure it is some other thread and I want to see "driver notifications via such transport VQ" to fully qualify it as transport,
> And that would be just sub-optimal for actual working.

It is sub-optimal because the function duplicates a transport, yet it
is neither claimed nor designed as one.

> And hence, I wouldn’t call it a transport anymore.
>
> >
> > >
> > > > This can also be used for debugging I think.
> > >
> > > As Michael listed, a dedicated debug interface is usually more useful instead
> > of in-band.
> >
> > Well, I've shown you the in-band facilities like debugging via ethtool and kernel
> > has a lot of other ones. If you have ever tried to debug in a real production
> > environment, you will find how useful such handy information is where out-of-
> > band facilities are often dangerous and usually prohibited or even unsupported.
> Guest driver can always read and write the device status without adding a suspend bit.

I don't get this. Suspend makes sure the device state is frozen, which
certainly helps debugging.

Thanks

>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-16  4:19                 ` Jason Wang
@ 2023-11-16  5:27                   ` Parav Pandit
  2023-11-16 10:12                     ` Zhu, Lingshan
  2023-11-21  7:33                     ` Jason Wang
  0 siblings, 2 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-16  5:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:50 AM
> 
> On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:05 AM
> > >
> > > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > >
> > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> > > > > >>
> > > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > > > > >>>> When SUSPEND is set, device states and virtqueue states
> > > > > >>>> should be stablized, therefore the driver should not reset
> > > > > >>>> vqs when SUSPEND is set in device status.
> > > > > >>>>
> > > > > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > >>>> ---
> > > > > >>>>    content.tex | 3 +++
> > > > > >>>>    1 file changed, 3 insertions(+)
> > > > > >>>>
> > > > > >>>> diff --git a/content.tex b/content.tex index
> > > > > >>>> bcc9d4b..060b5c2
> > > > > >>>> 100644
> > > > > >>>> --- a/content.tex
> > > > > >>>> +++ b/content.tex
> > > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > > >>>> Reset}\label{sec:Basic
> > > > > Facilities of a Virtio Device /
> > > > > >>>>    The device MUST reset any state of a virtqueue to the default
> state,
> > > > > >>>>    including the available state and the used state.
> > > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> > > > > >>>> +\field{device status}, the driver SHOULD NOT reset any
> virtqueues.
> > > > > >>>> +
> > > > > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> > > > > >>>> Facilities of a
> > > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > > > > >>>>    After the driver tells the device to reset a queue, the
> > > > > >>>> driver MUST verify that
> > > > > >>> Seems somewhat arbitrary and breaks the claim that the
> > > > > >>> feature is orthogonal and can have uses besides migration.
> > > > > >> when suspended, the device is frozen.
> > > > > >> The driver is aware of this process and so should not reset the vqs I
> think.
> > > > > > Again that is only true because you want to use it for migration.
> > > > > > But then you can't claim it's a generic facility.
> > > > > I don't get it. The device status is a basic facility.
> > > > >
> > > > > We need to SUSPEND the device by setting SUSPEND bit, to
> > > > > stabilize the device states for migration.
> > > > Is the PCI's PM time not enough to suspend the device?
> > >
> > > Are you saying we don't need virtio reset assuming we had FLR?
> > >
> > No. often FLR timing is not enough. Hence every PCI level device has some
> sort of its own reset mechanism.
> >
> > > Suspending at different layers like rest at different layers.
> > >
> > > We have both FLR and virtio reset. The Virtio level function could
> > > be reset without FLR. So did suspend.
> > >
> > > That's it.
> > Sure, but wrapping it under some "basic facility" is just does not make sense.
> 
> Why, device status (e.g reset) belongs to that part.
>
Lingshan stated in the commit log that suspending the device is for live migration, yet in the discussion he portrays it as a basic facility unrelated to device migration, e.g. for debugging.
Claiming it as a non-device-migration facility does not make sense.
 
> >
> > >
> > > And if you want to rule P2P behaviours, PCI PM is really the correct
> > > way to go instead of trying to do it at the virtio layer.
> > >
> > PCI PM is supposed to be controlled by the guest and so the suspend.
> 
> I've listed issues about D3cold and others, I can't believe it can't be controlled
> totally by guests.
>
As defined by the PCI spec, D3cold is not controlled by the driver, hence it is not applicable here.
D3hot is controlled by the driver.
> >
> > Hypervisor needs its channel to suspend the device, as fundamentally guest is
> unaware of device migration flow.
> 
> That's pretty fine, the hypervisor also needs its channel to reset the device. If
> you think there's a conflict with suspend, there should be one for reset as well.
> 
I don’t see a need for the hypervisor to reset the device in passthrough mode. Can you explain why it is needed?
Do you mean it is needed in vdpa mode? If yes, the registers are emulated anyway, so why can't the member device's native channel be used in vdpa mode?

> >
> > > > For large device I could imagine it could be short.
> > > >
> > > > In that case if there is suspend the device available, it will be
> > > > used by the guest
> > > driver itself, hypervisor wouldn’t know about it when those
> > > registers are not trapped.
> > > > So we need two ways to suspend.
> > > > One is guest visible, and guest controlled.
> > > > Second is hypervisor control to fulfill the device migration needs.
> > >
> > > Can you explain why suspend is special but not reset or why reset
> > > can work but not suspend? If reset can work, so does suspend. If
> > > reset can't, neither does suspend.
> > >
> > As long as reset and suspend both are under guest control, I am fine.
> 
> Well, you seem to ignore my question below. Hypervisor needs to reset the
> device as well.
> 
Why is it needed in passthrough mode?

> >
> > > For example, can you explain how a system_reset in Qemu can work
> > > with your proposal?
> > >
> > > >
> > > > So if you can please take a look if the proposed admin command to
> > > freeze/stop mode can be used in the emulated register case or not.
> > >
> > > Again, if you design those for PCI, it's a layer violation. You have
> > > answered
> > They are used by the PCI layer, just like your suspend bit.
> > Andy other transport can also use it.
> >
> > > yourself that PM is the right way to go.
> > >
> > > > It helps to have the suspend bit in guest control as well
> > > > with/without
> > > emulation mode.
> > >
> > > I won't repeat it again. You will find you need a full transport to
> > > satisfy all the requirements.
> > I disagree for full transport.
> 
> See above and the discussion in another thread.
> 
> > If you want to get discuss transport for sure it is some other thread
> > and I want to see "driver notifications via such transport VQ" to fully qualify it
> as transport, And that would be just sub-optimal for actual working.
> 
> Sub-optimal since the function is duplicated with a transport but it doesn't
> claim or design as a transport.
>
It is not sub-optimal because of duplication. It is because you want to transport notifications via virtqueue.
 
> > And hence, I wouldn’t call it a transport anymore.
> >
> > >
> > > >
> > > > > This can also be used for debugging I think.
> > > >
> > > > As Michael listed, a dedicated debug interface is usually more
> > > > useful instead
> > > of in-band.
> > >
> > > Well, I've shown you the in-band facilities like debugging via
> > > ethtool and kernel has a lot of other ones. If you have ever tried
> > > to debug in a real production environment, you will find how useful
> > > such handy information is where out-of- band facilities are often dangerous
> and usually prohibited or even unsupported.
> > Guest driver can always read and write the device status without adding a
> suspend bit.
> 
> I don't get here. Suspend make sure the device state is frozen which helps for
> debugging for sure.
You wanted to debug some vq live; you suspend the device, and the vq state has changed.

I just don’t see suspend as a debug tool. By that logic, literally every feature is a debug feature.
Classic heisenbug effect.

One can change the driver notification frequency to see whether the interrupt rate changes, for debugging.
One can disable a few RQs and observe RSS...
A block device can change blk_size to a higher value to debug performance...
The list continues...

> 
> Thanks
> 
> >


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-15 17:35                   ` Parav Pandit
@ 2023-11-16 10:09                     ` Zhu, Lingshan
  2023-11-16 10:19                       ` Parav Pandit
  2023-11-16 12:09                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-16 10:09 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/16/2023 1:35 AM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 13, 2023 2:53 PM
>>
>> On 11/10/2023 2:31 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, November 10, 2023 11:52 AM
>>>>
>>>> On 11/9/2023 6:15 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, November 9, 2023 3:28 PM
>>>>>>
>>>>>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>>>>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>>>>>>>>> When SUSPEND is set, device states and virtqueue states should
>>>>>>>>>> be stablized, therefore the driver should not reset vqs when
>>>>>>>>>> SUSPEND is set in device status.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>       content.tex | 3 +++
>>>>>>>>>>       1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
>>>>>>>>>> 100644
>>>>>>>>>> --- a/content.tex
>>>>>>>>>> +++ b/content.tex
>>>>>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
>>>>>>>>>> Reset}\label{sec:Basic
>>>>>> Facilities of a Virtio Device /
>>>>>>>>>>       The device MUST reset any state of a virtqueue to the default
>> state,
>>>>>>>>>>       including the available state and the used state.
>>>>>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>>>>>>>>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
>>>>>>>>>> +
>>>>>>>>>>       \drivernormative{\paragraph}{Virtqueue Reset}{Basic
>>>>>>>>>> Facilities of a
>>>>>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>>>>>>>>>       After the driver tells the device to reset a queue, the
>>>>>>>>>> driver MUST verify that
>>>>>>>>> Seems somewhat arbitrary and breaks the claim that the feature
>>>>>>>>> is orthogonal and can have uses besides migration.
>>>>>>>> when suspended, the device is frozen.
>>>>>>>> The driver is aware of this process and so should not reset the vqs I
>> think.
>>>>>>> Again that is only true because you want to use it for migration.
>>>>>>> But then you can't claim it's a generic facility.
>>>>>> I don't get it. The device status is a basic facility.
>>>>>>
>>>>>> We need to SUSPEND the device by setting SUSPEND bit, to stabilize
>>>>>> the device states for migration.
>>>>> Is the PCI's PM time not enough to suspend the device?
>>>>> For large device I could imagine it could be short.
>>>> As you see, PCI PM, so this is a layer violation, virtio should be
>>>> self contained,
>>> If you think it is layer violation, than suspend bit for sure is not needed. PCI
>> PM interface should suspend/resume the device on D0<->D3 state transitions.
>> Doesn't make sense logically, because it is layer violation, so you want it to be
>> worse? For example, virito writes 0 to device status to reset a device, not by PCI.
> All these layer violation thing is just abstract to me.
> Your argument contradicts with your fellow author and yourself.
I don't see how. We keep telling you that virtio should be self-contained
and that suspending via PCI PM is a
layer violation. This is a fact, right?
>
> I don’t want to make it worse.
> If you think its layer violation, just depend on the PCI PM, no need to include new suspend bit.
Again, virtio should be self-contained and should not violate layering. For example,
we reset virtio devices
by writing 0 to device status, not by PCI FLR.
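
A minimal sketch of that point, assuming a toy device model: the status register lives at the virtio level, and writing 0 to it resets the device without any transport mechanism such as PCI FLR. The class and method names are hypothetical and used only for illustration; the status bit values are the ones defined in the virtio specification.

```python
# Device status bits as defined in the virtio specification.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8


class ToyVirtioDevice:
    """Hypothetical device model, not tied to PCI, MMIO or CCW."""

    def __init__(self):
        self.status = 0
        self.last_avail_idx = 0

    def write_status(self, value):
        # Writing 0 to device status is the virtio-level reset:
        # all device state returns to its default.
        if value == 0:
            self.status = 0
            self.last_avail_idx = 0
        else:
            self.status = value


dev = ToyVirtioDevice()
dev.write_status(ACKNOWLEDGE | DRIVER)
dev.write_status(dev.status | FEATURES_OK)
dev.write_status(dev.status | DRIVER_OK)
dev.last_avail_idx = 7   # pretend some buffers were consumed
dev.write_status(0)      # virtio reset, no FLR involved
print(dev.status, dev.last_avail_idx)
```

Nothing in this model refers to a transport, which is the sense in which the reset is self-contained.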
>
>>>> and what about MMIO and CCW?
>>> They have largely lacked the richness of PCI transport. So those transport
>> needs to evolve.
>> I am not sure CCW and MMIO maintainers want to hear this.
>>> Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will
>> continue wider use.
>> you know this SUSPEND bit work fine on all transport, right? Because
>> device_status is transport independent.
> I want to emphasize that I am not against the suspend bit as long as it is guest driver controlled without interfering the device migration flow (like rest of the state).
When migrating a device, it is the host that suspends the device. The
reason is that the live migration process should be transparent to
the guest, so we should suspend the guest first and then suspend the
device (by the host).
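
The ordering described above can be sketched as follows. This is only an illustration under stated assumptions: `ToyGuest`, `ToyDevice` and `suspend_for_migration` are hypothetical names, and the value chosen for `SUSPEND` is an assumption, not taken from the patch.

```python
SUSPEND = 0x10    # assumed bit value, for illustration only
DRIVER_OK = 4     # from the virtio specification


class ToyGuest:
    """Hypothetical stand-in for a running guest."""

    def __init__(self):
        self.running = True

    def pause(self):
        self.running = False


class ToyDevice:
    """Hypothetical stand-in for a virtio device's status register."""

    def __init__(self, status):
        self.status = status


def suspend_for_migration(guest, device):
    """Pause the guest first, then let the host set SUSPEND.

    Returns the ordered list of steps that were taken.
    """
    steps = []
    guest.pause()                  # 1. guest stops issuing new requests
    steps.append("guest paused")
    device.status |= SUSPEND       # 2. the host, not the guest, suspends
    steps.append("device suspended")
    return steps


guest = ToyGuest()
device = ToyDevice(DRIVER_OK)
print(suspend_for_migration(guest, device))
```

Because the guest is already paused when the bit is set, the guest never observes the suspend, which is what keeps the migration transparent to it.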
>
> The practical reason for suspending functionality under guest control is, that resuming/suspending the large device can take time.
> So let it be in guest driver control. No need to muddy with device migration flow.
The time cost is reasonable in O(N) no matter how you suspend/resume the 
device.
>
>>>> This should be a basic facility.
>>> Other transport can also offer like PCI.
>> Do you want to work for these transport? Implementing the new features as
>> PCI?
> Not presently as PCI as more features than rest of the two.
> What I read about ccw is: " S/390 based virtual machines support neither PCI nor MMIO".
>
> And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".
>
> So I don’t know who needs to extend ccw.
> And if one needs, those maintainers will extend it to match to PCI standard.
So these features are not even planned; don't depend on them.
>
>>>>> In that case if there is suspend the device available, it will be
>>>>> used by the
>>>> guest driver itself, hypervisor wouldn’t know about it when those
>>>> registers are not trapped.
>>>>> So we need two ways to suspend.
>>>>> One is guest visible, and guest controlled.
>>>>> Second is hypervisor control to fulfill the device migration needs.
>>>> The guest can eve reset the device.
>>>>> So if you can please take a look if the proposed admin command to
>>>> freeze/stop mode can be used in the emulated register case or not.
>>>>> It helps to have the suspend bit in guest control as well
>>>>> with/without
>>>> emulation mode.
>>>> Parav, please believe I have read your series, I didn't comment there
>>>> because I want to avoid further conflicts/debating, we have done these
>> enough.
>>> I believe the series posted in v3 can support vdpa use case as well.
>>> So I will progress to post v4.
>>>
>>>> As explained before, freeze/stop the device by PCI is a layer violation.
>>> I am afraid, we have different vision.
>>> I don’t see any layer violation.
>>> Suspend is enough in the PCI PM.
>>> Our vision is more aligned with rest of the hypervisor knobs that owns the
>> migration framework.
>> I think I have explained, virito builds on other transport and it should be self-
>> contained, so far so good.
> Virtio without any transport binding is just blank paper discussion.
Virtio is built on transports, but it is not bound to any one of them.
>
>>>> And device status can be pass-through(without emulation, just map it
>>>> to
>>>> guest) to the guest or trapped(trap and emulate by the hypervisor,
>>>> for example set_status in vDPA).
>>> When it is pass-through, it is controlled by the guest, so for example, if the
>> guest resets the device, hypervisor has lost the control of migration context etc.
>>> Hence, hypervisor needs a channel which is not guest owned.
>>>
>>> Same channel can work when trap+emulation is done.
>> It is the guest owns the device, it can reset the device, once reset, the device
>> context are cleared.
> Hypervisor do not have the ability to read/write the device context. It lost the channel as hypervisor is not involved in trap+emulation.
> So it is not helpful in one use case.
>
> Admin commands can work even with trap+emulation mode.
>
> What is missing, that should be added?
As explained above, for live migration the guest should be suspended
first. At that point
the host owns the device and has access to it.
>
>>>>>> This can also be used for debugging I think.
>>>>> As Michael listed, a dedicated debug interface is usually more
>>>>> useful instead
>>>> of in-band.
>>>> re-using another facility without extra efforts is not a bad thing anyway.
>>> I just don’t see how a suspend bit some debug feature.
>>> Almost everything with that regard is a debug feature to me.
>> suspend then check the device states?
> You already suspended the device, so device state is already changed.
> All debug information is changed, so not useful now.
When suspended, the device should keep and stabilize its device states;
at least in my series it should behave like this.
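
A minimal sketch of the intended behaviour, assuming a toy model: once SUSPEND is set, the virtqueue state (last_avail_idx and last_used_idx, as named in the cover letter) stops moving, so two reads return the same values. The class, its methods and the `SUSPEND` bit value are hypothetical.

```python
SUSPEND = 0x10   # assumed bit value, for illustration only


class ToyQueueDevice:
    """Hypothetical device with a single virtqueue."""

    def __init__(self):
        self.status = 0
        self.last_avail_idx = 0
        self.last_used_idx = 0

    def process_one(self):
        """Consume and complete one buffer unless the device is suspended."""
        if self.status & SUSPEND:
            return False   # frozen: virtqueue state must not change
        # Ring indices are 16-bit and wrap around.
        self.last_avail_idx = (self.last_avail_idx + 1) & 0xFFFF
        self.last_used_idx = (self.last_used_idx + 1) & 0xFFFF
        return True

    def vq_state(self):
        return (self.last_avail_idx, self.last_used_idx)


dev = ToyQueueDevice()
for _ in range(3):
    dev.process_one()
dev.status |= SUSPEND
first = dev.vq_state()
dev.process_one()            # ignored while suspended
second = dev.vq_state()
print(first == second)
```

Under this model, whoever inspects the device after suspension sees a stable snapshot rather than a moving target.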




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-16  5:27                   ` Parav Pandit
@ 2023-11-16 10:12                     ` Zhu, Lingshan
  2023-11-21  7:33                     ` Jason Wang
  1 sibling, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-16 10:12 UTC (permalink / raw)
  To: Parav Pandit, Jason Wang
  Cc: Michael S. Tsirkin, eperezma, cohuck, stefanha, virtio-comment



On 11/16/2023 1:27 PM, Parav Pandit wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, November 16, 2023 9:50 AM
>>
>> On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
>>>
>>>
>>>> From: Jason Wang <jasowang@redhat.com>
>>>> Sent: Monday, November 13, 2023 9:05 AM
>>>>
>>>> On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>>
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, November 9, 2023 3:28 PM
>>>>>>
>>>>>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
>>>>>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>>>>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>>>>>>>>> When SUSPEND is set, device states and virtqueue states
>>>>>>>>>> should be stablized, therefore the driver should not reset
>>>>>>>>>> vqs when SUSPEND is set in device status.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> ---
>>>>>>>>>>     content.tex | 3 +++
>>>>>>>>>>     1 file changed, 3 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/content.tex b/content.tex index
>>>>>>>>>> bcc9d4b..060b5c2
>>>>>>>>>> 100644
>>>>>>>>>> --- a/content.tex
>>>>>>>>>> +++ b/content.tex
>>>>>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
>>>>>>>>>> Reset}\label{sec:Basic
>>>>>> Facilities of a Virtio Device /
>>>>>>>>>>     The device MUST reset any state of a virtqueue to the default
>> state,
>>>>>>>>>>     including the available state and the used state.
>>>>>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>>>>>>>>>> +\field{device status}, the driver SHOULD NOT reset any
>> virtqueues.
>>>>>>>>>> +
>>>>>>>>>>     \drivernormative{\paragraph}{Virtqueue Reset}{Basic
>>>>>>>>>> Facilities of a
>>>>>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>>>>>>>>>     After the driver tells the device to reset a queue, the
>>>>>>>>>> driver MUST verify that
>>>>>>>>> Seems somewhat arbitrary and breaks the claim that the
>>>>>>>>> feature is orthogonal and can have uses besides migration.
>>>>>>>> when suspended, the device is frozen.
>>>>>>>> The driver is aware of this process and so should not reset the vqs I
>> think.
>>>>>>> Again that is only true because you want to use it for migration.
>>>>>>> But then you can't claim it's a generic facility.
>>>>>> I don't get it. The device status is a basic facility.
>>>>>>
>>>>>> We need to SUSPEND the device by setting SUSPEND bit, to
>>>>>> stabilize the device states for migration.
>>>>> Is the PCI's PM time not enough to suspend the device?
>>>> Are you saying we don't need virtio reset assuming we had FLR?
>>>>
>>> No. often FLR timing is not enough. Hence every PCI level device has some
>> sort of its own reset mechanism.
>>>> Suspending at different layers like rest at different layers.
>>>>
>>>> We have both FLR and virtio reset. The Virtio level function could
>>>> be reset without FLR. So did suspend.
>>>>
>>>> That's it.
>>> Sure, but wrapping it under some "basic facility" is just does not make sense.
>> Why, device status (e.g reset) belongs to that part.
>>
> Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
> Instead of claiming it as some non_device_migration facility does not make sense.
I said live migration is a use-case of the SUSPEND bit. I did not say 
the SUSPEND bit is only for live migration.
>   
>>>> And if you want to rule P2P behaviours, PCI PM is really the correct
>>>> way to go instead of trying to do it at the virtio layer.
>>>>
>>> PCI PM is supposed to be controlled by the guest and so the suspend.
>> I've listed issues about D3cold and others, I can't believe it can't be controlled
>> totally by guests.
>>
> D3cold is not controlled by the driver as defined by the PCI spec hence it is not applicable.
> D3hot is controlled by the driver.
>>> Hypervisor needs its channel to suspend the device, as fundamentally guest is
>> unaware of device migration flow.
>>
>> That's pretty fine, the hypervisor also needs its channel to reset the device. If
>> you think there's a conflict with suspend, there should be one for reset as well.
>>
> I don’t see a need for hypervisor to reset the device in passthrough mode. Can you explain why is it needed?
> Do you mean, it is needed in vdpa mode? If yes, the registers are emulated anyway, so why the member device's native channel cannot be used in vdpa mode?
>
>>>>> For large device I could imagine it could be short.
>>>>>
>>>>> In that case if there is suspend the device available, it will be
>>>>> used by the guest
>>>> driver itself, hypervisor wouldn’t know about it when those
>>>> registers are not trapped.
>>>>> So we need two ways to suspend.
>>>>> One is guest visible, and guest controlled.
>>>>> Second is hypervisor control to fulfill the device migration needs.
>>>> Can you explain why suspend is special but not reset or why reset
>>>> can work but not suspend? If reset can work, so does suspend. If
>>>> reset can't, neither does suspend.
>>>>
>>> As long as reset and suspend both are under guest control, I am fine.
>> Well, you seem to ignore my question below. Hypervisor needs to reset the
>> device as well.
>>
> Why is it needed in passthrough mode?
>
>>>> For example, can you explain how a system_reset in Qemu can work
>>>> with your proposal?
>>>>
>>>>> So if you can please take a look if the proposed admin command to
>>>> freeze/stop mode can be used in the emulated register case or not.
>>>>
>>>> Again, if you design those for PCI, it's a layer violation. You have
>>>> answered
>>> They are used by the PCI layer, just like your suspend bit.
>>> Andy other transport can also use it.
>>>
>>>> yourself that PM is the right way to go.
>>>>
>>>>> It helps to have the suspend bit in guest control as well
>>>>> with/without
>>>> emulation mode.
>>>>
>>>> I won't repeat it again. You will find you need a full transport to
>>>> satisfy all the requirements.
>>> I disagree for full transport.
>> See above and the discussion in another thread.
>>
>>> If you want to get discuss transport for sure it is some other thread
>>> and I want to see "driver notifications via such transport VQ" to fully qualify it
>> as transport, And that would be just sub-optimal for actual working.
>>
>> Sub-optimal since the function is duplicated with a transport but it doesn't
>> claim or design as a transport.
>>
> It is not sub-optimal because of duplication. It is because you want to transport notifications via virtqueue.
>   
>>> And hence, I wouldn’t call it a transport anymore.
>>>
>>>>>> This can also be used for debugging I think.
>>>>> As Michael listed, a dedicated debug interface is usually more
>>>>> useful instead
>>>> of in-band.
>>>>
>>>> Well, I've shown you the in-band facilities like debugging via
>>>> ethtool and kernel has a lot of other ones. If you have ever tried
>>>> to debug in a real production environment, you will find how useful
>>>> such handy information is where out-of- band facilities are often dangerous
>> and usually prohibited or even unsupported.
>>> Guest driver can always read and write the device status without adding a
>> suspend bit.
>>
>> I don't get here. Suspend make sure the device state is frozen which helps for
>> debugging for sure.
> You wanted to debug some vq live, you suspend the device, the vq state got changed.
>
> I just don’t see that suspend is a debug tool. Every feature is a debug feature literally.
> Classic heisenbug effect.
>
> One can change the driver notification frequency to see if the interrupt rate changed for debugging.
> One can disable a few RQs and see RSS...
> Blk can change blk_size to higher value to perf debug..
> The list continues..
>
>> Thanks
>>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-15 17:35                               ` Parav Pandit
@ 2023-11-16 10:14                                 ` Zhu, Lingshan
  2023-11-16 10:21                                   ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-16 10:14 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/16/2023 1:35 AM, Parav Pandit wrote:
>
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Monday, November 13, 2023 2:56 PM
>>
>>
>>
>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Friday, November 10, 2023 1:22 PM
>>>>
>>>>
>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
>>>>>>
>>>>>>
>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
>>>>>>>>
>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>>>>>
>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
>>>>>>>>>>>>>> Lingshan
>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
>>>>>>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
>>>>>>>>>>>>>>>> in PCI transport
>>>> layer.
>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>          transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>>>>>          1 file changed, 18 insertions(+)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex index
>>>>>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
>> configuration
>>>>>>>>>>>> structure
>>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>>>>                  /* About the administration virtqueue. */
>>>>>>>>>>>>>>>>                  le16 admin_queue_index;         /* read-only for driver
>>>> */
>>>>>>>>>>>>>>>>                  le16 admin_queue_num;         /* read-only for driver
>>>> */
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
>>>>>>>>>>>>>>> register read writes, does
>>>>>>>>>>>>>> not work effectively.
>>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not work?
>>>>>>>>>>>>>> Do you know how other queue related fields work?
>>>>>>>>>>>>> :)
>>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug fix
>>>>>>>>>>>>> was done when it
>>>>>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it take a
>>>>>>>>>>>>> lot of time to
>>>>>>>>>>>> read those slow register interface for this + inflight
>>>>>>>>>>>> descriptors +
>>>> more.
>>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
>>>>>>>>>>> number of
>>>>>>>>>> queues were not in hw.
>>>>>>>>>> The registers are control path in config space, how 400G or
>>>>>>>>>> 800G
>>>> affect??
>>>>>>>>> Because those are the one in practice requires large number of VQs.
>>>>>>>>>
>>>>>>>>> You are asking per VQ register commands to modify things
>>>>>>>>> dynamically via
>>>>>>>> this one vq at a time, serializing all the operations.
>>>>>>>>> It does not scale well with high q count.
>>>>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
>>>>>>>> working for many years.
>>>>>>> No. when virtio driver initializes it for the first time, there is
>>>>>>> no active traffic
>>>>>> that gets lost.
>>>>>>> This is because the interface is not yet up and not part of the network
>> yet.
>>>>>>> The resume must be fast enough, because the remote node is sending
>>>>>> packets.
>>>>>>> Hence it is different from driver init time queue enable.
>>>>>> I am not sure any packets arrive before a link announce at the
>>>>>> destination
>>>> side.
>>>>> I think it can.
>>>>> Because there is no notification of member device link down
>>>>> intimation to
>>>> remote side.
>>>>> The L4 and L5 protocols have no knowledge that node which they are
>>>> interacting is behind some layers of switches.
>>>>> So keeping this time low is desired.
>>>> The NIC should broad cast itself first, so that other peers in the
>>>> network know(for example its mac to route it) how to send a message to it.
>>>>
>>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
>>>> mechanism work for in-marketing productions for years.
>>>>
>>>> This is out of the topic anyway.
>>>>>>>>>> See the virtio common cfg, you will find the max number of vqs
>>>>>>>>>> is there, num_queues.
>>>>>>>>> :)
>>>>>>>>> Sure. those values at high q count affects.
>>>>>>>> the driver need to initialize them anyway.
>>>>>>> That is before the traffic starts from remote end.
>>>>>> see above, that needs a link announce and this is after
>>>>>> re-initialization
>>>>>>>>>>> Device didn’t support LM.
>>>>>>>>>>> Many limitations existed all these years and TC is improving
>>>>>>>>>>> and expanding
>>>>>>>>>> them.
>>>>>>>>>>> So all these years do not matter.
>>>>>>>>>> Not sure what are you talking about, haven't we initialize the
>>>>>>>>>> device and vqs in config space for years?????? What's wrong
>>>>>>>>>> with this
>>>>>> mechanism?
>>>>>>>>>> Are you questioning virito-pci fundamentals???
>>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>>>>>> interesting, you know this is a one-time thing, right?
>>>>>>>> and you are aware of this has been there for years.
>>>>>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they
>>>>>>>>>>>>> are init time
>>>>>>>>>>>> registers.
>>>>>>>>>>>>> Not to keep abusing them..
>>>>>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>>>>>> No.
>>>>>>>>>>> But the src/dst does not matter.
>>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
>>>>>>>>>>> registers, as all
>>>>>>>>>> queues must be created before the driver_ok phase.
>>>>>>>>>>> Queue_reset was last moment exception.
>>>>>>>>>> create a queue? Nvidia specific?
>>>>>>>>>>
>>>>>>>>> Huh. No.
>>>>>>>>> Do git log and realize what happened with queue_reset.
>>>>>>>> You didn't answer the question, does the spec even has defined
>>>>>>>> "create a
>>>>>> vq"?
>>>>>>> Enabled/created = tomato/tomato when discussing the spec in
>>>>>>> non-normative
>>>>>> email conversation.
>>>>>>> It's irrelevant.
>>>>>> Then lets not debate on this enable a vq or create a vq anymore
>>>>>>> All I am saying is, when we know the limitations of the transport
>>>>>>> and when industry is forwarding to not introduced more and more
>>>>>>> on-die register
>>>>>> for once in lifetime work of device migration, we just use the
>>>>>> optimal command and queue interface that is native to virtio.
>>>>>> PCI config space has its own limitations, and admin vq has its
>>>>>> advantages, but that does not apply to all use cases.
>>>>>>
>>>>> There was a recent work done emulating the SR-IOV cap and allowing
>>>>> VM to
>>>> enable SR-IOV in [1].
>>>>> This is the option I mentioned few weeks ago.
>>>>>
>>>>> So with admin commands and admin virtqueues, even nested model will
>>>>> work
>>>> using [1].
>>>>> [1]
>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload
>>>>> -o
>>>>> n-virtual-machines.html
>>>> We should take this into consideration once it is standardized in the
>>>> spec, maybe not now, there can always be many workarounds to solve one
>> problem.
>>> Sure, until that point the admin commands are able to suffice the need well.
>>> And when the spec changes in transport occurs (if needed), current admin
>> command and admin vq also fits very well that will follow above [1].
>> we have pointed lots of problems for admin vq based live migration proposal, I
>> won't repeat them here
> I don’t see any.
> Nested is already solved using above.
I don't see how. Would you mind working out the patches?
> Long time ago, you mentioned some QoS issue, which anyway exists in the device register method too.
> Can you please list them if anything other than QoS and nest?




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-16 10:09                     ` Zhu, Lingshan
@ 2023-11-16 10:19                       ` Parav Pandit
  2023-11-16 12:09                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-16 10:19 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin
  Cc: jasowang, eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, November 16, 2023 3:40 PM
> 
> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 13, 2023 2:53 PM
> >>
> >> On 11/10/2023 2:31 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 10, 2023 11:52 AM
> >>>>
> >>>> On 11/9/2023 6:15 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Thursday, November 9, 2023 3:28 PM
> >>>>>>
> >>>>>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> >>>>>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> >>>>>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> >>>>>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> >>>>>>>>>> When SUSPEND is set, device states and virtqueue states
> >>>>>>>>>> should be stablized, therefore the driver should not reset
> >>>>>>>>>> vqs when SUSPEND is set in device status.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> ---
> >>>>>>>>>>       content.tex | 3 +++
> >>>>>>>>>>       1 file changed, 3 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> >>>>>>>>>> 100644
> >>>>>>>>>> --- a/content.tex
> >>>>>>>>>> +++ b/content.tex
> >>>>>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> >>>>>>>>>> Reset}\label{sec:Basic
> >>>>>> Facilities of a Virtio Device /
> >>>>>>>>>>       The device MUST reset any state of a virtqueue to the
> >>>>>>>>>> default
> >> state,
> >>>>>>>>>>       including the available state and the used state.
> >>>>>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> >>>>>>>>>> +\field{device status}, the driver SHOULD NOT reset any
> virtqueues.
> >>>>>>>>>> +
> >>>>>>>>>>       \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> >>>>>>>>>> Facilities of a
> >>>>>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> >>>>>>>>>>       After the driver tells the device to reset a queue, the
> >>>>>>>>>> driver MUST verify that
> >>>>>>>>> Seems somewhat arbitrary and breaks the claim that the feature
> >>>>>>>>> is orthogonal and can have uses besides migration.
> >>>>>>>> when suspended, the device is frozen.
> >>>>>>>> The driver is aware of this process and so should not reset the
> >>>>>>>> vqs I
> >> think.
> >>>>>>> Again that is only true because you want to use it for migration.
> >>>>>>> But then you can't claim it's a generic facility.
> >>>>>> I don't get it. The device status is a basic facility.
> >>>>>>
> >>>>>> We need to SUSPEND the device by setting SUSPEND bit, to
> >>>>>> stabilize the device states for migration.
> >>>>> Is the PCI's PM time not enough to suspend the device?
> >>>>> For large device I could imagine it could be short.
> >>>> As you see, PCI PM, so this is a layer violation, virtio should be
> >>>> self contained,
> >>> If you think it is layer violation, than suspend bit for sure is not
> >>> needed. PCI
> >> PM interface should suspend/resume the device on D0<->D3 state
> transitions.
> >> Doesn't make sense logically, because it is layer violation, so you
> >> want it to be worse? For example, virito writes 0 to device status to reset a
> device, not by PCI.
> > All these layer violation thing is just abstract to me.
> > Your argument contradicts with your fellow author and yourself.
> I don't see how, we keep telling you virtio should be self contained, and suspend
> by PCI PM is a layer volition, this is a fact, right?
I don’t see PCI PM as a layer violation. It is used by hundreds of industry devices.
You might want to ask PCI-SIG to eliminate PCI PM and get their feedback.

> >
> > I don’t want to make it worse.
> > If you think its layer violation, just depend on the PCI PM, no need to include
> new suspend bit.
> Again, virtio should be self-contained, not layer volited, for example, we reset
> virito devices by writing 0 to device status, not by PCI FLR.
It is not there for layer adherence or layer violation purposes.

It exists because it gives much flexibility in device implementation to not depend on FLR or PM timings.
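
To make the device status semantics under discussion concrete, here is a minimal sketch in Python. It is a toy model, not spec text: the four standard status bits are the well-known ones, but the SUSPEND bit position, the `ToyDevice` class and the `driver_may_reset_vq` helper are assumptions made up for this illustration. It shows a reset by writing 0 to device status, and the patch 3/6 rule that the driver should not reset virtqueues while SUSPEND is set.

```python
# Standard device status bits from the virtio spec.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
# Assumption: placeholder bit position for SUSPEND, chosen only for this sketch.
SUSPEND = 16


class ToyDevice:
    """Hypothetical stand-in for a device exposing only the status register."""

    def __init__(self):
        self.status = 0
        self.reset_count = 0

    def write_status(self, value):
        # Writing 0 to device status resets the device, independent of PCI FLR.
        if value == 0:
            self.status = 0
            self.reset_count += 1
        else:
            self.status = value


def driver_may_reset_vq(dev):
    """Driver-side check: no virtqueue reset while SUSPEND is set."""
    return not (dev.status & SUSPEND)


if __name__ == "__main__":
    dev = ToyDevice()
    dev.write_status(ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
    print(driver_may_reset_vq(dev))
    dev.write_status(dev.status | SUSPEND)
    print(driver_may_reset_vq(dev))
    dev.write_status(0)
    print(driver_may_reset_vq(dev), dev.reset_count)
```

The sketch says nothing about who is allowed to write the status register (guest driver or hypervisor), which is the actual point of disagreement in this thread.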

> >
> >>>> and what about MMIO and CCW?
> >>> They have largely lacked the richness of PCI transport. So those
> >>> transport
> >> needs to evolve.
> >> I am not sure CCW and MMIO maintainers want to hear this.
> >>> Otherwise, PCI offers rich transport facilities compared to MMIO,
> >>> hence, it will
> >> continue wider use.
> >> you know this SUSPEND bit work fine on all transport, right? Because
> >> device_status is transport independent.
> > I want to emphasize that I am not against the suspend bit as long as it is guest
> driver controlled without interfering the device migration flow (like rest of the
> state).
> When migrate a device, it is the host who suspends the device. The reason is
> the live migration process should be transparent to the guest, so we should
> suspend the guest first, then suspend the device(by host).
This misses the fundamental point that I explained in the first paragraph of the theory of operation,
i.e. the hypervisor administers the device even while the guest is running.
This is usually referred to as the pre-copy phase.
During this stage a large part of the device context is read on the source hypervisor and written on the destination hypervisor.
This cuts down the majority of the downtime.

> >
> > The practical reason for suspending functionality under guest control is, that
> resuming/suspending the large device can take time.
> > So let it be in guest driver control. No need to muddy with device migration
> flow.
> The time cost is reasonable in O(N) no matter how you suspend/resume the
> device.
In the proposed approach using admin commands, it is nearly O(1) for N devices with M queues each.
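
The O(N) side of this exchange can be illustrated with a small Python sketch of the register-based pattern from patch 4/6: select a queue with `queue_select`, then read `queue_avail_state` and `queue_used_state`. The `CommonCfg` class, its access counter and `save_all_queue_states` are hypothetical names invented for this illustration, and the model only counts register accesses; it does not measure real latency and does not model the admin command path.

```python
class CommonCfg:
    """Toy stand-in for the PCI common configuration structure (assumption)."""

    def __init__(self, states):
        self._states = states   # list of (avail_state, used_state), one per queue
        self.queue_select = 0
        self.accesses = 0       # every register read or write is counted

    def write_queue_select(self, index):
        self.accesses += 1
        self.queue_select = index

    def read_queue_avail_state(self):
        self.accesses += 1
        return self._states[self.queue_select][0]

    def read_queue_used_state(self):
        self.accesses += 1
        return self._states[self.queue_select][1]


def save_all_queue_states(cfg, num_queues):
    """Read the state of each queue in turn, one queue_select write per queue."""
    saved = []
    for q in range(num_queues):
        cfg.write_queue_select(q)
        saved.append((cfg.read_queue_avail_state(), cfg.read_queue_used_state()))
    return saved


if __name__ == "__main__":
    cfg = CommonCfg([(q, q) for q in range(128)])
    saved = save_all_queue_states(cfg, 128)
    # One write plus two reads per queue, so the count grows linearly.
    print(len(saved), cfg.accesses)
```

In this model the access count is three per queue, so it scales linearly with the number of queues. Whether that cost is acceptable for a suspend/resume that happens once per migration is exactly what the two sides disagree on.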

> >
> >>>> This should be a basic facility.
> >>> Other transport can also offer like PCI.
> >> Do you want to work for these transport? Implementing the new
> >> features as PCI?
> > Not presently as PCI as more features than rest of the two.
> > What I read about ccw is: " S/390 based virtual machines support neither PCI
> nor MMIO".
> >
> > And I also read, "The IBM System/390 is a discontinued mainframe product
> family implementing".
> >
> > So I don’t know who needs to extend ccw.
> > And if one needs, those maintainers will extend it to match to PCI standard.
> So these features are even not planned, so don't depend on them.
My series does not depend on ccw or mmio.
I am not sure what you mean by "don’t depend on them".

> >
> >>>>> In that case if there is suspend the device available, it will be
> >>>>> used by the
> >>>> guest driver itself, hypervisor wouldn’t know about it when those
> >>>> registers are not trapped.
> >>>>> So we need two ways to suspend.
> >>>>> One is guest visible, and guest controlled.
> >>>>> Second is hypervisor control to fulfill the device migration needs.
> >>>> The guest can eve reset the device.
> >>>>> So if you can please take a look if the proposed admin command to
> >>>> freeze/stop mode can be used in the emulated register case or not.
> >>>>> It helps to have the suspend bit in guest control as well
> >>>>> with/without
> >>>> emulation mode.
> >>>> Parav, please believe I have read your series, I didn't comment
> >>>> there because I want to avoid further conflicts/debating, we have
> >>>> done these
> >> enough.
> >>> I believe the series posted in v3 can support vdpa use case as well.
> >>> So I will progress to post v4.
> >>>
> >>>> As explained before, freeze/stop the device by PCI is a layer violation.
> >>> I am afraid, we have different vision.
> >>> I don’t see any layer violation.
> >>> Suspend is enough in the PCI PM.
> >>> Our vision is more aligned with rest of the hypervisor knobs that
> >>> owns the
> >> migration framework.
> >> I think I have explained, virito builds on other transport and it
> >> should be self- contained, so far so good.
> > Virtio without any transport binding is just blank paper discussion.
> virtio is built on some transports, but not bind to any.
Not sure what you mean by "not bound to any".

Virtio objects have transport bindings, such as PCI for driver notifications, device config, queue enable, etc.

> >
> >>>> And device status can be pass-through(without emulation, just map
> >>>> it to
> >>>> guest) to the guest or trapped(trap and emulate by the hypervisor,
> >>>> for example set_status in vDPA).
> >>> When it is pass-through, it is controlled by the guest, so for
> >>> example, if the
> >> guest resets the device, hypervisor has lost the control of migration context
> etc.
> >>> Hence, hypervisor needs a channel which is not guest owned.
> >>>
> >>> Same channel can work when trap+emulation is done.
> >> It is the guest owns the device, it can reset the device, once reset,
> >> the device context are cleared.
> > Hypervisor do not have the ability to read/write the device context. It lost the
> channel as hypervisor is not involved in trap+emulation.
> > So it is not helpful in one use case.
> >
> > Admin commands can work even with trap+emulation mode.
> >
> > What is missing, that should be added?
> as explained above, when live migration, the guest should be suspended first, at
> this point, the host owns the device, it has access to the device.
This is the piece you have been missing for a long time.
In the proposed admin command approach, the guest is not suspended first; the hypervisor accesses the device while the guest is still running.

Think of this as PML for device context... (just for analogy purpose).

> >
> >>>>>> This can also be used for debugging I think.
> >>>>> As Michael listed, a dedicated debug interface is usually more
> >>>>> useful instead
> >>>> of in-band.
> >>>> re-using another facility without extra efforts is not a bad thing anyway.
> >>> I just don’t see how a suspend bit some debug feature.
> >>> Almost everything with that regard is a debug feature to me.
> >> suspend then check the device states?
> > You already suspended the device, so device state is already changed.
> > All debug information is changed, so not useful now.
> When suspended, the device should keep and stabilize its device states, at least
> in my series it should behave like this.


^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-16 10:14                                 ` Zhu, Lingshan
@ 2023-11-16 10:21                                   ` Parav Pandit
  2023-11-17 10:02                                     ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-16 10:21 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, November 16, 2023 3:45 PM
> 
> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 13, 2023 2:56 PM
> >>
> >>
> >>
> >> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 10, 2023 1:22 PM
> >>>>
> >>>>
> >>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Thursday, November 9, 2023 3:39 PM
> >>>>>>
> >>>>>>
> >>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> >>>>>>>>
> >>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>>>>>>>
> >>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> >>>>>>>>>>>>>> Lingshan
> >>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
> >>>>>>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
> >>>>>>>>>>>>>>>> in PCI transport
> >>>> layer.
> >>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>          transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>>>>>>>          1 file changed, 18 insertions(+)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> >>>>>>>>>>>>>>>> index
> >>>>>>>>>>>>>>>> a5c6719..3161519 100644
> >>>>>>>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
> >> configuration
> >>>>>>>>>>>> structure
> >>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>>>>>>>                  /* About the administration virtqueue. */
> >>>>>>>>>>>>>>>>                  le16 admin_queue_index;         /* read-only for
> driver
> >>>> */
> >>>>>>>>>>>>>>>>                  le16 admin_queue_num;         /* read-only for
> driver
> >>>> */
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +	/* Virtqueue state */
> >>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
> >>>>>>>>>>>>>>> register read writes, does
> >>>>>>>>>>>>>> not work effectively.
> >>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>>>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not
> work?
> >>>>>>>>>>>>>> Do you know how other queue related fields work?
> >>>>>>>>>>>>> :)
> >>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug
> >>>>>>>>>>>>> fix was done when it
> >>>>>>>>>>>> was introduced so that live migration can _actually_ work.
> >>>>>>>>>>>>> When queue_select is done for 128 queues serially, it take
> >>>>>>>>>>>>> a lot of time to
> >>>>>>>>>>>> read those slow register interface for this + inflight
> >>>>>>>>>>>> descriptors +
> >>>> more.
> >>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
> >>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
> >>>>>>>>>>> number of
> >>>>>>>>>> queues were not in hw.
> >>>>>>>>>> The registers are control path in config space, how 400G or
> >>>>>>>>>> 800G
> >>>> affect??
> >>>>>>>>> Because those are the one in practice requires large number of VQs.
> >>>>>>>>>
> >>>>>>>>> You are asking per VQ register commands to modify things
> >>>>>>>>> dynamically via
> >>>>>>>> this one vq at a time, serializing all the operations.
> >>>>>>>>> It does not scale well with high q count.
> >>>>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
> >>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
> >>>>>>>> working for many years.
> >>>>>>> No. when virtio driver initializes it for the first time, there
> >>>>>>> is no active traffic
> >>>>>> that gets lost.
> >>>>>>> This is because the interface is not yet up and not part of the
> >>>>>>> network
> >> yet.
> >>>>>>> The resume must be fast enough, because the remote node is
> >>>>>>> sending
> >>>>>> packets.
> >>>>>>> Hence it is different from driver init time queue enable.
> >>>>>> I am not sure any packets arrive before a link announce at the
> >>>>>> destination
> >>>> side.
> >>>>> I think it can.
> >>>>> Because there is no notification of member device link down
> >>>>> intimation to
> >>>> remote side.
> >>>>> The L4 and L5 protocols have no knowledge that node which they are
> >>>> interacting is behind some layers of switches.
> >>>>> So keeping this time low is desired.
> >>>> The NIC should broad cast itself first, so that other peers in the
> >>>> network know(for example its mac to route it) how to send a message to
> it.
> >>>>
> >>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
> >>>> mechanism work for in-marketing productions for years.
> >>>>
> >>>> This is out of the topic anyway.
> >>>>>>>>>> See the virtio common cfg, you will find the max number of
> >>>>>>>>>> vqs is there, num_queues.
> >>>>>>>>> :)
> >>>>>>>>> Sure. those values at high q count affects.
> >>>>>>>> the driver need to initialize them anyway.
> >>>>>>> That is before the traffic starts from remote end.
> >>>>>> see above, that needs a link announce and this is after
> >>>>>> re-initialization
> >>>>>>>>>>> Device didn’t support LM.
> >>>>>>>>>>> Many limitations existed all these years and TC is improving
> >>>>>>>>>>> and expanding
> >>>>>>>>>> them.
> >>>>>>>>>>> So all these years do not matter.
> >>>>>>>>>> Not sure what are you talking about, haven't we initialize
> >>>>>>>>>> the device and vqs in config space for years?????? What's
> >>>>>>>>>> wrong with this
> >>>>>> mechanism?
> >>>>>>>>>> Are you questioning virito-pci fundamentals???
> >>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
> >>>>>>>> interesting, you know this is a one-time thing, right?
> >>>>>>>> and you are aware of this has been there for years.
> >>>>>>>>>>>>>> Like how to set a queue size and enable it?
> >>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they
> >>>>>>>>>>>>> are init time
> >>>>>>>>>>>> registers.
> >>>>>>>>>>>>> Not to keep abusing them..
> >>>>>>>>>>>> don't you need to set queue_size at the destination side?
> >>>>>>>>>>> No.
> >>>>>>>>>>> But the src/dst does not matter.
> >>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> >>>>>>>>>>> registers, as all
> >>>>>>>>>> queues must be created before the driver_ok phase.
> >>>>>>>>>>> Queue_reset was last moment exception.
> >>>>>>>>>> create a queue? Nvidia specific?
> >>>>>>>>>>
> >>>>>>>>> Huh. No.
> >>>>>>>>> Do git log and realize what happened with queue_reset.
> >>>>>>>> You didn't answer the question, does the spec even has defined
> >>>>>>>> "create a
> >>>>>> vq"?
> >>>>>>> Enabled/created = tomato/tomato when discussing the spec in
> >>>>>>> non-normative
> >>>>>> email conversation.
> >>>>>>> It's irrelevant.
> >>>>>> Then lets not debate on this enable a vq or create a vq anymore
> >>>>>>> All I am saying is, when we know the limitations of the
> >>>>>>> transport and when industry is forwarding to not introduced more
> >>>>>>> and more on-die register
> >>>>>> for once in lifetime work of device migration, we just use the
> >>>>>> optimal command and queue interface that is native to virtio.
> >>>>>> PCI config space has its own limitations, and admin vq has its
> >>>>>> advantages, but that does not apply to all use cases.
> >>>>>>
> >>>>> There was a recent work done emulating the SR-IOV cap and allowing
> >>>>> VM to
> >>>> enable SR-IOV in [1].
> >>>>> This is the option I mentioned few weeks ago.
> >>>>>
> >>>>> So with admin commands and admin virtqueues, even nested model
> >>>>> will work
> >>>> using [1].
> >>>>> [1]
> >>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> >>>> We should take this into consideration once it is standardized in
> >>>> the spec, maybe not now, there can always be many workarounds to
> >>>> solve one
> >> problem.
> >>> Sure, until that point the admin commands are able to suffice the need
> well.
> >>> And when the spec changes in transport occurs (if needed), current
> >>> admin
> >> command and admin vq also fits very well that will follow above [1].
> >> we have pointed lots of problems for admin vq based live migration
> >> proposal, I won't repeat them here
> > I don’t see any.
> > Nested is already solved using above.
> I don't see how, do you mind to work out the patches?
Once the base series is completed, nested cases can be addressed.
I won't be able to work on the patches for it until we finish the first-level virtualization.

> > Long time ago, you mentioned some QoS issue, which anyway exists in the
> device register method too.
> > Can you please list them if anything other than QoS and nest?


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-16 10:09                     ` Zhu, Lingshan
  2023-11-16 10:19                       ` Parav Pandit
@ 2023-11-16 12:09                       ` Michael S. Tsirkin
  2023-11-17 10:13                         ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-16 12:09 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Thu, Nov 16, 2023 at 06:09:38PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > 
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Monday, November 13, 2023 2:53 PM
> > > 
> > > On 11/10/2023 2:31 PM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Friday, November 10, 2023 11:52 AM
> > > > > 
> > > > > On 11/9/2023 6:15 PM, Parav Pandit wrote:
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > > > > 
> > > > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > > > > > > > > > > When SUSPEND is set, device states and virtqueue states should
> > > > > > > > > > > be stablized, therefore the driver should not reset vqs when
> > > > > > > > > > > SUSPEND is set in device status.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > ---
> > > > > > > > > > >       content.tex | 3 +++
> > > > > > > > > > >       1 file changed, 3 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
> > > > > > > > > > > 100644
> > > > > > > > > > > --- a/content.tex
> > > > > > > > > > > +++ b/content.tex
> > > > > > > > > > > @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > > > > > > > > Reset}\label{sec:Basic
> > > > > > > Facilities of a Virtio Device /
> > > > > > > > > > >       The device MUST reset any state of a virtqueue to the default
> > > state,
> > > > > > > > > > >       including the available state and the used state.
> > > > > > > > > > > +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> > > > > > > > > > > +\field{device status}, the driver SHOULD NOT reset any virtqueues.
> > > > > > > > > > > +
> > > > > > > > > > >       \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> > > > > > > > > > > Facilities of a
> > > > > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > > > > > > > > > >       After the driver tells the device to reset a queue, the
> > > > > > > > > > > driver MUST verify that
> > > > > > > > > > Seems somewhat arbitrary and breaks the claim that the feature
> > > > > > > > > > is orthogonal and can have uses besides migration.
> > > > > > > > > when suspended, the device is frozen.
> > > > > > > > > The driver is aware of this process and so should not reset the vqs I
> > > think.
> > > > > > > > Again that is only true because you want to use it for migration.
> > > > > > > > But then you can't claim it's a generic facility.
> > > > > > > I don't get it. The device status is a basic facility.
> > > > > > > 
> > > > > > > We need to SUSPEND the device by setting SUSPEND bit, to stabilize
> > > > > > > the device states for migration.
> > > > > > Is the PCI's PM time not enough to suspend the device?
> > > > > > For large device I could imagine it could be short.
> > > > > As you see, PCI PM, so this is a layer violation, virtio should be
> > > > > self contained,
> > > > If you think it is layer violation, than suspend bit for sure is not needed. PCI
> > > PM interface should suspend/resume the device on D0<->D3 state transitions.
> > > Doesn't make sense logically, because it is layer violation, so you want it to be
> > > worse? For example, virito writes 0 to device status to reset a device, not by PCI.
> > All these layer violation thing is just abstract to me.
> > Your argument contradicts with your fellow author and yourself.
> I don't see how. We keep telling you virtio should be self-contained, and
> suspending via PCI PM is a layer violation. This is a fact, right?

Not really. Look at the charter - when available we should use platform
capabilities because it makes it easier to write drivers.


> > I don’t want to make it worse.
> > If you think its layer violation, just depend on the PCI PM, no need to include new suspend bit.
> Again, virtio should be self-contained, without layer violations. For example,
> we reset virtio devices by writing 0 to device status, not by PCI FLR.

There is some advantage to doing it like this, e.g. one does not need
to save and restore config space. What are the advantages of suspending
via this bit?
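
For reference, the reset-through-device-status flow mentioned above can be
sketched as follows. This is a simulation against an in-memory stand-in for
the register, not driver code: struct fake_virtio_dev, dev_write_status(),
dev_read_status() and virtio_reset() are hypothetical names, and the
three-poll delay is an arbitrary assumption. Only the sequence itself (write
0 to device status, then wait until it reads back as 0) is meant to reflect
the PCI transport text.

```c
#include <stdint.h>

/* Hypothetical in-memory stand-in for a device; a real driver would
 * access device_status through its transport (e.g. MMIO accessors). */
struct fake_virtio_dev {
    uint8_t device_status;  /* current device status bits */
    unsigned reset_delay;   /* simulated reads left until reset completes */
};

/* Writing 0 requests a reset; any other value replaces the status bits. */
static void dev_write_status(struct fake_virtio_dev *d, uint8_t v)
{
    if (v == 0)
        d->reset_delay = 3; /* assumption: reset takes three polls */
    else
        d->device_status = v;
}

/* Each read advances the simulated reset; status clears when it is done. */
static uint8_t dev_read_status(struct fake_virtio_dev *d)
{
    if (d->reset_delay > 0 && --d->reset_delay == 0)
        d->device_status = 0;
    return d->device_status;
}

/* Reset the device and return how many status reads were needed. */
static unsigned virtio_reset(struct fake_virtio_dev *d)
{
    unsigned polls = 0;

    dev_write_status(d, 0);
    do {
        polls++;
    } while (dev_read_status(d) != 0);
    return polls;
}
```

Nothing outside the status register is touched in this flow, which is the
point being made about not having to save and restore config space.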

> > 
> > > > > and what about MMIO and CCW?
> > > > They have largely lacked the richness of PCI transport. So those transport
> > > needs to evolve.
> > > I am not sure CCW and MMIO maintainers want to hear this.
> > > > Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will
> > > continue wider use.
> > > you know this SUSPEND bit work fine on all transport, right? Because
> > > device_status is transport independent.
> > I want to emphasize that I am not against the suspend bit as long as it is guest driver controlled without interfering the device migration flow (like rest of the state).
> When migrating a device, it is the host that suspends the device. The reason
> is that the live migration process should be transparent to the guest, so we
> should suspend the guest first, then suspend the device (by the host).
> > 
> > The practical reason for suspending functionality under guest control is, that resuming/suspending the large device can take time.
> > So let it be in guest driver control. No need to muddy with device migration flow.
> The time cost is reasonable in O(N) no matter how you suspend/resume the
> device.

Very much depends. Big O notation can be misleading. If you have to
repeat an operation 1000 times that's 1000 * N and suddenly you are
going from milliseconds to seconds.
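
To put rough numbers on that, here is a minimal sketch in C. It is a
back-of-the-envelope model only, not a measurement: the function name
serialized_cost_ns(), the queue count and the per-operation latency are all
assumptions picked for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical cost model (not from the spec or the patch series):
 * total time when an operation is done once per queue, one queue at a
 * time, and the whole pass over the queues is repeated `repeats` times.
 */
static uint64_t serialized_cost_ns(uint64_t num_queues,
                                   uint64_t repeats,
                                   uint64_t ns_per_op)
{
    return num_queues * repeats * ns_per_op;
}
```

With 128 queues and an assumed 10,000 ns per operation, a single pass costs
1,280,000 ns (1.28 ms), while 1000 passes cost 1,280,000,000 ns (1.28 s).
Both are O(N) in the number of queues.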


> > 
> > > > > This should be a basic facility.
> > > > Other transport can also offer like PCI.
> > > Do you want to work for these transport? Implementing the new features as
> > > PCI?
> > Not presently as PCI as more features than rest of the two.
> > What I read about ccw is: " S/390 based virtual machines support neither PCI nor MMIO".
> > 
> > And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".
> > 
> > So I don’t know who needs to extend ccw.
> > And if one needs, those maintainers will extend it to match to PCI standard.
> So these features are not even planned; don't depend on them.

But again, can one suspend a ccw device? If you are adding this feature and
claiming it's supported for all transports, you had better find out
what it does.


> > 
> > > > > > In that case if there is suspend the device available, it will be
> > > > > > used by the
> > > > > guest driver itself, hypervisor wouldn’t know about it when those
> > > > > registers are not trapped.
> > > > > > So we need two ways to suspend.
> > > > > > One is guest visible, and guest controlled.
> > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > > The guest can eve reset the device.
> > > > > > So if you can please take a look if the proposed admin command to
> > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > > > It helps to have the suspend bit in guest control as well
> > > > > > with/without
> > > > > emulation mode.
> > > > > Parav, please believe I have read your series, I didn't comment there
> > > > > because I want to avoid further conflicts/debating, we have done these
> > > enough.
> > > > I believe the series posted in v3 can support vdpa use case as well.
> > > > So I will progress to post v4.
> > > > 
> > > > > As explained before, freeze/stop the device by PCI is a layer violation.
> > > > I am afraid, we have different vision.
> > > > I don’t see any layer violation.
> > > > Suspend is enough in the PCI PM.
> > > > Our vision is more aligned with rest of the hypervisor knobs that owns the
> > > migration framework.
> > > I think I have explained, virito builds on other transport and it should be self-
> > > contained, so far so good.
> > Virtio without any transport binding is just blank paper discussion.
> virtio is built on some transports, but is not bound to any.

Binding is an OS-specific thing, but e.g. under Linux, transport drivers bind
to devices and then virtio drivers bind to the virtio bus. No binding ->
nothing works.


> > 
> > > > > And device status can be pass-through(without emulation, just map it
> > > > > to
> > > > > guest) to the guest or trapped(trap and emulate by the hypervisor,
> > > > > for example set_status in vDPA).
> > > > When it is pass-through, it is controlled by the guest, so for example, if the
> > > guest resets the device, hypervisor has lost the control of migration context etc.
> > > > Hence, hypervisor needs a channel which is not guest owned.
> > > > 
> > > > Same channel can work when trap+emulation is done.
> > > It is the guest owns the device, it can reset the device, once reset, the device
> > > context are cleared.
> > Hypervisor do not have the ability to read/write the device context. It lost the channel as hypervisor is not involved in trap+emulation.
> > So it is not helpful in one use case.
> > 
> > Admin commands can work even with trap+emulation mode.
> > 
> > What is missing, that should be added?
> As explained above, during live migration the guest should be suspended
> first. At this point the host owns the device and has access to it.

Where do you say this in the spec patch?


> > 
> > > > > > > This can also be used for debugging I think.
> > > > > > As Michael listed, a dedicated debug interface is usually more
> > > > > > useful instead
> > > > > of in-band.
> > > > > re-using another facility without extra efforts is not a bad thing anyway.
> > > > I just don’t see how a suspend bit some debug feature.
> > > > Almost everything with that regard is a debug feature to me.
> > > suspend then check the device states?
> > You already suspended the device, so device state is already changed.
> > All debug information is changed, so not useful now.
> When suspended, the device should keep and stabilize its device states,
> at least in my series it should behave like this.

That's vague. What does it mean exactly, and what happens if
some external event causes a state change?

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-16 10:21                                   ` Parav Pandit
@ 2023-11-17 10:02                                     ` Zhu, Lingshan
  2023-11-17 10:06                                       ` Parav Pandit
  2023-11-17 10:45                                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-17 10:02 UTC (permalink / raw)
  To: Parav Pandit, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment



On 11/16/2023 6:21 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Thursday, November 16, 2023 3:45 PM
>>
>> On 11/16/2023 1:35 AM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Monday, November 13, 2023 2:56 PM
>>>>
>>>>
>>>>
>>>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Friday, November 10, 2023 1:22 PM
>>>>>>
>>>>>>
>>>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
>>>>>>>>>>
>>>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
>>>>>>>>>>>>>>>> Lingshan
>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
>>>>>>>>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
>>>>>>>>>>>>>>>>>> in PCI transport
>>>>>> layer.
>>>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>           transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>>>>>>>           1 file changed, 18 insertions(+)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
>>>>>>>>>>>>>>>>>> index
>>>>>>>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
>>>> configuration
>>>>>>>>>>>>>> structure
>>>>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>>>>>>                   /* About the administration virtqueue. */
>>>>>>>>>>>>>>>>>>                   le16 admin_queue_index;         /* read-only for
>> driver
>>>>>> */
>>>>>>>>>>>>>>>>>>                   le16 admin_queue_num;         /* read-only for
>> driver
>>>>>> */
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
>>>>>>>>>>>>>>>>> register read writes, does
>>>>>>>>>>>>>>>> not work effectively.
>>>>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not
>> work?
>>>>>>>>>>>>>>>> Do you know how other queue related fields work?
>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug
>>>>>>>>>>>>>>> fix was done when it
>>>>>>>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it take
>>>>>>>>>>>>>>> a lot of time to
>>>>>>>>>>>>>> read those slow register interface for this + inflight
>>>>>>>>>>>>>> descriptors +
>>>>>> more.
>>>>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
>>>>>>>>>>>>> number of
>>>>>>>>>>>> queues were not in hw.
>>>>>>>>>>>> The registers are control path in config space, how 400G or
>>>>>>>>>>>> 800G
>>>>>> affect??
>>>>>>>>>>> Because those are the one in practice requires large number of VQs.
>>>>>>>>>>>
>>>>>>>>>>> You are asking per VQ register commands to modify things
>>>>>>>>>>> dynamically via
>>>>>>>>>> this one vq at a time, serializing all the operations.
>>>>>>>>>>> It does not scale well with high q count.
>>>>>>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
>>>>>>>>>> working for many years.
>>>>>>>>> No. when virtio driver initializes it for the first time, there
>>>>>>>>> is no active traffic
>>>>>>>> that gets lost.
>>>>>>>>> This is because the interface is not yet up and not part of the
>>>>>>>>> network
>>>> yet.
>>>>>>>>> The resume must be fast enough, because the remote node is
>>>>>>>>> sending
>>>>>>>> packets.
>>>>>>>>> Hence it is different from driver init time queue enable.
>>>>>>>> I am not sure any packets arrive before a link announce at the
>>>>>>>> destination
>>>>>> side.
>>>>>>> I think it can.
>>>>>>> Because there is no notification of member device link down
>>>>>>> intimation to
>>>>>> remote side.
>>>>>>> The L4 and L5 protocols have no knowledge that node which they are
>>>>>> interacting is behind some layers of switches.
>>>>>>> So keeping this time low is desired.
>>>>>> The NIC should broad cast itself first, so that other peers in the
>>>>>> network know(for example its mac to route it) how to send a message to
>> it.
>>>>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
>>>>>> mechanism work for in-marketing productions for years.
>>>>>>
>>>>>> This is out of the topic anyway.
>>>>>>>>>>>> See the virtio common cfg, you will find the max number of
>>>>>>>>>>>> vqs is there, num_queues.
>>>>>>>>>>> :)
>>>>>>>>>>> Sure. those values at high q count affects.
>>>>>>>>>> the driver need to initialize them anyway.
>>>>>>>>> That is before the traffic starts from remote end.
>>>>>>>> see above, that needs a link announce and this is after
>>>>>>>> re-initialization
>>>>>>>>>>>>> Device didn’t support LM.
>>>>>>>>>>>>> Many limitations existed all these years and TC is improving
>>>>>>>>>>>>> and expanding
>>>>>>>>>>>> them.
>>>>>>>>>>>>> So all these years do not matter.
>>>>>>>>>>>> Not sure what are you talking about, haven't we initialize
>>>>>>>>>>>> the device and vqs in config space for years?????? What's
>>>>>>>>>>>> wrong with this
>>>>>>>> mechanism?
>>>>>>>>>>>> Are you questioning virito-pci fundamentals???
>>>>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>>>>>>>> interesting, you know this is a one-time thing, right?
>>>>>>>>>> and you are aware of this has been there for years.
>>>>>>>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they
>>>>>>>>>>>>>>> are init time
>>>>>>>>>>>>>> registers.
>>>>>>>>>>>>>>> Not to keep abusing them..
>>>>>>>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>>>>>>>> No.
>>>>>>>>>>>>> But the src/dst does not matter.
>>>>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
>>>>>>>>>>>>> registers, as all
>>>>>>>>>>>> queues must be created before the driver_ok phase.
>>>>>>>>>>>>> Queue_reset was last moment exception.
>>>>>>>>>>>> create a queue? Nvidia specific?
>>>>>>>>>>>>
>>>>>>>>>>> Huh. No.
>>>>>>>>>>> Do git log and realize what happened with queue_reset.
>>>>>>>>>> You didn't answer the question, does the spec even has defined
>>>>>>>>>> "create a
>>>>>>>> vq"?
>>>>>>>>> Enabled/created = tomato/tomato when discussing the spec in
>>>>>>>>> non-normative
>>>>>>>> email conversation.
>>>>>>>>> It's irrelevant.
>>>>>>>> Then lets not debate on this enable a vq or create a vq anymore
>>>>>>>>> All I am saying is, when we know the limitations of the
>>>>>>>>> transport and when industry is forwarding to not introduced more
>>>>>>>>> and more on-die register
>>>>>>>> for once in lifetime work of device migration, we just use the
>>>>>>>> optimal command and queue interface that is native to virtio.
>>>>>>>> PCI config space has its own limitations, and admin vq has its
>>>>>>>> advantages, but that does not apply to all use cases.
>>>>>>>>
>>>>>>> There was a recent work done emulating the SR-IOV cap and allowing
>>>>>>> VM to
>>>>>> enable SR-IOV in [1].
>>>>>>> This is the option I mentioned few weeks ago.
>>>>>>>
>>>>>>> So with admin commands and admin virtqueues, even nested model
>>>>>>> will work
>>>>>> using [1].
>>>>>>> [1]
>>>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
>>>>>> We should take this into consideration once it is standardized in
>>>>>> the spec, maybe not now, there can always be many workarounds to
>>>>>> solve one
>>>> problem.
>>>>> Sure, until that point the admin commands are able to suffice the need
>> well.
>>>>> And when the spec changes in transport occurs (if needed), current
>>>>> admin
>>>> command and admin vq also fits very well that will follow above [1].
>>>> we have pointed lots of problems for admin vq based live migration
>>>> proposal, I won't repeat them here
>>> I don’t see any.
>>> Nested is already solved using above.
>> I don't see how, do you mind to work out the patches?
> Once the base series is completed, nested cases can be addressed.
> I won't be able to work on the patches for it until we finish the first-level virtualization.
As you know, nesting is well supported in current virtio, so please don't
break it.
>
>>> Long time ago, you mentioned some QoS issue, which anyway exists in the
>> device register method too.
>>> Can you please list them if anything other than QoS and nest?




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-17 10:02                                     ` Zhu, Lingshan
@ 2023-11-17 10:06                                       ` Parav Pandit
  2023-11-21  4:30                                         ` Jason Wang
  2023-11-17 10:45                                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-17 10:06 UTC (permalink / raw)
  To: Zhu, Lingshan, jasowang, mst, eperezma, cohuck, stefanha; +Cc: virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Friday, November 17, 2023 3:32 PM
> To: Parav Pandit <parav@nvidia.com>; jasowang@redhat.com;
> mst@redhat.com; eperezma@redhat.com; cohuck@redhat.com;
> stefanha@redhat.com
> Cc: virtio-comment@lists.oasis-open.org
> Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement
> VIRTIO_F_QUEUE_STATE
> 
> 
> 
> On 11/16/2023 6:21 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, November 16, 2023 3:45 PM
> >>
> >> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Monday, November 13, 2023 2:56 PM
> >>>>
> >>>>
> >>>>
> >>>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Friday, November 10, 2023 1:22 PM
> >>>>>>
> >>>>>>
> >>>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> >>>>>>>>>>
> >>>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of
> >>>>>>>>>>>>>>>> Zhu, Lingshan
> >>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
> >>>>>>>>>>>>>>>>>> configuration structure to support
> >>>>>>>>>>>>>>>>>> VIRTIO_F_QUEUE_STATE in PCI transport
> >>>>>> layer.
> >>>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>>>           transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>>>>>>>>>           1 file changed, 18 insertions(+)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> >>>>>>>>>>>>>>>>>> index
> >>>>>>>>>>>>>>>>>> a5c6719..3161519 100644
> >>>>>>>>>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
> >>>> configuration
> >>>>>>>>>>>>>> structure
> >>>>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>>>>>>>>>                   /* About the administration virtqueue. */
> >>>>>>>>>>>>>>>>>>                   le16 admin_queue_index;         /* read-only for
> >> driver
> >>>>>> */
> >>>>>>>>>>>>>>>>>>                   le16 admin_queue_num;         /* read-only for
> >> driver
> >>>>>> */
> >>>>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>>>> +	/* Virtqueue state */
> >>>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> >>>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> >>>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
> >>>>>>>>>>>>>>>>> register read writes, does
> >>>>>>>>>>>>>>>> not work effectively.
> >>>>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
> >>>>>>>>>>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not
> >> work?
> >>>>>>>>>>>>>>>> Do you know how other queue related fields work?
> >>>>>>>>>>>>>>> :)
> >>>>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug
> >>>>>>>>>>>>>>> fix was done when it
> >>>>>>>>>>>>>> was introduced so that live migration can _actually_ work.
> >>>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it
> >>>>>>>>>>>>>>> take a lot of time to
> >>>>>>>>>>>>>> read those slow register interface for this + inflight
> >>>>>>>>>>>>>> descriptors +
> >>>>>> more.
> >>>>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
> >>>>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not
> >>>>>>>>>>>>> present, number of
> >>>>>>>>>>>> queues were not in hw.
> >>>>>>>>>>>> The registers are control path in config space, how 400G or
> >>>>>>>>>>>> 800G
> >>>>>> affect??
> >>>>>>>>>>> Because those are the one in practice requires large number of
> VQs.
> >>>>>>>>>>>
> >>>>>>>>>>> You are asking per VQ register commands to modify things
> >>>>>>>>>>> dynamically via
> >>>>>>>>>> this one vq at a time, serializing all the operations.
> >>>>>>>>>>> It does not scale well with high q count.
> >>>>>>>>>> This is not dynamically, it only happens when SUSPEND and
> RESUME.
> >>>>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
> >>>>>>>>>> working for many years.
> >>>>>>>>> No. when virtio driver initializes it for the first time,
> >>>>>>>>> there is no active traffic
> >>>>>>>> that gets lost.
> >>>>>>>>> This is because the interface is not yet up and not part of
> >>>>>>>>> the network
> >>>> yet.
> >>>>>>>>> The resume must be fast enough, because the remote node is
> >>>>>>>>> sending
> >>>>>>>> packets.
> >>>>>>>>> Hence it is different from driver init time queue enable.
> >>>>>>>> I am not sure any packets arrive before a link announce at the
> >>>>>>>> destination
> >>>>>> side.
> >>>>>>> I think it can.
> >>>>>>> Because there is no notification of member device link down
> >>>>>>> intimation to
> >>>>>> remote side.
> >>>>>>> The L4 and L5 protocols have no knowledge that node which they
> >>>>>>> are
> >>>>>> interacting is behind some layers of switches.
> >>>>>>> So keeping this time low is desired.
> >>>>>> The NIC should broad cast itself first, so that other peers in
> >>>>>> the network know(for example its mac to route it) how to send a
> >>>>>> message to
> >> it.
> >>>>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE,
> >>>>>> similar mechanism work for in-marketing productions for years.
> >>>>>>
> >>>>>> This is out of the topic anyway.
> >>>>>>>>>>>> See the virtio common cfg, you will find the max number of
> >>>>>>>>>>>> vqs is there, num_queues.
> >>>>>>>>>>> :)
> >>>>>>>>>>> Sure. those values at high q count affects.
> >>>>>>>>>> the driver need to initialize them anyway.
> >>>>>>>>> That is before the traffic starts from remote end.
> >>>>>>>> see above, that needs a link announce and this is after
> >>>>>>>> re-initialization
> >>>>>>>>>>>>> Device didn’t support LM.
> >>>>>>>>>>>>> Many limitations existed all these years and TC is
> >>>>>>>>>>>>> improving and expanding
> >>>>>>>>>>>> them.
> >>>>>>>>>>>>> So all these years do not matter.
> >>>>>>>>>>>> Not sure what are you talking about, haven't we initialize
> >>>>>>>>>>>> the device and vqs in config space for years?????? What's
> >>>>>>>>>>>> wrong with this
> >>>>>>>> mechanism?
> >>>>>>>>>>>> Are you questioning virito-pci fundamentals???
> >>>>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient
> future.
> >>>>>>>>>> interesting, you know this is a one-time thing, right?
> >>>>>>>>>> and you are aware of this has been there for years.
> >>>>>>>>>>>>>>>> Like how to set a queue size and enable it?
> >>>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as
> >>>>>>>>>>>>>>> they are init time
> >>>>>>>>>>>>>> registers.
> >>>>>>>>>>>>>>> Not to keep abusing them..
> >>>>>>>>>>>>>> don't you need to set queue_size at the destination side?
> >>>>>>>>>>>>> No.
> >>>>>>>>>>>>> But the src/dst does not matter.
> >>>>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> >>>>>>>>>>>>> registers, as all
> >>>>>>>>>>>> queues must be created before the driver_ok phase.
> >>>>>>>>>>>>> Queue_reset was last moment exception.
> >>>>>>>>>>>> create a queue? Nvidia specific?
> >>>>>>>>>>>>
> >>>>>>>>>>> Huh. No.
> >>>>>>>>>>> Do git log and realize what happened with queue_reset.
> >>>>>>>>>> You didn't answer the question, does the spec even has
> >>>>>>>>>> defined "create a
> >>>>>>>> vq"?
> >>>>>>>>> Enabled/created = tomato/tomato when discussing the spec in
> >>>>>>>>> non-normative
> >>>>>>>> email conversation.
> >>>>>>>>> It's irrelevant.
> >>>>>>>> Then lets not debate on this enable a vq or create a vq anymore
> >>>>>>>>> All I am saying is, when we know the limitations of the
> >>>>>>>>> transport and when industry is forwarding to not introduced
> >>>>>>>>> more and more on-die register
> >>>>>>>> for once in lifetime work of device migration, we just use the
> >>>>>>>> optimal command and queue interface that is native to virtio.
> >>>>>>>> PCI config space has its own limitations, and admin vq has its
> >>>>>>>> advantages, but that does not apply to all use cases.
> >>>>>>>>
> >>>>>>> There was a recent work done emulating the SR-IOV cap and
> >>>>>>> allowing VM to
> >>>>>> enable SR-IOV in [1].
> >>>>>>> This is the option I mentioned few weeks ago.
> >>>>>>>
> >>>>>>> So with admin commands and admin virtqueues, even nested model
> >>>>>>> will work
> >>>>>> using [1].
> >>>>>>> [1]
> >>>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> >>>>>> We should take this into consideration once it is standardized in
> >>>>>> the spec, maybe not now, there can always be many workarounds to
> >>>>>> solve one
> >>>> problem.
> >>>>> Sure, until that point the admin commands are able to suffice the
> >>>>> need
> >> well.
> >>>>> And when the spec changes in transport occurs (if needed), current
> >>>>> admin
> >>>> command and admin vq also fits very well that will follow above [1].
> >>>> we have pointed lots of problems for admin vq based live migration
> >>>> proposal, I won't repeat them here
> >>> I don’t see any.
> >>> Nested is already solved using above.
> >> I don't see how, do you mind to work out the patches?
> > Once the base series is completed, nested cases can be addressed.
> > I wont be able to work on the patches for it until we finish for the first level
> virtualization.
> As you know, nested is supported well in current virtio, so please don't break it.

And the same comment repeats. 😊
Expect the same response...
Sorry, no, the virtio specification does not support device migration today.
Nothing is broken by adding new features.

Above [1] has the right proposal that Jason's paper pointed out. Please use it.


* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-16 12:09                       ` Michael S. Tsirkin
@ 2023-11-17 10:13                         ` Zhu, Lingshan
  2023-11-17 11:04                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-17 10:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment




On 11/16/2023 8:09 PM, Michael S. Tsirkin wrote:
> On Thu, Nov 16, 2023 at 06:09:38PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/16/2023 1:35 AM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>>> Sent: Monday, November 13, 2023 2:53 PM
>>>>
>>>> On 11/10/2023 2:31 PM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>>>>> Sent: Friday, November 10, 2023 11:52 AM
>>>>>>
>>>>>> On 11/9/2023 6:15 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan<lingshan.zhu@intel.com>
>>>>>>>> Sent: Thursday, November 9, 2023 3:28 PM
>>>>>>>>
>>>>>>>> On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>>>>>>>>> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>>>>>>>>>>> When SUSPEND is set, device states and virtqueue states should
>>>>>>>>>>>> be stablized, therefore the driver should not reset vqs when
>>>>>>>>>>>> SUSPEND is set in device status.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Zhu Lingshan<lingshan.zhu@intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>        content.tex | 3 +++
>>>>>>>>>>>>        1 file changed, 3 insertions(+)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
>>>>>>>>>>>> 100644
>>>>>>>>>>>> --- a/content.tex
>>>>>>>>>>>> +++ b/content.tex
>>>>>>>>>>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
>>>>>>>>>>>> Reset}\label{sec:Basic
>>>>>>>> Facilities of a Virtio Device /
>>>>>>>>>>>>        The device MUST reset any state of a virtqueue to the default
>>>> state,
>>>>>>>>>>>>        including the available state and the used state.
>>>>>>>>>>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>>>>>>>>>>>> +\field{device status}, the driver SHOULD NOT reset any virtqueues.
>>>>>>>>>>>> +
>>>>>>>>>>>>        \drivernormative{\paragraph}{Virtqueue Reset}{Basic
>>>>>>>>>>>> Facilities of a
>>>>>>>> Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>>>>>>>>>>>        After the driver tells the device to reset a queue, the
>>>>>>>>>>>> driver MUST verify that
>>>>>>>>>>> Seems somewhat arbitrary and breaks the claim that the feature
>>>>>>>>>>> is orthogonal and can have uses besides migration.
>>>>>>>>>> when suspended, the device is frozen.
>>>>>>>>>> The driver is aware of this process and so should not reset the vqs I
>>>> think.
>>>>>>>>> Again that is only true because you want to use it for migration.
>>>>>>>>> But then you can't claim it's a generic facility.
>>>>>>>> I don't get it. The device status is a basic facility.
>>>>>>>>
>>>>>>>> We need to SUSPEND the device by setting SUSPEND bit, to stabilize
>>>>>>>> the device states for migration.
>>>>>>> Is the PCI's PM time not enough to suspend the device?
>>>>>>> For large device I could imagine it could be short.
>>>>>> As you see, PCI PM, so this is a layer violation, virtio should be
>>>>>> self contained,
>>>>> If you think it is layer violation, than suspend bit for sure is not needed. PCI
>>>> PM interface should suspend/resume the device on D0<->D3 state transitions.
>>>> Doesn't make sense logically, because it is layer violation, so you want it to be
>>>> worse? For example, virito writes 0 to device status to reset a device, not by PCI.
>>> All these layer violation thing is just abstract to me.
>>> Your argument contradicts with your fellow author and yourself.
>> I don't see how, we keep telling you virtio should be self contained, and
>> suspend by PCI PM is a
>> layer volition, this is a fact, right?
> Not really. Look at the charter - when available we should use platform
> capabilities because it makes it easier to write drivers.
I think that is a transport-specific implementation, for example the PCI
common cfg.
>
>
>>> I don’t want to make it worse.
>>> If you think its layer violation, just depend on the PCI PM, no need to include new suspend bit.
>> Again, virtio should be self-contained, not layer volited, for example, we
>> reset virito devices
>> by writing 0 to device status, not by PCI FLR.
> There are some advantage to doing it like this, e.g. one does not need
> to save and restore config space. What are advatages of suspend via this
> bit?
Suspending a device through the device status is the same as how we enable a
virtio device.

Doing this through PCI is clearly a layer violation, and does not work for
other transports.
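
A minimal sketch of what suspending through the device status could look like, independent of the transport. The first four status bits are the ones the virtio specification already defines. The value used for SUSPEND and the fake_device helpers are assumptions made only for this illustration; the patch series defines the real bit.

```c
#include <assert.h>
#include <stdint.h>

/* Device status bits already defined by the virtio specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE  0x01
#define VIRTIO_STATUS_DRIVER       0x02
#define VIRTIO_STATUS_DRIVER_OK    0x04
#define VIRTIO_STATUS_FEATURES_OK  0x08
/* ASSUMPTION: bit value chosen for illustration only. */
#define VIRTIO_STATUS_SUSPEND      0x10

/* Hypothetical stand-in for a device: a single status byte. */
struct fake_device {
    uint8_t status;
};

static void fake_write_status(struct fake_device *d, uint8_t v)
{
    d->status = v;
}

static uint8_t fake_read_status(const struct fake_device *d)
{
    return d->status;
}

/* Set SUSPEND on a running device; returns 1 once the device reports it. */
static int suspend_device(struct fake_device *d)
{
    uint8_t s = fake_read_status(d);

    assert(s & VIRTIO_STATUS_DRIVER_OK); /* only a live device is suspended */
    fake_write_status(d, (uint8_t)(s | VIRTIO_STATUS_SUSPEND));
    return (fake_read_status(d) & VIRTIO_STATUS_SUSPEND) != 0;
}
```

The same read-modify-write of the status field is how DRIVER_OK is set today, which is why nothing in it depends on PCI. A real device may need time before it reports the bit, so a real driver would poll instead of reading back once.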
>
>>>>>> and what about MMIO and CCW?
>>>>> They have largely lacked the richness of PCI transport. So those transport
>>>> needs to evolve.
>>>> I am not sure CCW and MMIO maintainers want to hear this.
>>>>> Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will
>>>> continue wider use.
>>>> you know this SUSPEND bit work fine on all transport, right? Because
>>>> device_status is transport independent.
>>> I want to emphasize that I am not against the suspend bit as long as it is guest driver controlled without interfering the device migration flow (like rest of the state).
>> When migrate a device, it is the host who suspends the device. The reason is
>> the live migration process should be transparent to
>> the guest, so we should suspend the guest first, then suspend the device(by
>> host).
>>> The practical reason for suspending functionality under guest control is, that resuming/suspending the large device can take time.
>>> So let it be in guest driver control. No need to muddy with device migration flow.
>> The time cost is reasonable in O(N) no matter how you suspend/resume the
>> device.
> Very much depends. Big O notation can be misleading. If you have to
> repeat an operation 1000 times that's 1000 * N and suddenly you are
> going from milliseconds to seconds.
I mean enabling 100 queues costs more time than enabling 1 vq, no matter
how we enable them. That is O(N).
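
To make the constant factor concrete, here is a rough sketch of the register traffic for the per-queue interface. It assumes three accesses per virtqueue (one queue_select write, then one access each to queue_avail_state and queue_used_state); the per-access latency is a hypothetical parameter, not a measured value.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: one queue_select write plus one access each to
 * queue_avail_state and queue_used_state per virtqueue. */
#define ACCESSES_PER_VQ 3u

/* Number of register accesses needed to save or restore all queues. */
static uint64_t total_register_accesses(uint32_t num_queues)
{
    return (uint64_t)num_queues * ACCESSES_PER_VQ;
}

/* Total time in nanoseconds for a hypothetical per-access latency. */
static uint64_t total_time_ns(uint32_t num_queues, uint64_t ns_per_access)
{
    assert(ns_per_access > 0);
    return total_register_accesses(num_queues) * ns_per_access;
}
```

With 128 queues and a hypothetical 1000 ns per access this gives 384 accesses and 384000 ns. The growth is linear either way; what differs between the proposals is the cost of each step.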
>
>
>>>>>> This should be a basic facility.
>>>>> Other transport can also offer like PCI.
>>>> Do you want to work for these transport? Implementing the new features as
>>>> PCI?
>>> Not presently as PCI as more features than rest of the two.
>>> What I read about ccw is: " S/390 based virtual machines support neither PCI nor MMIO".
>>>
>>> And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".
>>>
>>> So I don’t know who needs to extend ccw.
>>> And if one needs, those maintainers will extend it to match to PCI standard.
>> So these features are even not planned, so don't depend on them.
> But again can one suspend ccw device? If you are adding this feature and
> claiming it's supported for all transports you better find out
> what does it do.
I am not an expert on CCW. Does anything block us from suspending a CCW device by
this bit?
This seems to be controlled only by the device itself.
>
>
>>>>>>> In that case if there is suspend the device available, it will be
>>>>>>> used by the
>>>>>> guest driver itself, hypervisor wouldn’t know about it when those
>>>>>> registers are not trapped.
>>>>>>> So we need two ways to suspend.
>>>>>>> One is guest visible, and guest controlled.
>>>>>>> Second is hypervisor control to fulfill the device migration needs.
>>>>>> The guest can eve reset the device.
>>>>>>> So if you can please take a look if the proposed admin command to
>>>>>> freeze/stop mode can be used in the emulated register case or not.
>>>>>>> It helps to have the suspend bit in guest control as well
>>>>>>> with/without
>>>>>> emulation mode.
>>>>>> Parav, please believe I have read your series, I didn't comment there
>>>>>> because I want to avoid further conflicts/debating, we have done these
>>>> enough.
>>>>> I believe the series posted in v3 can support vdpa use case as well.
>>>>> So I will progress to post v4.
>>>>>
>>>>>> As explained before, freeze/stop the device by PCI is a layer violation.
>>>>> I am afraid, we have different vision.
>>>>> I don’t see any layer violation.
>>>>> Suspend is enough in the PCI PM.
>>>>> Our vision is more aligned with rest of the hypervisor knobs that owns the
>>>> migration framework.
>>>> I think I have explained, virito builds on other transport and it should be self-
>>>> contained, so far so good.
>>> Virtio without any transport binding is just blank paper discussion.
>> virtio is built on some transports, but not bind to any.
> Binding is an OS specific thing, but e.g. under Linux transport drivers bind to
> devices then virtio drivers bind to virtio bus. No binding -> nothing
> works.
I think general facilities should work on more than one specific transport.
>
>>>>>> And device status can be pass-through(without emulation, just map it
>>>>>> to
>>>>>> guest) to the guest or trapped(trap and emulate by the hypervisor,
>>>>>> for example set_status in vDPA).
>>>>> When it is pass-through, it is controlled by the guest, so for example, if the
>>>> guest resets the device, hypervisor has lost the control of migration context etc.
>>>>> Hence, hypervisor needs a channel which is not guest owned.
>>>>>
>>>>> Same channel can work when trap+emulation is done.
>>>> It is the guest owns the device, it can reset the device, once reset, the device
>>>> context are cleared.
>>> Hypervisor do not have the ability to read/write the device context. It lost the channel as hypervisor is not involved in trap+emulation.
>>> So it is not helpful in one use case.
>>>
>>> Admin commands can work even with trap+emulation mode.
>>>
>>> What is missing, that should be added?
>> as explained above, when live migration, the guest should be suspended
>> first, at this point,
>> the host owns the device, it has access to the device.
> Where do you say this in the spec patch?
VM live migration is not in this spec.
If we suspend the device first, then the guest may detect IO errors.
>
>
>>>>>>>> This can also be used for debugging I think.
>>>>>>> As Michael listed, a dedicated debug interface is usually more
>>>>>>> useful instead
>>>>>> of in-band.
>>>>>> re-using another facility without extra efforts is not a bad thing anyway.
>>>>> I just don’t see how a suspend bit some debug feature.
>>>>> Almost everything with that regard is a debug feature to me.
>>>> suspend then check the device states?
>>> You already suspended the device, so device state is already changed.
>>> All debug information is changed, so not useful now.
>> When suspended, the device should keep and stabilize its device states,
>> at least in my series it should behave like this.
> That's vague. What does it mean exactly and what happens if
> some external event causes state change?
It is suspended, somewhat like being powered down, so it should not
respond to events until it is resumed.
>


* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-17 10:02                                     ` Zhu, Lingshan
  2023-11-17 10:06                                       ` Parav Pandit
@ 2023-11-17 10:45                                       ` Michael S. Tsirkin
  2023-11-22  1:32                                         ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 10:45 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 17, 2023 at 06:02:14PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 6:21 PM, Parav Pandit wrote:
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Thursday, November 16, 2023 3:45 PM
> > > 
> > > On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Monday, November 13, 2023 2:56 PM
> > > > > 
> > > > > 
> > > > > 
> > > > > On 11/10/2023 8:31 PM, Parav Pandit wrote:
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Friday, November 10, 2023 1:22 PM
> > > > > > > 
> > > > > > > 
> > > > > > > On 11/9/2023 6:25 PM, Parav Pandit wrote:
> > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > Sent: Thursday, November 9, 2023 3:39 PM
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On 11/9/2023 2:28 PM, Parav Pandit wrote:
> > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > Sent: Tuesday, November 7, 2023 3:02 PM
> > > > > > > > > > > 
> > > > > > > > > > > On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > Sent: Monday, November 6, 2023 2:57 PM
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > Sent: Monday, November 6, 2023 9:01 AM
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> > > > > > > > > > > > > > > > > Lingshan
> > > > > > > > > > > > > > > > > Sent: Friday, November 3, 2023 8:27 PM
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > > > From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > > > Sent: Friday, November 3, 2023 4:05 PM
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > This patch adds two new le16 fields to common
> > > > > > > > > > > > > > > > > > > configuration structure to support VIRTIO_F_QUEUE_STATE
> > > > > > > > > > > > > > > > > > > in PCI transport
> > > > > > > layer.
> > > > > > > > > > > > > > > > > > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > >           transport-pci.tex | 18 ++++++++++++++++++
> > > > > > > > > > > > > > > > > > >           1 file changed, 18 insertions(+)
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > diff --git a/transport-pci.tex b/transport-pci.tex
> > > > > > > > > > > > > > > > > > > index
> > > > > > > > > > > > > > > > > > > a5c6719..3161519 100644
> > > > > > > > > > > > > > > > > > > --- a/transport-pci.tex
> > > > > > > > > > > > > > > > > > > +++ b/transport-pci.tex
> > > > > > > > > > > > > > > > > > > @@ -325,6 +325,10 @@ \subsubsection{Common
> > > > > configuration
> > > > > > > > > > > > > > > structure
> > > > > > > > > > > > > > > > > > > layout}\label{sec:Virtio Transport
> > > > > > > > > > > > > > > > > > >                   /* About the administration virtqueue. */
> > > > > > > > > > > > > > > > > > >                   le16 admin_queue_index;         /* read-only for
> > > driver
> > > > > > > */
> > > > > > > > > > > > > > > > > > >                   le16 admin_queue_num;         /* read-only for
> > > driver
> > > > > > > */
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +	/* Virtqueue state */
> > > > > > > > > > > > > > > > > > > +        le16 queue_avail_state;         /* read-write */
> > > > > > > > > > > > > > > > > > > +        le16 queue_used_state;          /* read-write */
> > > > > > > > > > > > > > > > > > This tiny interface for 128 virtio net queues through
> > > > > > > > > > > > > > > > > > register read writes, does
> > > > > > > > > > > > > > > > > not work effectively.
> > > > > > > > > > > > > > > > > > There are inflight out of order descriptors for block also.
> > > > > > > > > > > > > > > > > > Hence toy registers like this do not work.
> > > > > > > > > > > > > > > > > Do you know there is a queue_select? Why this does not
> > > work?
> > > > > > > > > > > > > > > > > Do you know how other queue related fields work?
> > > > > > > > > > > > > > > > :)
> > > > > > > > > > > > > > > > Yes. If you notice queue_reset related critical spec bug
> > > > > > > > > > > > > > > > fix was done when it
> > > > > > > > > > > > > > > was introduced so that live migration can _actually_ work.
> > > > > > > > > > > > > > > > When queue_select is done for 128 queues serially, it take
> > > > > > > > > > > > > > > > a lot of time to
> > > > > > > > > > > > > > > read those slow register interface for this + inflight
> > > > > > > > > > > > > > > descriptors +
> > > > > > > more.
> > > > > > > > > > > > > > > interesting, virtio work in this pattern for many years, right?
> > > > > > > > > > > > > > All these years 400Gbps and 800Gbps virtio was not present,
> > > > > > > > > > > > > > number of
> > > > > > > > > > > > > queues were not in hw.
> > > > > > > > > > > > > The registers are control path in config space, how 400G or
> > > > > > > > > > > > > 800G
> > > > > > > affect??
> > > > > > > > > > > > Because those are the one in practice requires large number of VQs.
> > > > > > > > > > > > 
> > > > > > > > > > > > You are asking per VQ register commands to modify things
> > > > > > > > > > > > dynamically via
> > > > > > > > > > > this one vq at a time, serializing all the operations.
> > > > > > > > > > > > It does not scale well with high q count.
> > > > > > > > > > > This is not dynamically, it only happens when SUSPEND and RESUME.
> > > > > > > > > > > This is the same mechanism how virtio initialize a virtqueue,
> > > > > > > > > > > working for many years.
> > > > > > > > > > No. when virtio driver initializes it for the first time, there
> > > > > > > > > > is no active traffic
> > > > > > > > > that gets lost.
> > > > > > > > > > This is because the interface is not yet up and not part of the
> > > > > > > > > > network
> > > > > yet.
> > > > > > > > > > The resume must be fast enough, because the remote node is
> > > > > > > > > > sending
> > > > > > > > > packets.
> > > > > > > > > > Hence it is different from driver init time queue enable.
> > > > > > > > > I am not sure any packets arrive before a link announce at the
> > > > > > > > > destination
> > > > > > > side.
> > > > > > > > I think it can.
> > > > > > > > Because there is no notification of member device link down
> > > > > > > > intimation to
> > > > > > > remote side.
> > > > > > > > The L4 and L5 protocols have no knowledge that node which they are
> > > > > > > interacting is behind some layers of switches.
> > > > > > > > So keeping this time low is desired.
> > > > > > > The NIC should broad cast itself first, so that other peers in the
> > > > > > > network know(for example its mac to route it) how to send a message to
> > > it.
> > > > > > > This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
> > > > > > > mechanism work for in-marketing productions for years.
> > > > > > > 
> > > > > > > This is out of the topic anyway.
> > > > > > > > > > > > > See the virtio common cfg, you will find the max number of
> > > > > > > > > > > > > vqs is there, num_queues.
> > > > > > > > > > > > :)
> > > > > > > > > > > > Sure. those values at high q count affects.
> > > > > > > > > > > the driver need to initialize them anyway.
> > > > > > > > > > That is before the traffic starts from remote end.
> > > > > > > > > see above, that needs a link announce and this is after
> > > > > > > > > re-initialization
> > > > > > > > > > > > > > Device didn’t support LM.
> > > > > > > > > > > > > > Many limitations existed all these years and TC is improving
> > > > > > > > > > > > > > and expanding
> > > > > > > > > > > > > them.
> > > > > > > > > > > > > > So all these years do not matter.
> > > > > > > > > > > > > Not sure what are you talking about, haven't we initialize
> > > > > > > > > > > > > the device and vqs in config space for years?????? What's
> > > > > > > > > > > > > wrong with this
> > > > > > > > > mechanism?
> > > > > > > > > > > > > Are you questioning virito-pci fundamentals???
> > > > > > > > > > > > Don’t point to in-efficient past to establish similar in-efficient future.
> > > > > > > > > > > interesting, you know this is a one-time thing, right?
> > > > > > > > > > > and you are aware of this has been there for years.
> > > > > > > > > > > > > > > > > Like how to set a queue size and enable it?
> > > > > > > > > > > > > > > > Those are meant to be used before DRIVER_OK stage as they
> > > > > > > > > > > > > > > > are init time
> > > > > > > > > > > > > > > registers.
> > > > > > > > > > > > > > > > Not to keep abusing them..
> > > > > > > > > > > > > > > don't you need to set queue_size at the destination side?
> > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > But the src/dst does not matter.
> > > > > > > > > > > > > > Queue_size to be set before DRIVER_OK like rest of the
> > > > > > > > > > > > > > registers, as all
> > > > > > > > > > > > > queues must be created before the driver_ok phase.
> > > > > > > > > > > > > > Queue_reset was last moment exception.
> > > > > > > > > > > > > create a queue? Nvidia specific?
> > > > > > > > > > > > > 
> > > > > > > > > > > > Huh. No.
> > > > > > > > > > > > Do git log and realize what happened with queue_reset.
> > > > > > > > > > > You didn't answer the question, does the spec even has defined
> > > > > > > > > > > "create a
> > > > > > > > > vq"?
> > > > > > > > > > Enabled/created = tomato/tomato when discussing the spec in
> > > > > > > > > > non-normative
> > > > > > > > > email conversation.
> > > > > > > > > > It's irrelevant.
> > > > > > > > > Then lets not debate on this enable a vq or create a vq anymore
> > > > > > > > > > All I am saying is, when we know the limitations of the
> > > > > > > > > > transport and when industry is forwarding to not introduced more
> > > > > > > > > > and more on-die register
> > > > > > > > > for once in lifetime work of device migration, we just use the
> > > > > > > > > optimal command and queue interface that is native to virtio.
> > > > > > > > > PCI config space has its own limitations, and admin vq has its
> > > > > > > > > advantages, but that does not apply to all use cases.
> > > > > > > > > 
> > > > > > > > There was a recent work done emulating the SR-IOV cap and allowing
> > > > > > > > VM to
> > > > > > > enable SR-IOV in [1].
> > > > > > > > This is the option I mentioned few weeks ago.
> > > > > > > > 
> > > > > > > > So with admin commands and admin virtqueues, even nested model
> > > > > > > > will work
> > > > > > > using [1].
> > > > > > > > [1]
> > > > > > > > > https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> > > > > > > We should take this into consideration once it is standardized in
> > > > > > > the spec, maybe not now, there can always be many workarounds to
> > > > > > > solve one
> > > > > problem.
> > > > > > Sure, until that point the admin commands are able to suffice the need
> > > well.
> > > > > > And when the spec changes in transport occurs (if needed), current
> > > > > > admin
> > > > > command and admin vq also fits very well that will follow above [1].
> > > > > we have pointed lots of problems for admin vq based live migration
> > > > > proposal, I won't repeat them here
> > > > I don’t see any.
> > > > Nested is already solved using above.
> > > I don't see how, do you mind to work out the patches?
> > Once the base series is completed, nested cases can be addressed.
> > I wont be able to work on the patches for it until we finish for the first level virtualization.
> As you know, nested is supported well in current virtio, so please don't
> break it.

So for nesting, it seems cleaner to support sending commands through
the device itself.  You aren't going to fit VQ state in a 16-bit register in
the general case though, and will have to resort to DMA. And if you are
doing that then please just use the admin command format (it does not have
to be a VQ) and then we can all make peace finally.
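
A rough sketch of the size problem, under stated assumptions: a packed-ring position is taken to be a 15-bit index plus a 1-bit wrap counter, which fills a 16-bit register exactly, and in-flight out-of-order descriptors (mentioned earlier in this thread for block devices) are taken to need one 16-bit id each. Both encodings are illustrative, not quoted from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 15-bit index in bits 0..14, wrap counter in bit 15. */
static uint16_t pack_vq_state(uint16_t index, unsigned wrap)
{
    assert(index < 0x8000 && wrap <= 1);
    return (uint16_t)(index | (wrap << 15));
}

static uint16_t state_index(uint16_t state)
{
    return state & 0x7fff;
}

static unsigned state_wrap(uint16_t state)
{
    return state >> 15;
}

/* Worst-case bytes needed to list in-flight descriptors, at one
 * 16-bit descriptor id per entry (hypothetical encoding). */
static size_t inflight_list_bytes(uint16_t queue_size)
{
    return (size_t)queue_size * sizeof(uint16_t);
}
```

The index and wrap counter fit, but a 256-entry queue could need up to 512 bytes for the in-flight list under this encoding, which is where a buffer transferred by DMA comes in.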

-- 
MST





* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-17 10:13                         ` Zhu, Lingshan
@ 2023-11-17 11:04                           ` Michael S. Tsirkin
  2023-11-22  1:41                             ` Zhu, Lingshan
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-17 11:04 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 17, 2023 at 06:13:50PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 8:09 PM, Michael S. Tsirkin wrote:
> 
>     On Thu, Nov 16, 2023 at 06:09:38PM +0800, Zhu, Lingshan wrote:
> 
> 
>         On 11/16/2023 1:35 AM, Parav Pandit wrote:
> 
>                 From: Zhu, Lingshan <lingshan.zhu@intel.com>
>                 Sent: Monday, November 13, 2023 2:53 PM
> 
>                 On 11/10/2023 2:31 PM, Parav Pandit wrote:
> 
>                         From: Zhu, Lingshan <lingshan.zhu@intel.com>
>                         Sent: Friday, November 10, 2023 11:52 AM
> 
>                         On 11/9/2023 6:15 PM, Parav Pandit wrote:
> 
>                                 From: Zhu, Lingshan <lingshan.zhu@intel.com>
>                                 Sent: Thursday, November 9, 2023 3:28 PM
> 
>                                 On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> 
>                                     On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> 
>                                         On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> 
>                                         On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> 
>                                         When SUSPEND is set, device states and virtqueue states should
>                                         be stablized, therefore the driver should not reset vqs when
>                                         SUSPEND is set in device status.
> 
>                                         Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>                                         ---
>                                               content.tex | 3 +++
>                                               1 file changed, 3 insertions(+)
> 
>                                         diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
>                                         100644
>                                         --- a/content.tex
>                                         +++ b/content.tex
>                                         @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
>                                         Reset}\label{sec:Basic
> 
>                                 Facilities of a Virtio Device /
> 
>                                               The device MUST reset any state of a virtqueue to the default
> 
>                 state,
> 
>                                               including the available state and the used state.
>                                         +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>                                         +\field{device status}, the driver SHOULD NOT reset any virtqueues.
>                                         +
>                                               \drivernormative{\paragraph}{Virtqueue Reset}{Basic
>                                         Facilities of a
> 
>                                 Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> 
>                                               After the driver tells the device to reset a queue, the
>                                         driver MUST verify that
> 
>                                         Seems somewhat arbitrary and breaks the claim that the feature
>                                         is orthogonal and can have uses besides migration.
> 
>                                         when suspended, the device is frozen.
>                                         The driver is aware of this process and so should not reset the vqs I
> 
>                 think.
> 
>                                     Again that is only true because you want to use it for migration.
>                                     But then you can't claim it's a generic facility.
> 
>                                 I don't get it. The device status is a basic facility.
> 
>                                 We need to SUSPEND the device by setting SUSPEND bit, to stabilize
>                                 the device states for migration.
> 
>                             Is the PCI's PM time not enough to suspend the device?
>                             For large device I could imagine it could be short.
> 
>                         As you see, PCI PM, so this is a layer violation, virtio should be
>                         self contained,
> 
>                     If you think it is layer violation, than suspend bit for sure is not needed. PCI
> 
>                 PM interface should suspend/resume the device on D0<->D3 state transitions.
>                 Doesn't make sense logically, because it is layer violation, so you want it to be
>                 worse? For example, virito writes 0 to device status to reset a device, not by PCI.
> 
>             All these layer violation thing is just abstract to me.
>             Your argument contradicts with your fellow author and yourself.
> 
>         I don't see how, we keep telling you virtio should be self contained, and
>         suspend by PCI PM is a
>         layer volition, this is a fact, right?
> 
>     Not really. Look at the charter - when available we should use platform
>     capabilities because it makes it easier to write drivers.
> 
> I think that is transport specific implementation, for example pci common cfg.
> 
> 
> 
> 
>             I don’t want to make it worse.
>             If you think its layer violation, just depend on the PCI PM, no need to include new suspend bit.
> 
>         Again, virtio should be self-contained, not layer volited, for example, we
>         reset virito devices
>         by writing 0 to device status, not by PCI FLR.
> 
>     There are some advantage to doing it like this, e.g. one does not need
>     to save and restore config space. What are advatages of suspend via this
>     bit?
> 
> suspend a device by the device status is the same as how we enable a virito
> device.
> 
> Doing this by PCI is clearly a layer volition, and does not work for other
> transports.
> 
> 
> 
>                         and what about MMIO and CCW?
> 
>                     They have largely lacked the richness of PCI transport. So those transport
> 
>                 needs to evolve.
>                 I am not sure CCW and MMIO maintainers want to hear this.
> 
>                     Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will
> 
>                 continue wider use.
>                 you know this SUSPEND bit work fine on all transport, right? Because
>                 device_status is transport independent.
> 
>             I want to emphasize that I am not against the suspend bit as long as it is guest driver controlled without interfering the device migration flow (like rest of the state).
> 
>         When migrate a device, it is the host who suspends the device. The reason is
>         the live migration process should be transparent to
>         the guest, so we should suspend the guest first, then suspend the device(by
>         host).
> 
>             The practical reason for suspending functionality under guest control is, that resuming/suspending the large device can take time.
>             So let it be in guest driver control. No need to muddy with device migration flow.
> 
>         The time cost is reasonable in O(N) no matter how you suspend/resume the
>         device.
> 
>     Very much depends. Big O notation can be misleading. If you have to
>     repeat an operation 1000 times that's 1000 * N and suddenly you are
>     going from milliseconds to seconds.
> 
> I mean enable 100 queues cost more time then enable 1 vq no matter
> how we enable it. that is O(N)

Depends on what "that" is. Number of VM exits does not have to be O(N),
you can pass these 100 queues in memory.
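
A minimal sketch of that distinction, assuming a toy device model in which every trapped register access costs one VM exit. The class and function names below (`SimDevice`, `save_via_registers`, `save_via_memory`) are hypothetical and appear in neither patch series; only `queue_select`, `queue_avail_state` and `queue_used_state` echo the fields quoted in this thread. Both paths copy O(N) state, but only the register path takes O(N) exits.

```python
from dataclasses import dataclass


@dataclass
class VqState:
    """Per-queue state, named after the quoted queue_avail_state/queue_used_state."""
    avail_state: int = 0
    used_state: int = 0


class SimDevice:
    """Toy device model (hypothetical): each trapped register access is one VM exit."""

    def __init__(self, num_queues):
        self.queues = [VqState(i, i) for i in range(num_queues)]
        self.queue_select = 0
        self.exits = 0

    def write_queue_select(self, index):
        self.exits += 1
        self.queue_select = index

    def read_avail_state(self):
        self.exits += 1
        return self.queues[self.queue_select].avail_state

    def read_used_state(self):
        self.exits += 1
        return self.queues[self.queue_select].used_state

    def dump_all_to_memory(self, buffer):
        """One trapped doorbell; the device then writes every queue's state to memory."""
        self.exits += 1
        buffer.extend(VqState(q.avail_state, q.used_state) for q in self.queues)


def save_via_registers(dev):
    """Register path: select each queue, then read its two state fields."""
    saved = []
    for i in range(len(dev.queues)):
        dev.write_queue_select(i)
        saved.append(VqState(dev.read_avail_state(), dev.read_used_state()))
    return saved


def save_via_memory(dev):
    """Memory path: a single notification, state for all queues lands in one buffer."""
    saved = []
    dev.dump_all_to_memory(saved)
    return saved
```

In this model, saving 100 queues costs 300 trapped accesses on the register path and one on the memory path, while the saved state is identical. Real costs depend on which accesses the hypervisor actually traps.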


> 
> 
> 
>                         This should be a basic facility.
> 
>                     Other transport can also offer like PCI.
> 
>                 Do you want to work for these transport? Implementing the new features as
>                 PCI?
> 
>             Not presently as PCI as more features than rest of the two.
>             What I read about ccw is: " S/390 based virtual machines support neither PCI nor MMIO".
> 
>             And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".
> 
>             So I don’t know who needs to extend ccw.
>             And if one needs, those maintainers will extend it to match to PCI standard.
> 
>         So these features are even not planned, so don't depend on them.
> 
>     But again can one suspend ccw device? If you are adding this feature and
>     claiming it's supported for all transports you better find out
>     what does it do.
> 
> I am not an expert on CCW, anything block we suspend a CCW device by this bit?

I don't think CCW supports suspend at all.

> This seems only controlled by the device itself.
> 

And? What is the point of suspending only the device if the rest of the
system is still going?

> 
> 
>                             In that case if there is suspend the device available, it will be
>                             used by the
> 
>                         guest driver itself, hypervisor wouldn’t know about it when those
>                         registers are not trapped.
> 
>                             So we need two ways to suspend.
>                             One is guest visible, and guest controlled.
>                             Second is hypervisor control to fulfill the device migration needs.
> 
>                         The guest can eve reset the device.
> 
>                             So if you can please take a look if the proposed admin command to
> 
>                         freeze/stop mode can be used in the emulated register case or not.
> 
>                             It helps to have the suspend bit in guest control as well
>                             with/without
> 
>                         emulation mode.
>                         Parav, please believe I have read your series, I didn't comment there
>                         because I want to avoid further conflicts/debating, we have done these
> 
>                 enough.
> 
>                     I believe the series posted in v3 can support vdpa use case as well.
>                     So I will progress to post v4.
> 
> 
>                         As explained before, freeze/stop the device by PCI is a layer violation.
> 
>                     I am afraid, we have different vision.
>                     I don’t see any layer violation.
>                     Suspend is enough in the PCI PM.
>                     Our vision is more aligned with rest of the hypervisor knobs that owns the
> 
>                 migration framework.
>                 I think I have explained, virito builds on other transport and it should be self-
>                 contained, so far so good.
> 
>             Virtio without any transport binding is just blank paper discussion.
> 
>         virtio is built on some transports, but not bind to any.
> 
>     Binding is an OS specific thing, but e.g. under Linux transport drivers bind to
>     devices then virtio drivers bind to virtio bus. No binding -> nothing
>     works.
> 
> I think general facilities are better not only work on a specific transport
> 

But platform facilities are even better: we don't need to work on them at
all.


> 
>                         And device status can be pass-through(without emulation, just map it
>                         to
>                         guest) to the guest or trapped(trap and emulate by the hypervisor,
>                         for example set_status in vDPA).
> 
>                     When it is pass-through, it is controlled by the guest, so for example, if the
> 
>                 guest resets the device, hypervisor has lost the control of migration context etc.
> 
>                     Hence, hypervisor needs a channel which is not guest owned.
> 
>                     Same channel can work when trap+emulation is done.
> 
>                 It is the guest owns the device, it can reset the device, once reset, the device
>                 context are cleared.
> 
>             Hypervisor do not have the ability to read/write the device context. It lost the channel as hypervisor is not involved in trap+emulation.
>             So it is not helpful in one use case.
> 
>             Admin commands can work even with trap+emulation mode.
> 
>             What is missing, that should be added?
> 
>         as explained above, when live migration, the guest should be suspended
>         first, at this point,
>         the host owns the device, it has access to the device.
> 
>     Where do you say this in the spec patch?
> 
> VM live migration is not in this spec.

Then it should be.

> If we suspend the device first, then the guest may detect IO errors.
> 

That's bad. So you need to tell driver what not to do so as not to get
errors.

> 
> 
>                                 This can also be used for debugging I think.
> 
>                             As Michael listed, a dedicated debug interface is usually more
>                             useful instead
> 
>                         of in-band.
>                         re-using another facility without extra efforts is not a bad thing anyway.
> 
>                     I just don’t see how a suspend bit some debug feature.
>                     Almost everything with that regard is a debug feature to me.
> 
>                 suspend then check the device states?
> 
>             You already suspended the device, so device state is already changed.
>             All debug information is changed, so not useful now.
> 
>         When suspended, the device should keep and stabilize its device states,
>         at least in my series it should behave like this.
> 
>     That's vague. What does it mean exactly and what happens if
>     some external event causes state change?
> 
> it is suspended, somehow like powered-down, so it should not
> respond to the events until resume.

"somehow" is too vague for the spec.

-- 
MST



* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-17 10:06                                       ` Parav Pandit
@ 2023-11-21  4:30                                         ` Jason Wang
  2023-11-21 16:26                                           ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-21  4:30 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, mst, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 17, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Friday, November 17, 2023 3:32 PM
> > To: Parav Pandit <parav@nvidia.com>; jasowang@redhat.com;
> > mst@redhat.com; eperezma@redhat.com; cohuck@redhat.com;
> > stefanha@redhat.com
> > Cc: virtio-comment@lists.oasis-open.org
> > Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement
> > VIRTIO_F_QUEUE_STATE
> >
> >
> >
> > On 11/16/2023 6:21 PM, Parav Pandit wrote:
> > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Thursday, November 16, 2023 3:45 PM
> > >>
> > >> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>> Sent: Monday, November 13, 2023 2:56 PM
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> > >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>> Sent: Friday, November 10, 2023 1:22 PM
> > >>>>>>
> > >>>>>>
> > >>>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> > >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> > >>>>>>>>>>
> > >>>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > >>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > >>>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> > >>>>>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of
> > >>>>>>>>>>>>>>>> Zhu, Lingshan
> > >>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > >>>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
> > >>>>>>>>>>>>>>>>>> configuration structure to support
> > >>>>>>>>>>>>>>>>>> VIRTIO_F_QUEUE_STATE in PCI transport
> > >>>>>> layer.
> > >>>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>>>>>           transport-pci.tex | 18 ++++++++++++++++++
> > >>>>>>>>>>>>>>>>>>           1 file changed, 18 insertions(+)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> > >>>>>>>>>>>>>>>>>> index
> > >>>>>>>>>>>>>>>>>> a5c6719..3161519 100644
> > >>>>>>>>>>>>>>>>>> --- a/transport-pci.tex
> > >>>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
> > >>>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
> > >>>> configuration
> > >>>>>>>>>>>>>> structure
> > >>>>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
> > >>>>>>>>>>>>>>>>>>                   /* About the administration virtqueue. */
> > >>>>>>>>>>>>>>>>>>                   le16 admin_queue_index;         /* read-only for
> > >> driver
> > >>>>>> */
> > >>>>>>>>>>>>>>>>>>                   le16 admin_queue_num;         /* read-only for
> > >> driver
> > >>>>>> */
> > >>>>>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> +  /* Virtqueue state */
> > >>>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> > >>>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> > >>>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
> > >>>>>>>>>>>>>>>>> register read writes, does
> > >>>>>>>>>>>>>>>> not work effectively.
> > >>>>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
> > >>>>>>>>>>>>>>>>> Hence toy registers like this do not work.
> > >>>>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not
> > >> work?
> > >>>>>>>>>>>>>>>> Do you know how other queue related fields work?
> > >>>>>>>>>>>>>>> :)
> > >>>>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug
> > >>>>>>>>>>>>>>> fix was done when it
> > >>>>>>>>>>>>>> was introduced so that live migration can _actually_ work.
> > >>>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it
> > >>>>>>>>>>>>>>> take a lot of time to
> > >>>>>>>>>>>>>> read those slow register interface for this + inflight
> > >>>>>>>>>>>>>> descriptors +
> > >>>>>> more.
> > >>>>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
> > >>>>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not
> > >>>>>>>>>>>>> present, number of
> > >>>>>>>>>>>> queues were not in hw.
> > >>>>>>>>>>>> The registers are control path in config space, how 400G or
> > >>>>>>>>>>>> 800G
> > >>>>>> affect??
> > >>>>>>>>>>> Because those are the one in practice requires large number of
> > VQs.
> > >>>>>>>>>>>
> > >>>>>>>>>>> You are asking per VQ register commands to modify things
> > >>>>>>>>>>> dynamically via
> > >>>>>>>>>> this one vq at a time, serializing all the operations.
> > >>>>>>>>>>> It does not scale well with high q count.
> > >>>>>>>>>> This is not dynamically, it only happens when SUSPEND and
> > RESUME.
> > >>>>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
> > >>>>>>>>>> working for many years.
> > >>>>>>>>> No. when virtio driver initializes it for the first time,
> > >>>>>>>>> there is no active traffic
> > >>>>>>>> that gets lost.
> > >>>>>>>>> This is because the interface is not yet up and not part of
> > >>>>>>>>> the network
> > >>>> yet.
> > >>>>>>>>> The resume must be fast enough, because the remote node is
> > >>>>>>>>> sending
> > >>>>>>>> packets.
> > >>>>>>>>> Hence it is different from driver init time queue enable.
> > >>>>>>>> I am not sure any packets arrive before a link announce at the
> > >>>>>>>> destination
> > >>>>>> side.
> > >>>>>>> I think it can.
> > >>>>>>> Because there is no notification of member device link down
> > >>>>>>> intimation to
> > >>>>>> remote side.
> > >>>>>>> The L4 and L5 protocols have no knowledge that node which they
> > >>>>>>> are
> > >>>>>> interacting is behind some layers of switches.
> > >>>>>>> So keeping this time low is desired.
> > >>>>>> The NIC should broad cast itself first, so that other peers in
> > >>>>>> the network know(for example its mac to route it) how to send a
> > >>>>>> message to
> > >> it.
> > >>>>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE,
> > >>>>>> similar mechanism work for in-marketing productions for years.
> > >>>>>>
> > >>>>>> This is out of the topic anyway.
> > >>>>>>>>>>>> See the virtio common cfg, you will find the max number of
> > >>>>>>>>>>>> vqs is there, num_queues.
> > >>>>>>>>>>> :)
> > >>>>>>>>>>> Sure. those values at high q count affects.
> > >>>>>>>>>> the driver need to initialize them anyway.
> > >>>>>>>>> That is before the traffic starts from remote end.
> > >>>>>>>> see above, that needs a link announce and this is after
> > >>>>>>>> re-initialization
> > >>>>>>>>>>>>> Device didn’t support LM.
> > >>>>>>>>>>>>> Many limitations existed all these years and TC is
> > >>>>>>>>>>>>> improving and expanding
> > >>>>>>>>>>>> them.
> > >>>>>>>>>>>>> So all these years do not matter.
> > >>>>>>>>>>>> Not sure what are you talking about, haven't we initialize
> > >>>>>>>>>>>> the device and vqs in config space for years?????? What's
> > >>>>>>>>>>>> wrong with this
> > >>>>>>>> mechanism?
> > >>>>>>>>>>>> Are you questioning virito-pci fundamentals???
> > >>>>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient
> > future.
> > >>>>>>>>>> interesting, you know this is a one-time thing, right?
> > >>>>>>>>>> and you are aware of this has been there for years.
> > >>>>>>>>>>>>>>>> Like how to set a queue size and enable it?
> > >>>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as
> > >>>>>>>>>>>>>>> they are init time
> > >>>>>>>>>>>>>> registers.
> > >>>>>>>>>>>>>>> Not to keep abusing them..
> > >>>>>>>>>>>>>> don't you need to set queue_size at the destination side?
> > >>>>>>>>>>>>> No.
> > >>>>>>>>>>>>> But the src/dst does not matter.
> > >>>>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> > >>>>>>>>>>>>> registers, as all
> > >>>>>>>>>>>> queues must be created before the driver_ok phase.
> > >>>>>>>>>>>>> Queue_reset was last moment exception.
> > >>>>>>>>>>>> create a queue? Nvidia specific?
> > >>>>>>>>>>>>
> > >>>>>>>>>>> Huh. No.
> > >>>>>>>>>>> Do git log and realize what happened with queue_reset.
> > >>>>>>>>>> You didn't answer the question, does the spec even has
> > >>>>>>>>>> defined "create a
> > >>>>>>>> vq"?
> > >>>>>>>>> Enabled/created = tomato/tomato when discussing the spec in
> > >>>>>>>>> non-normative
> > >>>>>>>> email conversation.
> > >>>>>>>>> It's irrelevant.
> > >>>>>>>> Then lets not debate on this enable a vq or create a vq anymore
> > >>>>>>>>> All I am saying is, when we know the limitations of the
> > >>>>>>>>> transport and when industry is forwarding to not introduced
> > >>>>>>>>> more and more on-die register
> > >>>>>>>> for once in lifetime work of device migration, we just use the
> > >>>>>>>> optimal command and queue interface that is native to virtio.
> > >>>>>>>> PCI config space has its own limitations, and admin vq has its
> > >>>>>>>> advantages, but that does not apply to all use cases.
> > >>>>>>>>
> > >>>>>>> There was a recent work done emulating the SR-IOV cap and
> > >>>>>>> allowing VM to
> > >>>>>> enable SR-IOV in [1].
> > >>>>>>> This is the option I mentioned few weeks ago.
> > >>>>>>>
> > >>>>>>> So with admin commands and admin virtqueues, even nested model
> > >>>>>>> will work
> > >>>>>> using [1].
> > >>>>>>> [1]
> > >>>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-off
> > >>>>>>> lo
> > >>>>>>> ad
> > >>>>>>> -o
> > >>>>>>> n-virtual-machines.html
> > >>>>>> We should take this into consideration once it is standardized in
> > >>>>>> the spec, maybe not now, there can always be many workarounds to
> > >>>>>> solve one
> > >>>> problem.
> > >>>>> Sure, until that point the admin commands are able to suffice the
> > >>>>> need
> > >> well.
> > >>>>> And when the spec changes in transport occurs (if needed), current
> > >>>>> admin
> > >>>> command and admin vq also fits very well that will follow above [1].
> > >>>> we have pointed lots of problems for admin vq based live migration
> > >>>> proposal, I won't repeat them here
> > >>> I don’t see any.
> > >>> Nested is already solved using above.
> > >> I don't see how, do you mind to work out the patches?
> > > Once the base series is completed, nested cases can be addressed.
> > > I wont be able to work on the patches for it until we finish for the first level
> > virtualization.
> > As you know, nested is supported well in current virtio, so please don't break it.
>
> And same comment repeats. 😊
> Expect same response...
> Sorry, no virtio specification does not support device migration today.
> Nothing is broken by adding new features.
>
> Above [1] has the right proposal that Jason's paper pointed out. Please use it.

I was involved in the design in [1], and I don't see a connection to
the discussion here:

1) It is based on vDPA in L0
2) It doesn't address the nesting issue, which requires a proper design
in the virtio spec to support migration in the nesting layer.

Thanks



* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-16  5:27                   ` Parav Pandit
  2023-11-16 10:12                     ` Zhu, Lingshan
@ 2023-11-21  7:33                     ` Jason Wang
  2023-11-21 16:32                       ` Parav Pandit
  2023-11-21 21:18                       ` Michael S. Tsirkin
  1 sibling, 2 replies; 186+ messages in thread
From: Jason Wang @ 2023-11-21  7:33 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

On Thu, Nov 16, 2023 at 1:27 PM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, November 16, 2023 9:50 AM
> >
> > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Monday, November 13, 2023 9:05 AM
> > > >
> > > > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > > >
> > > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
> > > > > > >>
> > > > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
> > > > > > >>>> When SUSPEND is set, device states and virtqueue states
> > > > > > >>>> should be stablized, therefore the driver should not reset
> > > > > > >>>> vqs when SUSPEND is set in device status.
> > > > > > >>>>
> > > > > > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > >>>> ---
> > > > > > >>>>    content.tex | 3 +++
> > > > > > >>>>    1 file changed, 3 insertions(+)
> > > > > > >>>>
> > > > > > >>>> diff --git a/content.tex b/content.tex index
> > > > > > >>>> bcc9d4b..060b5c2
> > > > > > >>>> 100644
> > > > > > >>>> --- a/content.tex
> > > > > > >>>> +++ b/content.tex
> > > > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > > > >>>> Reset}\label{sec:Basic
> > > > > > Facilities of a Virtio Device /
> > > > > > >>>>    The device MUST reset any state of a virtqueue to the default
> > state,
> > > > > > >>>>    including the available state and the used state.
> > > > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
> > > > > > >>>> +\field{device status}, the driver SHOULD NOT reset any
> > virtqueues.
> > > > > > >>>> +
> > > > > > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> > > > > > >>>> Facilities of a
> > > > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > > > > > >>>>    After the driver tells the device to reset a queue, the
> > > > > > >>>> driver MUST verify that
> > > > > > >>> Seems somewhat arbitrary and breaks the claim that the
> > > > > > >>> feature is orthogonal and can have uses besides migration.
> > > > > > >> when suspended, the device is frozen.
> > > > > > >> The driver is aware of this process and so should not reset the vqs I
> > think.
> > > > > > > Again that is only true because you want to use it for migration.
> > > > > > > But then you can't claim it's a generic facility.
> > > > > > I don't get it. The device status is a basic facility.
> > > > > >
> > > > > > We need to SUSPEND the device by setting SUSPEND bit, to
> > > > > > stabilize the device states for migration.
> > > > > Is the PCI's PM time not enough to suspend the device?
> > > >
> > > > Are you saying we don't need virtio reset assuming we had FLR?
> > > >
> > > No. often FLR timing is not enough. Hence every PCI level device has some
> > sort of its own reset mechanism.
> > >
> > > > Suspending at different layers like rest at different layers.
> > > >
> > > > We have both FLR and virtio reset. The Virtio level function could
> > > > be reset without FLR. So did suspend.
> > > >
> > > > That's it.
> > > Sure, but wrapping it under some "basic facility" is just does not make sense.
> >
> > Why, device status (e.g reset) belongs to that part.
> >
> Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
> Instead of claiming it as some non_device_migration facility does not make sense.

It is used for migration for sure.

>
> > >
> > > >
> > > > And if you want to rule P2P behaviours, PCI PM is really the correct
> > > > way to go instead of trying to do it at the virtio layer.
> > > >
> > > PCI PM is supposed to be controlled by the guest and so the suspend.
> >
> > I've listed issues about D3cold and others, I can't believe it can't be controlled
> > totally by guests.
> >
> D3cold is not controlled by the driver as defined by the PCI spec hence it is not applicable.

Have you seen the link I gave you? Even if you are right, there could
still be such a request from the firmware, no?

> D3hot is controlled by the driver.

So, it requires the device context to be preserved, which is not
documented in your patch.

> > >
> > > Hypervisor needs its channel to suspend the device, as fundamentally guest is
> > unaware of device migration flow.
> >
> > That's pretty fine, the hypervisor also needs its channel to reset the device. If
> > you think there's a conflict with suspend, there should be one for reset as well.
> >
> I don’t see a need for hypervisor to reset the device in passthrough mode. Can you explain why is it needed?

QEMU has a command, "system_reset".

> Do you mean, it is needed in vdpa mode? If yes, the registers are emulated anyway, so why the member device's native channel cannot be used in vdpa mode?
>
> > >
> > > > > For large device I could imagine it could be short.
> > > > >
> > > > > In that case if there is suspend the device available, it will be
> > > > > used by the guest
> > > > driver itself, hypervisor wouldn’t know about it when those
> > > > registers are not trapped.
> > > > > So we need two ways to suspend.
> > > > > One is guest visible, and guest controlled.
> > > > > Second is hypervisor control to fulfill the device migration needs.
> > > >
> > > > Can you explain why suspend is special but not reset or why reset
> > > > can work but not suspend? If reset can work, so does suspend. If
> > > > reset can't, neither does suspend.
> > > >
> > > As long as reset and suspend both are under guest control, I am fine.
> >
> > Well, you seem to ignore my question below. Hypervisor needs to reset the
> > device as well.
> >
> Why is it needed in passthrough mode?
>
> > >
> > > > For example, can you explain how a system_reset in Qemu can work
> > > > with your proposal?
> > > >
> > > > >
> > > > > So if you can please take a look if the proposed admin command to
> > > > freeze/stop mode can be used in the emulated register case or not.
> > > >
> > > > Again, if you design those for PCI, it's a layer violation. You have
> > > > answered
> > > They are used by the PCI layer, just like your suspend bit.
> > > Any other transport can also use it.
> > >
> > > > yourself that PM is the right way to go.
> > > >
> > > > > It helps to have the suspend bit in guest control as well
> > > > > with/without
> > > > emulation mode.
> > > >
> > > > I won't repeat it again. You will find you need a full transport to
> > > > satisfy all the requirements.
> > > I disagree for full transport.
> >
> > See above and the discussion in another thread.
> >
> > > If you want to get discuss transport for sure it is some other thread
> > > and I want to see "driver notifications via such transport VQ" to fully qualify it
> > as transport, And that would be just sub-optimal for actual working.
> >
> > Sub-optimal since the function is duplicated with a transport but it doesn't
> > claim or design as a transport.
> >
> It is not sub-optimal because of duplication. It is because you want to transport notifications via virtqueue.

Have you read the tvq series? You would not reach this conclusion if
you had.

>
> > > And hence, I wouldn’t call it a transport anymore.
> > >
> > > >
> > > > >
> > > > > > This can also be used for debugging I think.
> > > > >
> > > > > As Michael listed, a dedicated debug interface is usually more
> > > > > useful instead
> > > > of in-band.
> > > >
> > > > Well, I've shown you the in-band facilities like debugging via
> > > > ethtool and kernel has a lot of other ones. If you have ever tried
> > > > to debug in a real production environment, you will find how useful
> > > > such handy information is where out-of- band facilities are often dangerous
> > and usually prohibited or even unsupported.
> > > Guest driver can always read and write the device status without adding a
> > suspend bit.
> >
> > I don't get here. Suspend make sure the device state is frozen which helps for
> > debugging for sure.
> You wanted to debug some vq live, you suspend the device, the vq state got changed.
>
> I just don’t see that suspend is a debug tool.

It's not a tool, it's a function that can be used as a debug tool.

> Every feature is a debug feature literally.
> Classic heisenbug effect.
>
> One can change driver notification frequency to see if interrupt rate changed for debugging.
> One can disable a few RQs and see RSS...
> Blk can change blk_size to higher value to perf debug..
> The list continues..

Let's not shift concepts.

Obviously, suspend is not the only way to debug. But that's not the
context here.

Thanks

>
> >
> > Thanks
> >
> > >
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-21  4:30                                         ` Jason Wang
@ 2023-11-21 16:26                                           ` Parav Pandit
  2023-11-22  4:15                                             ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-21 16:26 UTC (permalink / raw)
  To: Jason Wang; +Cc: Zhu, Lingshan, mst, eperezma, cohuck, stefanha, virtio-comment


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 10:01 AM
> 
> On Fri, Nov 17, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Friday, November 17, 2023 3:32 PM
> > > To: Parav Pandit <parav@nvidia.com>; jasowang@redhat.com;
> > > mst@redhat.com; eperezma@redhat.com; cohuck@redhat.com;
> > > stefanha@redhat.com
> > > Cc: virtio-comment@lists.oasis-open.org
> > > Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci:
> > > implement VIRTIO_F_QUEUE_STATE
> > >
> > >
> > >
> > > On 11/16/2023 6:21 PM, Parav Pandit wrote:
> > > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >> Sent: Thursday, November 16, 2023 3:45 PM
> > > >>
> > > >> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > > >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>> Sent: Monday, November 13, 2023 2:56 PM
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> > > >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>> Sent: Friday, November 10, 2023 1:22 PM
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> > > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> > > >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > >>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > >>>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> > > >>>>>>>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of
> > > >>>>>>>>>>>>>>>> Zhu, Lingshan
> > > >>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > >>>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > >>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
> > > >>>>>>>>>>>>>>>>>> configuration structure to support
> > > >>>>>>>>>>>>>>>>>> VIRTIO_F_QUEUE_STATE in PCI transport layer.
> > > >>>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan
> > > >>>>>>>>>>>>>>>>>> <lingshan.zhu@intel.com>
> > > >>>>>>>>>>>>>>>>>> ---
> > > >>>>>>>>>>>>>>>>>>           transport-pci.tex | 18 ++++++++++++++++++
> > > >>>>>>>>>>>>>>>>>>           1 file changed, 18 insertions(+)
> > > >>>>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> > > >>>>>>>>>>>>>>>>>> index a5c6719..3161519 100644
> > > >>>>>>>>>>>>>>>>>> --- a/transport-pci.tex
> > > >>>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
> > > >>>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
> > > >>>>>>>>>>>>>>>>>>                   /* About the administration virtqueue. */
> > > >>>>>>>>>>>>>>>>>>                   le16 admin_queue_index;         /* read-only for driver */
> > > >>>>>>>>>>>>>>>>>>                   le16 admin_queue_num;         /* read-only for driver */
> > > >>>>>>>>>>>>>>>>>> +
> > > >>>>>>>>>>>>>>>>>> +  /* Virtqueue state */
> > > >>>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> > > >>>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> > > >>>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues
> > > >>>>>>>>>>>>>>>>> through register read writes, does not work effectively.
> > > >>>>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
> > > >>>>>>>>>>>>>>>>> Hence toy registers like this do not work.
> > > >>>>>>>>>>>>>>>> Do you know there is a queue_select? Why does this
> > > >>>>>>>>>>>>>>>> not work?
> > > >>>>>>>>>>>>>>>> Do you know how other queue related fields work?
> > > >>>>>>>>>>>>>>> :)
> > > >>>>>>>>>>>>>>> Yes. If you notice, a critical spec bug fix related to
> > > >>>>>>>>>>>>>>> queue_reset was done when it was introduced so that live
> > > >>>>>>>>>>>>>>> migration can _actually_ work.
> > > >>>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it
> > > >>>>>>>>>>>>>>> takes a lot of time to read those slow register
> > > >>>>>>>>>>>>>>> interfaces for this + inflight descriptors + more.
> > > >>>>>>>>>>>>>> Interesting, virtio has worked in this pattern for many
> > > >>>>>>>>>>>>>> years, right?
> > > >>>>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not
> > > >>>>>>>>>>>>> present, number of queues were not in hw.
> > > >>>>>>>>>>>> The registers are control path in config space, how
> > > >>>>>>>>>>>> would 400G or 800G affect them?
> > > >>>>>>>>>>> Because those are the ones that in practice require a
> > > >>>>>>>>>>> large number of VQs.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> You are asking for per-VQ register commands to modify
> > > >>>>>>>>>>> things dynamically, one vq at a time, serializing all
> > > >>>>>>>>>>> the operations.
> > > >>>>>>>>>>> It does not scale well with high q count.
> > > >>>>>>>>>> This is not dynamic, it only happens on SUSPEND and
> > > >>>>>>>>>> RESUME.
> > > >>>>>>>>>> This is the same mechanism how virtio initialize a
> > > >>>>>>>>>> virtqueue, working for many years.
> > > >>>>>>>>> No. When the virtio driver initializes it for the first
> > > >>>>>>>>> time, there is no active traffic that gets lost.
> > > >>>>>>>>> This is because the interface is not yet up and not part
> > > >>>>>>>>> of the network yet.
> > > >>>>>>>>> The resume must be fast enough, because the remote node is
> > > >>>>>>>>> sending packets.
> > > >>>>>>>>> Hence it is different from driver init time queue enable.
> > > >>>>>>>> I am not sure any packets arrive before a link announce at
> > > >>>>>>>> the destination side.
> > > >>>>>>> I think they can.
> > > >>>>>>> Because there is no notification of the member device's
> > > >>>>>>> link going down to the remote side.
> > > >>>>>>> The L4 and L5 protocols have no knowledge that the node
> > > >>>>>>> they are interacting with is behind some layers of switches.
> > > >>>>>>> So keeping this time low is desired.
> > > >>>>>> The NIC should broadcast itself first, so that other peers
> > > >>>>>> in the network know (for example, its MAC to route to) how
> > > >>>>>> to send a message to it.
> > > >>>>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE;
> > > >>>>>> similar mechanisms have worked in in-market products for years.
> > > >>>>>>
> > > >>>>>> This is out of the topic anyway.
> > > >>>>>>>>>>>> See the virtio common cfg, you will find the max number
> > > >>>>>>>>>>>> of vqs is there, num_queues.
> > > >>>>>>>>>>> :)
> > > >>>>>>>>>>> Sure. those values at high q count affects.
> > > >>>>>>>>>> the driver need to initialize them anyway.
> > > >>>>>>>>> That is before the traffic starts from remote end.
> > > >>>>>>>> see above, that needs a link announce and this is after
> > > >>>>>>>> re-initialization
> > > >>>>>>>>>>>>> Device didn’t support LM.
> > > >>>>>>>>>>>>> Many limitations existed all these years and TC is
> > > >>>>>>>>>>>>> improving and expanding
> > > >>>>>>>>>>>> them.
> > > >>>>>>>>>>>>> So all these years do not matter.
> > > >>>>>>>>>>>> Not sure what you are talking about, haven't we
> > > >>>>>>>>>>>> initialized the device and vqs in config space for
> > > >>>>>>>>>>>> years? What's wrong with this mechanism?
> > > >>>>>>>>>>>> Are you questioning virtio-pci fundamentals?
> > > >>>>>>>>>>> Don’t point to an inefficient past to establish a
> > > >>>>>>>>>>> similarly inefficient future.
> > > >>>>>>>>>> interesting, you know this is a one-time thing, right?
> > > >>>>>>>>>> and you are aware of this has been there for years.
> > > >>>>>>>>>>>>>>>> Like how to set a queue size and enable it?
> > > >>>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as
> > > >>>>>>>>>>>>>>> they are init time
> > > >>>>>>>>>>>>>> registers.
> > > >>>>>>>>>>>>>>> Not to keep abusing them..
> > > >>>>>>>>>>>>>> don't you need to set queue_size at the destination side?
> > > >>>>>>>>>>>>> No.
> > > >>>>>>>>>>>>> But the src/dst does not matter.
> > > >>>>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
> > > >>>>>>>>>>>>> registers, as all
> > > >>>>>>>>>>>> queues must be created before the driver_ok phase.
> > > >>>>>>>>>>>>> Queue_reset was last moment exception.
> > > >>>>>>>>>>>> create a queue? Nvidia specific?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>> Huh. No.
> > > >>>>>>>>>>> Do git log and realize what happened with queue_reset.
> > > >>>>>>>>>> You didn't answer the question, does the spec even
> > > >>>>>>>>>> define "create a vq"?
> > > >>>>>>>>> Enabled/created = tomato/tomato when discussing the spec
> > > >>>>>>>>> in a non-normative email conversation.
> > > >>>>>>>>> It's irrelevant.
> > > >>>>>>>> Then let's not debate enabling a vq versus creating a vq
> > > >>>>>>>> anymore.
> > > >>>>>>>>> All I am saying is, when we know the limitations of the
> > > >>>>>>>>> transport and when the industry is moving toward not
> > > >>>>>>>>> introducing more and more on-die registers for the
> > > >>>>>>>>> once-in-a-lifetime work of device migration, we just use
> > > >>>>>>>>> the optimal command and queue interface that is native to
> > > >>>>>>>>> virtio.
> > > >>>>>>>> PCI config space has its own limitations, and admin vq has
> > > >>>>>>>> its advantages, but that does not apply to all use cases.
> > > >>>>>>>>
> > > >>>>>>> There was a recent work done emulating the SR-IOV cap and
> > > >>>>>>> allowing VM to
> > > >>>>>> enable SR-IOV in [1].
> > > >>>>>>> This is the option I mentioned few weeks ago.
> > > >>>>>>>
> > > >>>>>>> So with admin commands and admin virtqueues, even nested
> > > >>>>>>> model will work
> > > >>>>>> using [1].
> > > >>>>>>> [1]
> > > >>>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> > > >>>>>> We should take this into consideration once it is
> > > >>>>>> standardized in the spec, maybe not now, there can always be
> > > >>>>>> many workarounds to solve one problem.
> > > >>>>> Sure, until that point the admin commands are able to meet
> > > >>>>> the need well.
> > > >>>>> And when the spec changes in the transport occur (if needed),
> > > >>>>> the current admin command and admin vq also fit very well
> > > >>>>> and will follow [1] above.
> > > >>>> we have pointed lots of problems for admin vq based live
> > > >>>> migration proposal, I won't repeat them here
> > > >>> I don’t see any.
> > > >>> Nested is already solved using above.
> > > >> I don't see how, do you mind to work out the patches?
> > > > Once the base series is completed, nested cases can be addressed.
> > > > I won't be able to work on the patches for it until we finish
> > > > the first level virtualization.
> > > As you know, nested is supported well in current virtio, so please
> > > don't break it.
> >
> > And same comment repeats. 😊
> > Expect same response...
> > Sorry, no, the virtio specification does not support device migration today.
> > Nothing is broken by adding new features.
> >
> > Above [1] has the right proposal that Jason's paper pointed out. Please
> > use it.
> 
> I was involved in the design in [1]. And I don't see a connection to the
> discussion here.
> 
> 1) It is based on vDPA in L0
> 2) It doesn't address the nesting issue, it requires a proper design in the virtio
> spec to support migration in the nesting layer.

Nothing prevents [1] from being done without vDPA.
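
For readers following the scaling argument quoted above, the per-virtqueue
register flow can be sketched roughly as below. This is only an illustrative
model, not code from the series: the field names queue_select,
queue_avail_state and queue_used_state come from the quoted patch, while
MockCommonCfg, save_vq_states and restore_vq_states are hypothetical helpers
that stand in for real MMIO accesses.

```python
# Illustrative model only. Field names follow the quoted patch; the class and
# helper functions are hypothetical and merely count register accesses.

class MockCommonCfg:
    """Toy stand-in for the PCI common configuration structure."""

    def __init__(self, num_queues):
        self.num_queues = num_queues
        self.queue_select = 0
        # Per-queue [avail_state, used_state] held by the "device".
        self._state = [[0, 0] for _ in range(num_queues)]
        self.accesses = 0  # total register reads and writes

    def write(self, field, value):
        self.accesses += 1
        if field == "queue_select":
            self.queue_select = value
        elif field == "queue_avail_state":
            self._state[self.queue_select][0] = value & 0xFFFF  # le16 field
        elif field == "queue_used_state":
            self._state[self.queue_select][1] = value & 0xFFFF  # le16 field
        else:
            raise KeyError(field)

    def read(self, field):
        self.accesses += 1
        if field == "queue_avail_state":
            return self._state[self.queue_select][0]
        if field == "queue_used_state":
            return self._state[self.queue_select][1]
        raise KeyError(field)


def save_vq_states(cfg):
    """Read every queue's state serially, one queue_select at a time."""
    saved = []
    for q in range(cfg.num_queues):
        cfg.write("queue_select", q)
        avail = cfg.read("queue_avail_state")
        used = cfg.read("queue_used_state")
        saved.append((avail, used))
    return saved


def restore_vq_states(cfg, saved):
    """Write the saved state back, again one queue at a time."""
    for q, (avail, used) in enumerate(saved):
        cfg.write("queue_select", q)
        cfg.write("queue_avail_state", avail)
        cfg.write("queue_used_state", used)


if __name__ == "__main__":
    src = MockCommonCfg(128)
    saved = save_vq_states(src)
    print("register accesses to save 128 queues:", src.accesses)
```

In this model each queue costs three register accesses per direction, so 128
queues need 384 serialized accesses to save and another 384 to restore. That
is the cost being objected to above; the counterargument in the thread is that
it is paid only at suspend and resume time.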

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-21  7:33                     ` Jason Wang
@ 2023-11-21 16:32                       ` Parav Pandit
  2023-11-22  5:28                         ` Jason Wang
  2023-11-21 21:18                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-21 16:32 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, November 21, 2023 1:03 PM
> 
> On Thu, Nov 16, 2023 at 1:27 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, November 16, 2023 9:50 AM
> > >
> > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Monday, November 13, 2023 9:05 AM
> > > > >
> > > > > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > > > >
> > > > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan
> wrote:
> > > > > > > >>
> > > > > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan
> wrote:
> > > > > > > >>>> When SUSPEND is set, device states and virtqueue states
> > > > > > > >>>> should be stabilized, therefore the driver should not
> > > > > > > >>>> reset vqs when SUSPEND is set in device status.
> > > > > > > >>>>
> > > > > > > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > >>>> ---
> > > > > > > >>>>    content.tex | 3 +++
> > > > > > > >>>>    1 file changed, 3 insertions(+)
> > > > > > > >>>>
> > > > > > > >>>> diff --git a/content.tex b/content.tex
> > > > > > > >>>> index bcc9d4b..060b5c2 100644
> > > > > > > >>>> --- a/content.tex
> > > > > > > >>>> +++ b/content.tex
> > > > > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue Reset}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > > >>>>    The device MUST reset any state of a virtqueue to the default state,
> > > > > > > >>>>    including the available state and the used state.
> > > > > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set
> > > > > > > >>>> +in \field{device status}, the driver SHOULD NOT reset
> > > > > > > >>>> +any virtqueues.
> > > > > > > >>>> +
> > > > > > > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic Facilities of a Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
> > > > > > > >>>>    After the driver tells the device to reset a queue, the driver MUST verify that
> > > > > > > >>> Seems somewhat arbitrary and breaks the claim that the
> > > > > > > >>> feature is orthogonal and can have uses besides migration.
> > > > > > > >> when suspended, the device is frozen.
> > > > > > > >> The driver is aware of this process and so should not
> > > > > > > >> reset the vqs I think.
> > > > > > > > Again that is only true because you want to use it for migration.
> > > > > > > > But then you can't claim it's a generic facility.
> > > > > > > I don't get it. The device status is a basic facility.
> > > > > > >
> > > > > > > We need to SUSPEND the device by setting SUSPEND bit, to
> > > > > > > stabilize the device states for migration.
> > > > > > Is the PCI's PM time not enough to suspend the device?
> > > > >
> > > > > Are you saying we don't need virtio reset assuming we had FLR?
> > > > >
> > > > > No. Often FLR timing is not enough. Hence every PCI level device
> > > > > has some sort of its own reset mechanism.
> > > >
> > > > > > Suspending at different layers is like reset at different layers.
> > > > > >
> > > > > > We have both FLR and virtio reset. The Virtio level function
> > > > > > could be reset without FLR. So could suspend.
> > > > >
> > > > > That's it.
> > > > > Sure, but wrapping it under some "basic facility" just does not make
> > > > > sense.
> > >
> > > Why, device status (e.g reset) belongs to that part.
> > >
> > > Lingshan claimed in the commit log that suspending the device is for live
> > > migration, yet in discussion he portrays it as some basic facility unrelated
> > > to device migration, such as debug etc.
> > > Claiming it instead as some non_device_migration facility does not make
> > > sense.
> 
> It is used for migration for sure.
This is why it does not work when the device is directly mapped.
The hypervisor is manipulating this bit while the guest is also doing power management with it.

Both of them need separate channels to do their own work.

> 
> >
> > > >
> > > > >
> > > > > And if you want to rule P2P behaviours, PCI PM is really the
> > > > > correct way to go instead of trying to do it at the virtio layer.
> > > > >
> > > > PCI PM is supposed to be controlled by the guest and so the suspend.
> > >
> > > I've listed issues about D3cold and others, I can't believe it can't
> > > be controlled totally by guests.
> > >
> > D3cold is not controlled by the driver as defined by the PCI spec hence it is
> not applicable.
> 
> > Have you seen the link I gave you? Even if you are right, there still could be such
> > a request from the firmware, no?
I may have missed the link.
You sent 10 replies, so it is easy to miss important things in the rest of the comments.

> 
> > D3hot is controlled by the driver.
> 
> So, it requires the device context to be preserved, which is not documented in
> your patch.
PCI PM interaction is covered in v4, in the device requirements section.

> 
> > > >
> > > > Hypervisor needs its channel to suspend the device, as
> > > > fundamentally guest is
> > > unaware of device migration flow.
> > >
> > > That's pretty fine, the hypervisor also needs its channel to reset
> > > the device. If you think there's a conflict with suspend, there should be one
> for reset as well.
> > >
> > I don’t see a need for hypervisor to reset the device in passthrough mode.
> Can you explain why is it needed?
> 
> Qemu has a command "system_reset".
> 
I mean, what does this translate to when resetting the device in passthrough mode?
If it is FLR, that already exists.

> > Do you mean, it is needed in vdpa mode? If yes, the registers are emulated
> anyway, so why the member device's native channel cannot be used in vdpa
> mode?
> >
> > > >
> > > > > > For large device I could imagine it could be short.
> > > > > >
> > > > > > In that case if there is suspend the device available, it will
> > > > > > be used by the guest
> > > > > driver itself, hypervisor wouldn’t know about it when those
> > > > > registers are not trapped.
> > > > > > So we need two ways to suspend.
> > > > > > One is guest visible, and guest controlled.
> > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > >
> > > > > Can you explain why suspend is special but not reset or why
> > > > > reset can work but not suspend? If reset can work, so does
> > > > > suspend. If reset can't, neither does suspend.
> > > > >
> > > > As long as reset and suspend both are under guest control, I am fine.
> > >
> > > Well, you seem to ignore my question below. Hypervisor needs to
> > > reset the device as well.
> > >
> > Why is it needed in passthrough mode?
> >
> > > >
> > > > > For example, can you explain how a system_reset in Qemu can work
> > > > > with your proposal?
> > > > >
> > > > > >
> > > > > > So if you can please take a look if the proposed admin command
> > > > > > to
> > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > >
> > > > > Again, if you design those for PCI, it's a layer violation. You
> > > > > have answered
> > > > They are used by the PCI layer, just like your suspend bit.
> > > > Any other transport can also use it.
> > > >
> > > > > yourself that PM is the right way to go.
> > > > >
> > > > > > It helps to have the suspend bit in guest control as well
> > > > > > with/without
> > > > > emulation mode.
> > > > >
> > > > > I won't repeat it again. You will find you need a full transport
> > > > > to satisfy all the requirements.
> > > > I disagree for full transport.
> > >
> > > See above and the discussion in another thread.
> > >
> > > > If you want to get discuss transport for sure it is some other
> > > > thread and I want to see "driver notifications via such transport
> > > > VQ" to fully qualify it
> > > as transport, And that would be just sub-optimal for actual working.
> > >
> > > Sub-optimal since the function is duplicated with a transport but it
> > > doesn't claim or design as a transport.
> > >
> > It is not sub-optimal because of duplication. It is because you want to
> transport notifications via virtqueue.
> 
> > Have you read the tvq series? You would not reach this conclusion if you
> > had.
> 
I have read those 4 patches and I have seen that the transport vq does not want to transport notifications.
Hence it does not qualify as a transport vq.

Frankly, transport vq seems a way to formalize mediation forever in virtio.
It is a very weird way to build a new SIOV device.
For most things it should be the direct channel that virtio already has from the driver to the device.


> >
> > > > And hence, I wouldn’t call it a transport anymore.
> > > >
> > > > >
> > > > > >
> > > > > > > This can also be used for debugging I think.
> > > > > >
> > > > > > As Michael listed, a dedicated debug interface is usually more
> > > > > > useful instead
> > > > > of in-band.
> > > > >
> > > > > Well, I've shown you the in-band facilities like debugging via
> > > > > ethtool and kernel has a lot of other ones. If you have ever
> > > > > tried to debug in a real production environment, you will find
> > > > > how useful such handy information is where out-of- band
> > > > > facilities are often dangerous
> > > and usually prohibited or even unsupported.
> > > > Guest driver can always read and write the device status without
> > > > adding a
> > > suspend bit.
> > >
> > > I don't get here. Suspend make sure the device state is frozen which
> > > helps for debugging for sure.
> > You wanted to debug some vq live, you suspend the device, the vq state got
> changed.
> >
> > I just don’t see that suspend is a debug tool.
> 
> It's not a tool, it's a function that can be used as a debug tool.
> 
> > Every feature is a debug feature literally.
> > Classic heisenbug effect.
> >
> > One can change driver notification frequency to see if interrupt rate
> > changed for debugging.
> > One can disable a few RQs and see RSS...
> > Blk can change blk_size to higher value to perf debug..
> > The list continues..
> 
> Let's not shift concepts.
> 
Your comment describing the device migration facility as a debug feature is what actually shifts the concept.

> Obviously, suspend is not the only way to debug. But that's not the context
> here.
> 
I have no further comments on the claim that suspending a device is a debug feature.
If it is, add a debug section and put it under that.
You also know that it is not, so let's not waste our time.

I just don’t see the suspend bit as a debug interface, given that it undergoes the classic heisenbug effect.

> Thanks
> 
> >
> > >
> > > Thanks
> > >
> > > >
> >
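
As a reading aid for the rule in the quoted patch 3/6 ("If VIRTIO_F_SUSPEND is
negotiated and SUSPEND is set in device status, the driver SHOULD NOT reset
any virtqueues"), a driver-side check might look roughly like the sketch
below. The numeric bit values are placeholders assumed for illustration, since
this part of the thread does not give them, and driver_may_reset_vq is a
hypothetical helper name.

```python
# Placeholder bit values: the quoted patch names VIRTIO_F_SUSPEND and the
# SUSPEND status bit, but their numeric values are NOT taken from the thread.
VIRTIO_F_SUSPEND = 1 << 0   # hypothetical feature-bit mask
STATUS_SUSPEND = 1 << 4     # hypothetical device-status mask


def driver_may_reset_vq(negotiated_features: int, device_status: int) -> bool:
    """Return False when the quoted rule says the driver SHOULD NOT reset a vq."""
    feature_negotiated = bool(negotiated_features & VIRTIO_F_SUSPEND)
    suspend_set = bool(device_status & STATUS_SUSPEND)
    # The restriction applies only when both conditions hold.
    return not (feature_negotiated and suspend_set)
```

The check is purely a driver-side guard. It does not address the question
debated in this thread of who sets SUSPEND (guest or hypervisor), nor what the
device does if a reset arrives anyway.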


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-21  7:33                     ` Jason Wang
  2023-11-21 16:32                       ` Parav Pandit
@ 2023-11-21 21:18                       ` Michael S. Tsirkin
  2023-11-22  1:51                         ` Zhu, Lingshan
  2023-11-22  5:28                         ` Jason Wang
  1 sibling, 2 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-21 21:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > Lingshan claimed in the commit log that suspending the device is for live migration, yet in discussion he portrays it as some basic facility unrelated to device migration, such as debug etc.
> > Claiming it instead as some non_device_migration facility does not make sense.
> 
> It is used for migration for sure.

Well, having a generic facility to stop a device sounds like a nice thing.
However, the devil is in the details. A lot of the detail here seems very much
tailored to one very specific implementation.
So thinking through how it will work, e.g. for power management,
would be a good exercise to figure out how it should work in detail.
Parav, did you indicate at some point that a virtio-specific SUSPEND
bit can be useful for PM? Could you share how it's better than
transport-level PM and what the requirements are?
Thanks!

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-17 10:45                                       ` Michael S. Tsirkin
@ 2023-11-22  1:32                                         ` Zhu, Lingshan
  2023-11-22  6:53                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-22  1:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/17/2023 6:45 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2023 at 06:02:14PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/16/2023 6:21 PM, Parav Pandit wrote:
>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>> Sent: Thursday, November 16, 2023 3:45 PM
>>>>
>>>> On 11/16/2023 1:35 AM, Parav Pandit wrote:
>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>> Sent: Monday, November 13, 2023 2:56 PM
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>> Sent: Friday, November 10, 2023 1:22 PM
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
>>>>>>>>>>>>>>>>>> <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
>>>>>>>>>>>>>>>>>> Lingshan
>>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
>>>>>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
>>>>>>>>>>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
>>>>>>>>>>>>>>>>>>>> in PCI transport
>>>>>>>> layer.
>>>>>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>            transport-pci.tex | 18 ++++++++++++++++++
>>>>>>>>>>>>>>>>>>>>            1 file changed, 18 insertions(+)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
>>>>>>>>>>>>>>>>>>>> index
>>>>>>>>>>>>>>>>>>>> a5c6719..3161519 100644
>>>>>>>>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
>>>>>> configuration
>>>>>>>>>>>>>>>> structure
>>>>>>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>>>>>>>>                    /* About the administration virtqueue. */
>>>>>>>>>>>>>>>>>>>>                    le16 admin_queue_index;         /* read-only for
>>>> driver
>>>>>>>> */
>>>>>>>>>>>>>>>>>>>>                    le16 admin_queue_num;         /* read-only for
>>>> driver
>>>>>>>> */
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +	/* Virtqueue state */
>>>>>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
>>>>>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
>>>>>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
>>>>>>>>>>>>>>>>>>> register read writes, does
>>>>>>>>>>>>>>>>>> not work effectively.
>>>>>>>>>>>>>>>>>>> There are inflight out of order descriptors for block also.
>>>>>>>>>>>>>>>>>>> Hence toy registers like this do not work.
>>>>>>>>>>>>>>>>>> Do you know there is a queue_select? Why this does not
>>>> work?
>>>>>>>>>>>>>>>>>> Do you know how other queue related fields work?
>>>>>>>>>>>>>>>>> :)
>>>>>>>>>>>>>>>>> Yes. If you notice queue_reset related critical spec bug
>>>>>>>>>>>>>>>>> fix was done when it
>>>>>>>>>>>>>>>> was introduced so that live migration can _actually_ work.
>>>>>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it take
>>>>>>>>>>>>>>>>> a lot of time to
>>>>>>>>>>>>>>>> read those slow register interface for this + inflight
>>>>>>>>>>>>>>>> descriptors +
>>>>>>>> more.
>>>>>>>>>>>>>>>> interesting, virtio work in this pattern for many years, right?
>>>>>>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present,
>>>>>>>>>>>>>>> number of
>>>>>>>>>>>>>> queues were not in hw.
>>>>>>>>>>>>>> The registers are control path in config space, how 400G or
>>>>>>>>>>>>>> 800G
>>>>>>>> affect??
>>>>>>>>>>>>> Because those are the one in practice requires large number of VQs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> You are asking per VQ register commands to modify things
>>>>>>>>>>>>> dynamically via
>>>>>>>>>>>> this one vq at a time, serializing all the operations.
>>>>>>>>>>>>> It does not scale well with high q count.
>>>>>>>>>>>> This is not dynamically, it only happens when SUSPEND and RESUME.
>>>>>>>>>>>> This is the same mechanism how virtio initialize a virtqueue,
>>>>>>>>>>>> working for many years.
>>>>>>>>>>> No. when virtio driver initializes it for the first time, there
>>>>>>>>>>> is no active traffic
>>>>>>>>>> that gets lost.
>>>>>>>>>>> This is because the interface is not yet up and not part of the
>>>>>>>>>>> network
>>>>>> yet.
>>>>>>>>>>> The resume must be fast enough, because the remote node is
>>>>>>>>>>> sending
>>>>>>>>>> packets.
>>>>>>>>>>> Hence it is different from driver init time queue enable.
>>>>>>>>>> I am not sure any packets arrive before a link announce at the
>>>>>>>>>> destination
>>>>>>>> side.
>>>>>>>>> I think it can.
>>>>>>>>> Because there is no notification of member device link down
>>>>>>>>> intimation to
>>>>>>>> remote side.
>>>>>>>>> The L4 and L5 protocols have no knowledge that node which they are
>>>>>>>> interacting is behind some layers of switches.
>>>>>>>>> So keeping this time low is desired.
>>>>>>>> The NIC should broad cast itself first, so that other peers in the
>>>>>>>> network know(for example its mac to route it) how to send a message to
>>>> it.
>>>>>>>> This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
>>>>>>>> mechanism work for in-marketing productions for years.
>>>>>>>>
>>>>>>>> This is out of the topic anyway.
>>>>>>>>>>>>>> See the virtio common cfg, you will find the max number of
>>>>>>>>>>>>>> vqs is there, num_queues.
>>>>>>>>>>>>> :)
>>>>>>>>>>>>> Sure. those values at high q count affects.
>>>>>>>>>>>> the driver need to initialize them anyway.
>>>>>>>>>>> That is before the traffic starts from remote end.
>>>>>>>>>> see above, that needs a link announce and this is after
>>>>>>>>>> re-initialization
>>>>>>>>>>>>>>> Device didn’t support LM.
>>>>>>>>>>>>>>> Many limitations existed all these years and TC is improving
>>>>>>>>>>>>>>> and expanding
>>>>>>>>>>>>>> them.
>>>>>>>>>>>>>>> So all these years do not matter.
>>>>>>>>>>>>>> Not sure what are you talking about, haven't we initialize
>>>>>>>>>>>>>> the device and vqs in config space for years?????? What's
>>>>>>>>>>>>>> wrong with this
>>>>>>>>>> mechanism?
>>>>>>>>>>>>>> Are you questioning virtio-pci fundamentals???
>>>>>>>>>>>>> Don’t point to in-efficient past to establish similar in-efficient future.
>>>>>>>>>>>> interesting, you know this is a one-time thing, right?
>>>>>>>>>>>> and you are aware of this has been there for years.
>>>>>>>>>>>>>>>>>> Like how to set a queue size and enable it?
>>>>>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as they
>>>>>>>>>>>>>>>>> are init time
>>>>>>>>>>>>>>>> registers.
>>>>>>>>>>>>>>>>> Not to keep abusing them..
>>>>>>>>>>>>>>>> don't you need to set queue_size at the destination side?
>>>>>>>>>>>>>>> No.
>>>>>>>>>>>>>>> But the src/dst does not matter.
>>>>>>>>>>>>>>> Queue_size to be set before DRIVER_OK like rest of the
>>>>>>>>>>>>>>> registers, as all
>>>>>>>>>>>>>> queues must be created before the driver_ok phase.
>>>>>>>>>>>>>>> Queue_reset was last moment exception.
>>>>>>>>>>>>>> create a queue? Nvidia specific?
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Huh. No.
>>>>>>>>>>>>> Do git log and realize what happened with queue_reset.
>>>>>>>>>>>> You didn't answer the question, does the spec even has defined
>>>>>>>>>>>> "create a
>>>>>>>>>> vq"?
>>>>>>>>>>> Enabled/created = tomato/tomato when discussing the spec in
>>>>>>>>>>> non-normative
>>>>>>>>>> email conversation.
>>>>>>>>>>> It's irrelevant.
>>>>>>>>>> Then lets not debate on this enable a vq or create a vq anymore
>>>>>>>>>>> All I am saying is, when we know the limitations of the
>>>>>>>>>>> transport and when industry is forwarding to not introduced more
>>>>>>>>>>> and more on-die register
>>>>>>>>>> for once in lifetime work of device migration, we just use the
>>>>>>>>>> optimal command and queue interface that is native to virtio.
>>>>>>>>>> PCI config space has its own limitations, and admin vq has its
>>>>>>>>>> advantages, but that does not apply to all use cases.
>>>>>>>>>>
>>>>>>>>> There was a recent work done emulating the SR-IOV cap and allowing
>>>>>>>>> VM to
>>>>>>>> enable SR-IOV in [1].
>>>>>>>>> This is the option I mentioned few weeks ago.
>>>>>>>>>
>>>>>>>>> So with admin commands and admin virtqueues, even nested model
>>>>>>>>> will work
>>>>>>>> using [1].
>>>>>>>>> [1]
>>>>>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offlo
>>>>>>>>> ad
>>>>>>>>> -o
>>>>>>>>> n-virtual-machines.html
>>>>>>>> We should take this into consideration once it is standardized in
>>>>>>>> the spec, maybe not now, there can always be many workarounds to
>>>>>>>> solve one
>>>>>> problem.
>>>>>>> Sure, until that point the admin commands are able to suffice the need
>>>> well.
>>>>>>> And when the spec changes in transport occurs (if needed), current
>>>>>>> admin
>>>>>> command and admin vq also fits very well that will follow above [1].
>>>>>> we have pointed lots of problems for admin vq based live migration
>>>>>> proposal, I won't repeat them here
>>>>> I don’t see any.
>>>>> Nested is already solved using above.
>>>> I don't see how, do you mind to work out the patches?
>>> Once the base series is completed, nested cases can be addressed.
>>> I wont be able to work on the patches for it until we finish for the first level virtualization.
>> As you know, nested is supported well in current virtio, so please don't
>> break it.
> So for nesting, it seems cleaner to support sending commands through
> device itself.
I guess this requires a per-VF admin vq or some agents & tricks.
> You aren't going to fit VQ state in a 16 bit register in
> the general case though, and will have to resort to DMA.
Yes, at least we need in-flight descriptor tracking.
Still working with Eugenio on this feature.
> And if you are
> doing that then please just use the admin command format (does not have
> to be a VQ) and then we can all make peace finally.
>
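
For illustration, the register access pattern debated in this sub-thread can
be sketched as a toy model. This is a minimal sketch, not the patch's
implementation: the field names queue_select, queue_avail_state and
queue_used_state come from the patch, while ToyCommonCfg, save_vq_states and
the access counter are hypothetical names introduced only for this example.
It assumes each register read or write is one serialized access, which may
cost one trap when the registers are emulated.

```python
from dataclasses import dataclass


@dataclass
class ToyCommonCfg:
    """Toy model of the per-queue window in the PCI common configuration.

    avail/used hold one 16-bit state value per virtqueue; queue_select
    chooses which queue the two state registers currently expose.
    """
    avail: list
    used: list
    queue_select: int = 0
    accesses: int = 0  # illustrative counter, not part of any spec

    def write_select(self, index):
        self.accesses += 1
        self.queue_select = index

    def read_avail_state(self):
        self.accesses += 1
        return self.avail[self.queue_select]

    def read_used_state(self):
        self.accesses += 1
        return self.used[self.queue_select]


def save_vq_states(cfg, num_queues):
    """Read every queue's state serially, one queue_select at a time."""
    states = []
    for q in range(num_queues):
        cfg.write_select(q)
        states.append((cfg.read_avail_state(), cfg.read_used_state()))
    return states


cfg = ToyCommonCfg(avail=list(range(128)), used=list(range(128)))
states = save_vq_states(cfg, 128)
print(len(states), cfg.accesses)  # 128 queues, 3 accesses each: 128 384
```

The count grows linearly with the number of queues, which is the O(N) cost
both sides acknowledge; the disagreement is over whether that cost is
acceptable at suspend/resume time or whether the state should instead be
passed in memory in one batch.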




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-17 11:04                           ` Michael S. Tsirkin
@ 2023-11-22  1:41                             ` Zhu, Lingshan
  2023-11-22  7:30                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-22  1:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment



On 11/17/2023 7:04 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 17, 2023 at 06:13:50PM +0800, Zhu, Lingshan wrote:
>>
>> On 11/16/2023 8:09 PM, Michael S. Tsirkin wrote:
>>
>>      On Thu, Nov 16, 2023 at 06:09:38PM +0800, Zhu, Lingshan wrote:
>>
>>
>>          On 11/16/2023 1:35 AM, Parav Pandit wrote:
>>
>>                  From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>                  Sent: Monday, November 13, 2023 2:53 PM
>>
>>                  On 11/10/2023 2:31 PM, Parav Pandit wrote:
>>
>>                          From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>                          Sent: Friday, November 10, 2023 11:52 AM
>>
>>                          On 11/9/2023 6:15 PM, Parav Pandit wrote:
>>
>>                                  From: Zhu, Lingshan <lingshan.zhu@intel.com>
>>                                  Sent: Thursday, November 9, 2023 3:28 PM
>>
>>                                  On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
>>
>>                                      On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan wrote:
>>
>>                                          On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
>>
>>                                          On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan wrote:
>>
>>                                          When SUSPEND is set, device states and virtqueue states should
>>                                          be stabilized, therefore the driver should not reset vqs when
>>                                          SUSPEND is set in device status.
>>
>>                                          Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
>>                                          ---
>>                                                content.tex | 3 +++
>>                                                1 file changed, 3 insertions(+)
>>
>>                                          diff --git a/content.tex b/content.tex index bcc9d4b..060b5c2
>>                                          100644
>>                                          --- a/content.tex
>>                                          +++ b/content.tex
>>                                          @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
>>                                          Reset}\label{sec:Basic
>>
>>                                  Facilities of a Virtio Device /
>>
>>                                                The device MUST reset any state of a virtqueue to the default
>>
>>                  state,
>>
>>                                                including the available state and the used state.
>>                                          +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set in
>>                                          +\field{device status}, the driver SHOULD NOT reset any virtqueues.
>>                                          +
>>                                                \drivernormative{\paragraph}{Virtqueue Reset}{Basic
>>                                          Facilities of a
>>
>>                                  Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue Reset}
>>
>>                                                After the driver tells the device to reset a queue, the
>>                                          driver MUST verify that
>>
>>                                          Seems somewhat arbitrary and breaks the claim that the feature
>>                                          is orthogonal and can have uses besides migration.
>>
>>                                          when suspended, the device is frozen.
>>                                          The driver is aware of this process and so should not reset the vqs I
>>
>>                  think.
>>
>>                                      Again that is only true because you want to use it for migration.
>>                                      But then you can't claim it's a generic facility.
>>
>>                                  I don't get it. The device status is a basic facility.
>>
>>                                  We need to SUSPEND the device by setting SUSPEND bit, to stabilize
>>                                  the device states for migration.
>>
>>                              Is the PCI's PM time not enough to suspend the device?
>>                              For large device I could imagine it could be short.
>>
>>                          As you see, PCI PM, so this is a layer violation, virtio should be
>>                          self contained,
>>
>>                      If you think it is layer violation, then suspend bit for sure is not needed. PCI
>>
>>                  PM interface should suspend/resume the device on D0<->D3 state transitions.
>>                  Doesn't make sense logically, because it is layer violation, so you want it to be
>>                  worse? For example, virtio writes 0 to device status to reset a device, not by PCI.
>>
>>              All these layer violation thing is just abstract to me.
>>              Your argument contradicts with your fellow author and yourself.
>>
>>          I don't see how, we keep telling you virtio should be self contained, and
>>          suspend by PCI PM is a
>>          layer violation, this is a fact, right?
>>
>>      Not really. Look at the charter - when available we should use platform
>>      capabilities because it makes it easier to write drivers.
>>
>> I think that is transport specific implementation, for example pci common cfg.
>>
>>
>>
>>
>>              I don’t want to make it worse.
>>              If you think its layer violation, just depend on the PCI PM, no need to include new suspend bit.
>>
>>          Again, virtio should be self-contained, not a layer violation; for example, we
>>          reset virtio devices
>>          by writing 0 to device status, not by PCI FLR.
>>
>>      There are some advantage to doing it like this, e.g. one does not need
>>      to save and restore config space. What are advantages of suspend via this
>>      bit?
>>
>> suspend a device by the device status is the same as how we enable a virtio
>> device.
>>
>> Doing this by PCI is clearly a layer violation, and does not work for other
>> transports.
>>
>>
>>
>>                          and what about MMIO and CCW?
>>
>>                      They have largely lacked the richness of PCI transport. So those transport
>>
>>                  needs to evolve.
>>                  I am not sure CCW and MMIO maintainers want to hear this.
>>
>>                      Otherwise, PCI offers rich transport facilities compared to MMIO, hence, it will
>>
>>                  continue wider use.
>>                  you know this SUSPEND bit work fine on all transport, right? Because
>>                  device_status is transport independent.
>>
>>              I want to emphasize that I am not against the suspend bit as long as it is guest driver controlled without interfering the device migration flow (like rest of the state).
>>
>>          When migrate a device, it is the host who suspends the device. The reason is
>>          the live migration process should be transparent to
>>          the guest, so we should suspend the guest first, then suspend the device(by
>>          host).
>>
>>              The practical reason for suspending functionality under guest control is, that resuming/suspending the large device can take time.
>>              So let it be in guest driver control. No need to muddy with device migration flow.
>>
>>          The time cost is reasonable in O(N) no matter how you suspend/resume the
>>          device.
>>
>>      Very much depends. Big O notation can be misleading. If you have to
>>      repeat an operation 1000 times that's 1000 * N and suddenly you are
>>      going from milliseconds to seconds.
>>
>> I mean enabling 100 queues costs more time than enabling 1 vq no matter
>> how we enable it. that is O(N)
> Depends on what "that" is. Number of VM exits does not have to be O(N),
> you can pass these 100 queues in memory.
For batching, yes. But I don't see this as a problem because we have enabled
vqs this way for many years, so far so good.
>
>
>>
>>
>>                          This should be a basic facility.
>>
>>                      Other transport can also offer like PCI.
>>
>>                  Do you want to work for these transport? Implementing the new features as
>>                  PCI?
>>
>>              Not presently as PCI as more features than rest of the two.
>>              What I read about ccw is: " S/390 based virtual machines support neither PCI nor MMIO".
>>
>>              And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".
>>
>>              So I don’t know who needs to extend ccw.
>>              And if one needs, those maintainers will extend it to match to PCI standard.
>>
>>          So these features are even not planned, so don't depend on them.
>>
>>      But again can one suspend ccw device? If you are adding this feature and
>>      claiming it's supported for all transports you better find out
>>      what does it do.
>>
>> I am not an expert on CCW, anything block we suspend a CCW device by this bit?
> I don't think CCW supports suspend at all.
I think it is not a transport feature but a device feature:
the device can always suspend itself, e.g. stop processing data
and stop responding until a specific signal arrives.
>
>> This seems only controlled by the device itself.
>>
> And? What it the point of suspending only the device if rest of system
> is still going?
That is an orchestration issue, totally up to the administrators.
Normally when suspending the device, the guest is very likely
to be suspended already.
>
>>
>>                              In that case if there is suspend the device available, it will be
>>                              used by the
>>
>>                          guest driver itself, hypervisor wouldn’t know about it when those
>>                          registers are not trapped.
>>
>>                              So we need two ways to suspend.
>>                              One is guest visible, and guest controlled.
>>                              Second is hypervisor control to fulfill the device migration needs.
>>
>>                          The guest can even reset the device.
>>
>>                              So if you can please take a look if the proposed admin command to
>>
>>                          freeze/stop mode can be used in the emulated register case or not.
>>
>>                              It helps to have the suspend bit in guest control as well
>>                              with/without
>>
>>                          emulation mode.
>>                          Parav, please believe I have read your series, I didn't comment there
>>                          because I want to avoid further conflicts/debating, we have done these
>>
>>                  enough.
>>
>>                      I believe the series posted in v3 can support vdpa use case as well.
>>                      So I will progress to post v4.
>>
>>
>>                          As explained before, freeze/stop the device by PCI is a layer violation.
>>
>>                      I am afraid, we have different vision.
>>                      I don’t see any layer violation.
>>                      Suspend is enough in the PCI PM.
>>                      Our vision is more aligned with rest of the hypervisor knobs that owns the
>>
>>                  migration framework.
>>                  I think I have explained, virtio builds on other transport and it should be self-
>>                  contained, so far so good.
>>
>>              Virtio without any transport binding is just blank paper discussion.
>>
>>          virtio is built on some transports, but not bind to any.
>>
>>      Binding is an OS specific thing, but e.g. under Linux transport drivers bind to
>>      devices then virtio drivers bind to virtio bus. No binding -> nothing
>>      works.
>>
>> I think general facilities are better not only work on a specific transport
>>
> But platform facilities are even better we don't need to work on them at
> all.
Yes, so I also agree with tracking dirty pages by the platform: on-CPU dirty page
tracking facilities serve all transports, not only PCI.
>
>
>>                          And device status can be pass-through(without emulation, just map it
>>                          to
>>                          guest) to the guest or trapped(trap and emulate by the hypervisor,
>>                          for example set_status in vDPA).
>>
>>                      When it is pass-through, it is controlled by the guest, so for example, if the
>>
>>                  guest resets the device, hypervisor has lost the control of migration context etc.
>>
>>                      Hence, hypervisor needs a channel which is not guest owned.
>>
>>                      Same channel can work when trap+emulation is done.
>>
>>                  It is the guest owns the device, it can reset the device, once reset, the device
>>                  context are cleared.
>>
>>              Hypervisor do not have the ability to read/write the device context. It lost the channel as hypervisor is not involved in trap+emulation.
>>              So it is not helpful in one use case.
>>
>>              Admin commands can work even with trap+emulation mode.
>>
>>              What is missing, that should be added?
>>
>>          as explained above, when live migration, the guest should be suspended
>>          first, at this point,
>>          the host owns the device, it has access to the device.
>>
>>      Where do you say this in the spec patch?
>>
>> VM live migration is not in this spec.
> Then it should be.
>
>> If we suspend the device first, then the guest may detect IO errors.
>>
> That's bad. So you need to tell driver what not to do so as not to get
> errors.
I think the process should be to suspend the guest first; then the host
owns the device, so it can suspend the device and collect the necessary data
for live migration.
>
>>
>>                                  This can also be used for debugging I think.
>>
>>                              As Michael listed, a dedicated debug interface is usually more
>>                              useful instead
>>
>>                          of in-band.
>>                          re-using another facility without extra efforts is not a bad thing anyway.
>>
>>                      I just don’t see how a suspend bit some debug feature.
>>                      Almost everything with that regard is a debug feature to me.
>>
>>                  suspend then check the device states?
>>
>>              You already suspended the device, so device state is already changed.
>>              All debug information is changed, so not useful now.
>>
>>          When suspended, the device should keep and stabilize its device states,
>>          at least in my series it should behave like this.
>>
>>      That's vague. What does it mean exactly and what happens if
>>      some external event causes state change?
>>
>> it is suspended, somehow like powered-down, so it should not
>> respond to the events until resume.
> "somehow" is too vague for the spec.
Yeah, in the spec, we have a section to describe what the device should do
when SUSPEND is set.
>
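
The behaviour argued for in this message (a suspended device keeps its state
stable so the host can collect it) can be sketched as a toy model. This is
only a sketch under stated assumptions: DRIVER_OK uses the value from the
virtio spec, but the SUSPEND bit value, the ToyDevice class and its methods
are hypothetical and are not taken from the patch series.

```python
DRIVER_OK = 0x04  # device status bit defined by the virtio spec
SUSPEND = 0x10    # hypothetical value, chosen only for this sketch


class ToyDevice:
    """Toy device that freezes its virtqueue state while SUSPEND is set."""

    def __init__(self):
        self.status = DRIVER_OK
        self.last_avail_idx = 0

    def write_status(self, value):
        self.status = value

    def process_one_buffer(self):
        # While suspended the device consumes nothing, so the state
        # read back by the host cannot change underneath it.
        if self.status & SUSPEND:
            return False
        self.last_avail_idx = (self.last_avail_idx + 1) & 0xFFFF
        return True


dev = ToyDevice()
dev.process_one_buffer()                 # normal operation
dev.write_status(dev.status | SUSPEND)   # host suspends the device
saved = dev.last_avail_idx               # stable snapshot for migration
progressed = dev.process_one_buffer()    # no progress while suspended
```

The model deliberately says nothing about resume or about external events
arriving while suspended, since those are exactly the points the thread
leaves open.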




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-21 21:18                       ` Michael S. Tsirkin
@ 2023-11-22  1:51                         ` Zhu, Lingshan
  2023-11-22  6:47                           ` Parav Pandit
  2023-11-22  6:49                           ` Michael S. Tsirkin
  2023-11-22  5:28                         ` Jason Wang
  1 sibling, 2 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-22  1:51 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Parav Pandit, eperezma, cohuck, stefanha, virtio-comment



On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
>>> Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
>>> Instead of claiming it as some non_device_migration facility does not make sense.
>> It is used for migration for sure.
> Well having a generic facility to stop device sounds like a nice thing.
> However the devil is in the detail. A lot of detail here seems very much
> tailored to a very specific implementation in mind.
> So thinking through how it will work e.g. for power management
> would be a good exercise to figure out how it should work in detail.
> Parav did you indicate at some point a virtio specific SUSPEND
> bit can be useful for PM? Could you share how it's better than
> transport level PM and what the requirements are?
> Thanks!
Do you mean letting the device enter a new power state when SUSPEND is set,
and adding such a description to transport-pci.tex? Then resume the normal
state on DRIVER_OK.
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-21 16:26                                           ` Parav Pandit
@ 2023-11-22  4:15                                             ` Jason Wang
  2023-11-22  7:15                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-22  4:15 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, mst, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 12:26 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 10:01 AM
> >
> > On Fri, Nov 17, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >

[...]

> > > Sorry, no virtio specification does not support device migration today.
> > > Nothing is broken by adding new features.
> > >
> > > Above [1] has the right proposal that Jason's paper pointed out. Please use
> > it.
> >
> > I was involved in the design in [1]. And I don't see a connection to the
> > discussion here
> >
> > 1) It is based on vDPA in L0
> > 2) It doesn't address the nesting issue, it requires a proper design in the virtio
> > spec to support migration in the nesting layer.
>
> Nothing prevents [1] to be done without vdpa.

Well, Qemu has SR-IOV emulation for IGB. What's the point of the above
reply? If a hypervisor wishes, it can be done with any device on L0.

We are discussing the possibility of migrating nested VMs, not the
possibility of emulating SR-IOV, no?

Thanks




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-21 16:32                       ` Parav Pandit
@ 2023-11-22  5:28                         ` Jason Wang
  2023-11-22  6:11                           ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-22  5:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

On Wed, Nov 22, 2023 at 12:32 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 1:03 PM
> >
> > On Thu, Nov 16, 2023 at 1:27 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Thursday, November 16, 2023 9:50 AM
> > > >
> > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 13, 2023 9:05 AM
> > > > > >
> > > > > > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > > > > >
> > > > > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu, Lingshan
> > wrote:
> > > > > > > > >>
> > > > > > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu Lingshan
> > wrote:
> > > > > > > > >>>> When SUSPEND is set, device states and virtqueue states
> > > > > > > > >>>> should be stablized, therefore the driver should not
> > > > > > > > >>>> reset vqs when SUSPEND is set in device status.
> > > > > > > > >>>>
> > > > > > > > >>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > >>>> ---
> > > > > > > > >>>>    content.tex | 3 +++
> > > > > > > > >>>>    1 file changed, 3 insertions(+)
> > > > > > > > >>>>
> > > > > > > > >>>> diff --git a/content.tex b/content.tex index
> > > > > > > > >>>> bcc9d4b..060b5c2
> > > > > > > > >>>> 100644
> > > > > > > > >>>> --- a/content.tex
> > > > > > > > >>>> +++ b/content.tex
> > > > > > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > > > > > >>>> Reset}\label{sec:Basic
> > > > > > > > Facilities of a Virtio Device /
> > > > > > > > >>>>    The device MUST reset any state of a virtqueue to
> > > > > > > > >>>> the default
> > > > state,
> > > > > > > > >>>>    including the available state and the used state.
> > > > > > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is set
> > > > > > > > >>>> +in \field{device status}, the driver SHOULD NOT reset
> > > > > > > > >>>> +any
> > > > virtqueues.
> > > > > > > > >>>> +
> > > > > > > > >>>>    \drivernormative{\paragraph}{Virtqueue Reset}{Basic
> > > > > > > > >>>> Facilities of a
> > > > > > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue
> > > > > > > > Reset}
> > > > > > > > >>>>    After the driver tells the device to reset a queue,
> > > > > > > > >>>> the driver MUST verify that
> > > > > > > > >>> Seems somewhat arbitrary and breaks the claim that the
> > > > > > > > >>> feature is orthogonal and can have uses besides migration.
> > > > > > > > >> when suspended, the device is frozen.
> > > > > > > > >> The driver is aware of this process and so should not
> > > > > > > > >> reset the vqs I
> > > > think.
> > > > > > > > > Again that is only true because you want to use it for migration.
> > > > > > > > > But then you can't claim it's a generic facility.
> > > > > > > > I don't get it. The device status is a basic facility.
> > > > > > > >
> > > > > > > > We need to SUSPEND the device by setting SUSPEND bit, to
> > > > > > > > stabilize the device states for migration.
> > > > > > > Is the PCI's PM time not enough to suspend the device?
> > > > > >
> > > > > > Are you saying we don't need virtio reset assuming we had FLR?
> > > > > >
> > > > > No. often FLR timing is not enough. Hence every PCI level device
> > > > > has some
> > > > sort of its own reset mechanism.
> > > > >
> > > > > > Suspending at different layers like rest at different layers.
> > > > > >
> > > > > > We have both FLR and virtio reset. The Virtio level function
> > > > > > could be reset without FLR. So did suspend.
> > > > > >
> > > > > > That's it.
> > > > > Sure, but wrapping it under some "basic facility" is just does not make
> > sense.
> > > >
> > > > Why, device status (e.g reset) belongs to that part.
> > > >
> > > Lingshan claimed that suspending device is for live migration in commit log
> > and in discussion he portray it as some basic facility unrelated to device
> > migration such as debug etc.
> > > Instead of claiming it as some non_device_migration facility does not make
> > sense.
> >
> > It is used for migration for sure.
> This is why it does not work when the device is directly mapped.

We are going in circles. It works for the trap/emulation case.

For direct mapping:

You claim guest reset can work but suspend can't?

> The hypervisor is messing with this bit while the guest is also doing power management with it.

So I don't see how it differs from virtio reset and FLR, which are also
under the control of the guest.

>
> Both of them need separate channels to do their own work.
>
> >
> > >
> > > > >
> > > > > >
> > > > > > And if you want to rule P2P behaviours, PCI PM is really the
> > > > > > correct way to go instead of trying to do it at the virtio layer.
> > > > > >
> > > > > PCI PM is supposed to be controlled by the guest and so the suspend.
> > > >
> > > > I've listed issues about D3cold and others, I can't believe it can't
> > > > be controlled totally by guests.
> > > >
> > > D3cold is not controlled by the driver as defined by the PCI spec hence it is
> > not applicable.
> >
> > Have you seen the link I give you? Even if you are right, there still could be such
> > a request from the firmware, no?
> I may have missed the link.
> You have 10 replies, so it is easy to miss important things in the rest of the comments.

I meant there still could be D3cold requests from the guest via
virtual firmware.

So it's not necessarily related to the guest driver.

>
> >
> > > D3hot is controlled by the driver.
> >
> > So, it requires the device context to be preserved, which is not documented in
> > your patch.
> PCI PM interactions are covered in v4 in the device requirements section.
>
> >
> > > > >
> > > > > Hypervisor needs its channel to suspend the device, as
> > > > > fundamentally guest is
> > > > unaware of device migration flow.
> > > >
> > > > That's pretty fine, the hypervisor also needs its channel to reset
> > > > the device. If you think there's a conflict with suspend, there should be one
> > for reset as well.
> > > >
> > > I don’t see a need for hypervisor to reset the device in passthrough mode.
> > Can you explain why is it needed?
> >
> > Qemu has a command "system_reset".
> >
> I mean, what does this translate to for resetting the device in passthrough mode?

It needs to reset the virtio device.

> If this is FLR, it is there.

Please explain how it works. (It's not only an FLR; it also needs a
virtio-level reset.)
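
To make the reset-versus-suspend distinction concrete, here is a minimal
sketch in C. It is a toy model and not code from this series. The four
status bits are the ones already defined by the virtio spec, and writing 0
to device status is the spec's virtio-level reset. The SUSPEND bit value,
the struct and the helper names are assumptions made up for illustration;
last_avail_idx and last_used_idx stand in for the virtqueue state from the
cover letter.

```c
#include <assert.h>
#include <stdint.h>

/* Device status bits defined by the virtio specification. */
#define VIRTIO_STATUS_ACKNOWLEDGE 1u
#define VIRTIO_STATUS_DRIVER      2u
#define VIRTIO_STATUS_DRIVER_OK   4u
#define VIRTIO_STATUS_FEATURES_OK 8u

/* ASSUMPTION: placeholder value for the proposed SUSPEND bit.
 * The real value is not stated anywhere in this thread. */
#define TOY_STATUS_SUSPEND 16u

/* Hypothetical toy device: the status byte plus one virtqueue's state. */
struct toy_virtio_dev {
    uint8_t  status;
    uint16_t last_avail_idx;
    uint16_t last_used_idx;
};

/* Virtio-level reset: the driver writes 0 to device status and the
 * device returns to its initial state, dropping virtqueue state. */
static inline void toy_virtio_reset(struct toy_virtio_dev *dev)
{
    dev->status = 0;
    dev->last_avail_idx = 0;
    dev->last_used_idx = 0;
}

/* Suspend as discussed here: set one more status bit and leave the
 * virtqueue state untouched so that it can be read back afterwards. */
static inline void toy_virtio_suspend(struct toy_virtio_dev *dev)
{
    /* ASSUMPTION: suspend only makes sense on a device that is live. */
    assert(dev->status & VIRTIO_STATUS_DRIVER_OK);
    dev->status |= TOY_STATUS_SUSPEND;
}
```

The point of the sketch is only that both operations go through the same
device status field, so whoever owns that field (guest or hypervisor)
owns both reset and suspend.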

>
> > > Do you mean, it is needed in vdpa mode? If yes, the registers are emulated
> > anyway, so why the member device's native channel cannot be used in vdpa
> > mode?
> > >
> > > > >
> > > > > > > For large device I could imagine it could be short.
> > > > > > >
> > > > > > > In that case if there is suspend the device available, it will
> > > > > > > be used by the guest
> > > > > > driver itself, hypervisor wouldn’t know about it when those
> > > > > > registers are not trapped.
> > > > > > > So we need two ways to suspend.
> > > > > > > One is guest visible, and guest controlled.
> > > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > > >
> > > > > > Can you explain why suspend is special but not reset or why
> > > > > > reset can work but not suspend? If reset can work, so does
> > > > > > suspend. If reset can't, neither does suspend.
> > > > > >
> > > > > As long as reset and suspend both are under guest control, I am fine.
> > > >
> > > > Well, you seem to ignore my question below. Hypervisor needs to
> > > > reset the device as well.
> > > >
> > > Why is it needed in passthrough mode?
> > >
> > > > >
> > > > > > For example, can you explain how a system_reset in Qemu can work
> > > > > > with your proposal?
> > > > > >
> > > > > > >
> > > > > > > So if you can please take a look if the proposed admin command
> > > > > > > to
> > > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > > >
> > > > > > Again, if you design those for PCI, it's a layer violation. You
> > > > > > have answered
> > > > > They are used by the PCI layer, just like your suspend bit.
> > > > > Andy other transport can also use it.
> > > > >
> > > > > > yourself that PM is the right way to go.
> > > > > >
> > > > > > > It helps to have the suspend bit in guest control as well
> > > > > > > with/without
> > > > > > emulation mode.
> > > > > >
> > > > > > I won't repeat it again. You will find you need a full transport
> > > > > > to satisfy all the requirements.
> > > > > I disagree for full transport.
> > > >
> > > > See above and the discussion in another thread.
> > > >
> > > > > If you want to get discuss transport for sure it is some other
> > > > > thread and I want to see "driver notifications via such transport
> > > > > VQ" to fully qualify it
> > > > as transport, And that would be just sub-optimal for actual working.
> > > >
> > > > Sub-optimal since the function is duplicated with a transport but it
> > > > doesn't claim or design as a transport.
> > > >
> > > It is not sub-optimal because of duplication. It is because you want to
> > transport notifications via virtqueue.
> >
> > Have you ever read the series of tvq? You won't get this conclusion if you do
> > that.
> >
> I have read those 4 patches and I have seen that transportvq does not want to transport notifications.
> Hence it does not qualify as a transport vq.

It exposes the platform MMIO area for driver notification. This is
sufficient. Any issue you see?

>
> Frankly, transport vq seems a way to formalize mediation forever in virtio.

Nope, it can be accessed by a guest driver directly.

> It is a very weird way to build a new SIOV device.
> For most things it should be the direct channel that virtio already has from the driver to the device.

See above. SIOV might require a new transport or not.

>
>
> > >
> > > > > And hence, I wouldn’t call it a transport anymore.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > This can also be used for debugging I think.
> > > > > > >
> > > > > > > As Michael listed, a dedicated debug interface is usually more
> > > > > > > useful instead
> > > > > > of in-band.
> > > > > >
> > > > > > Well, I've shown you the in-band facilities like debugging via
> > > > > > ethtool and kernel has a lot of other ones. If you have ever
> > > > > > tried to debug in a real production environment, you will find
> > > > > > how useful such handy information is where out-of- band
> > > > > > facilities are often dangerous
> > > > and usually prohibited or even unsupported.
> > > > > Guest driver can always read and write the device status without
> > > > > adding a
> > > > suspend bit.
> > > >
> > > > I don't get here. Suspend make sure the device state is frozen which
> > > > helps for debugging for sure.
> > > You wanted to debug some vq live, you suspend the device, the vq state got
> > changed.
> > >
> > > I just don’t see that suspend is a debug tool.
> >
> > It's not a tool, it's a function that can be used as a debug tool.
> >
> > > Every feature is a debug feature literally.
> > > Classic heisenbug effect.
> > >
> > > Once can change driver notification frequency to see if interrupt rate
> > changed for debugging.
> > > One can disabled few RQs and see RSS...
> > > Blk can change blk_size to higher value to perf debug..
> > > The list continues..
> >
> > Let's not shift concepts.
> >
> Your comment characterizing device migration as a debug feature is what actually shifts the concept.

It's not.

Lingshan put it in the basic facilities as part of device status. You
wondered why; we explained that it can be used beyond migration. You asked
where; we told you, for example, debugging. We never claimed
it can only be used for debugging. Then you shifted the concept to say that
debugging could be achieved by a lot of other facilities. That is certainly
correct, but it has nothing to do with the discussion
here.

>
> > Obviously, suspend is not the only way to debug. But that's not the context
> > here.
> >
> I have no further comments on the claim that suspending a device is a debug feature.
> If it is, add a debug section and put it under that.
> You also know that it is not, so let's not waste our time.
>
> I just don’t see the suspend bit as a debug interface that undergoes the classic heisenbug effect.

I never said it can be used for solving all problems. Anyhow, that's
another topic, right?

Thanks

>
> > Thanks
> >
> > >
> > > >
> > > > Thanks
> > > >
> > > > >
> > >
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-21 21:18                       ` Michael S. Tsirkin
  2023-11-22  1:51                         ` Zhu, Lingshan
@ 2023-11-22  5:28                         ` Jason Wang
  2023-11-22  6:32                           ` Parav Pandit
  1 sibling, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-22  5:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
> > > Instead of claiming it as some non_device_migration facility does not make sense.
> >
> > It is used for migration for sure.
>
> Well having a generic facility to stop device sounds like a nice thing.
> However the devil is in the detail. A lot of detail here seems very much
> tailored to a very specific implementation in mind.
> So thinking through how it will work e.g. for power management
> would be a good exercise to figure out how it should work in detail.

It might work in the case where there's no PM support in the
transport, e.g., for MMIO devices.

Thanks


> Parav did you indicate at some point a virtio specific SUSPEND
> bit can be useful for PM? Could you share how it's better than
> transport level PM and what the requirements are?
> Thanks!
>
> --
> MST
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  5:28                         ` Jason Wang
@ 2023-11-22  6:11                           ` Parav Pandit
  2023-11-24  3:35                             ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-22  6:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 10:58 AM
> 
> On Wed, Nov 22, 2023 at 12:32 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 1:03 PM
> > >
> > > On Thu, Nov 16, 2023 at 1:27 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Thursday, November 16, 2023 9:50 AM
> > > > >
> > > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, November 13, 2023 9:05 AM
> > > > > > >
> > > > > > > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > > > > > >
> > > > > > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu,
> > > > > > > > > > Lingshan
> > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > > > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu
> > > > > > > > > >>> Lingshan
> > > wrote:
> > > > > > > > > >>>> When SUSPEND is set, device states and virtqueue
> > > > > > > > > >>>> states should be stablized, therefore the driver
> > > > > > > > > >>>> should not reset vqs when SUSPEND is set in device status.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Signed-off-by: Zhu Lingshan
> > > > > > > > > >>>> <lingshan.zhu@intel.com>
> > > > > > > > > >>>> ---
> > > > > > > > > >>>>    content.tex | 3 +++
> > > > > > > > > >>>>    1 file changed, 3 insertions(+)
> > > > > > > > > >>>>
> > > > > > > > > >>>> diff --git a/content.tex b/content.tex index
> > > > > > > > > >>>> bcc9d4b..060b5c2
> > > > > > > > > >>>> 100644
> > > > > > > > > >>>> --- a/content.tex
> > > > > > > > > >>>> +++ b/content.tex
> > > > > > > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > > > > > > >>>> Reset}\label{sec:Basic
> > > > > > > > > Facilities of a Virtio Device /
> > > > > > > > > >>>>    The device MUST reset any state of a virtqueue
> > > > > > > > > >>>> to the default
> > > > > state,
> > > > > > > > > >>>>    including the available state and the used state.
> > > > > > > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is
> > > > > > > > > >>>> +set in \field{device status}, the driver SHOULD
> > > > > > > > > >>>> +NOT reset any
> > > > > virtqueues.
> > > > > > > > > >>>> +
> > > > > > > > > >>>>    \drivernormative{\paragraph}{Virtqueue
> > > > > > > > > >>>> Reset}{Basic Facilities of a
> > > > > > > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue
> > > > > > > > > Reset}
> > > > > > > > > >>>>    After the driver tells the device to reset a
> > > > > > > > > >>>> queue, the driver MUST verify that
> > > > > > > > > >>> Seems somewhat arbitrary and breaks the claim that
> > > > > > > > > >>> the feature is orthogonal and can have uses besides
> migration.
> > > > > > > > > >> when suspended, the device is frozen.
> > > > > > > > > >> The driver is aware of this process and so should not
> > > > > > > > > >> reset the vqs I
> > > > > think.
> > > > > > > > > > Again that is only true because you want to use it for migration.
> > > > > > > > > > But then you can't claim it's a generic facility.
> > > > > > > > > I don't get it. The device status is a basic facility.
> > > > > > > > >
> > > > > > > > > We need to SUSPEND the device by setting SUSPEND bit, to
> > > > > > > > > stabilize the device states for migration.
> > > > > > > > Is the PCI's PM time not enough to suspend the device?
> > > > > > >
> > > > > > > Are you saying we don't need virtio reset assuming we had FLR?
> > > > > > >
> > > > > > No. often FLR timing is not enough. Hence every PCI level
> > > > > > device has some
> > > > > sort of its own reset mechanism.
> > > > > >
> > > > > > > Suspending at different layers like rest at different layers.
> > > > > > >
> > > > > > > We have both FLR and virtio reset. The Virtio level function
> > > > > > > could be reset without FLR. So did suspend.
> > > > > > >
> > > > > > > That's it.
> > > > > > Sure, but wrapping it under some "basic facility" is just does
> > > > > > not make
> > > sense.
> > > > >
> > > > > Why, device status (e.g reset) belongs to that part.
> > > > >
> > > > Lingshan claimed that suspending device is for live migration in
> > > > commit log
> > > and in discussion he portray it as some basic facility unrelated to
> > > device migration such as debug etc.
> > > > Instead of claiming it as some non_device_migration facility does
> > > > not make
> > > sense.
> > >
> > > It is used for migration for sure.
> > This is why it is not working when device is directly mapped.
> 
> We circle back. It works for the case of trap/emulation.
> 
> For direct mapping:
> 
> You claim guest reset can work but suspend can't?
No. Guest reset is to be done by the guest.
Suspend for PM is also to be done by the guest.

The hypervisor will have two modes, stop and freeze, as admin operations for the device migration flow, without telling the guest driver about it.

> 
> > The hypervisor messing this bit and guest is also doing power management
> with it.
> 
> So I don't see how it differs from: virtio reset and FLR is under the control of
> guests.
> 
And suspend for power management is under the control of the guest too.

> >
> > Both of them needs separate channel to do their own work.
> >
> > >
> > > >
> > > > > >
> > > > > > >
> > > > > > > And if you want to rule P2P behaviours, PCI PM is really the
> > > > > > > correct way to go instead of trying to do it at the virtio layer.
> > > > > > >
> > > > > > PCI PM is supposed to be controlled by the guest and so the suspend.
> > > > >
> > > > > I've listed issues about D3cold and others, I can't believe it
> > > > > can't be controlled totally by guests.
> > > > >
> > > > D3cold is not controlled by the driver as defined by the PCI spec
> > > > hence it is
> > > not applicable.
> > >
> > > Have you seen the link I give you? Even if you are right, there
> > > still could be such a request from the firmware, no?
> > I may have missed the link.
> > You have 10 replies, so it is easy to miss important things in rest of the
> comments.
> 
> I meant there still could be D3cold requests from the guest via virtual
> firmware.
> 
So it will deliver PME.

> So it's not necessarily related to the guest driver.
> 
> >
> > >
> > > > D3hot is controlled by the driver.
> > >
> > > So, it requires the device context to be preserved, which is not
> > > documented in your patch.
> > PCI PM interactions is covered in v4 in the device requirements section.
> >
> > >
> > > > > >
> > > > > > Hypervisor needs its channel to suspend the device, as
> > > > > > fundamentally guest is
> > > > > unaware of device migration flow.
> > > > >
> > > > > That's pretty fine, the hypervisor also needs its channel to
> > > > > reset the device. If you think there's a conflict with suspend,
> > > > > there should be one
> > > for reset as well.
> > > > >
> > > > I don’t see a need for hypervisor to reset the device in passthrough
> mode.
> > > Can you explain why is it needed?
> > >
> > > Qemu has a command "system_reset".
> > >
> > I mean, what does this translate to reset the device in passthrough mode?
> 
> It needs to reset the virtio device.
> 
> > If this is FLR, it is there.
> 
> Please explain how it works. (It's not only a FLR, it also need virtio level reset)
> 
FLR obviously covers the virtio-level reset, as FLR covers the PCI + virtio reset.

> >
> > > > Do you mean, it is needed in vdpa mode? If yes, the registers are
> > > > emulated
> > > anyway, so why the member device's native channel cannot be used in
> > > vdpa mode?
> > > >
> > > > > >
> > > > > > > > For large device I could imagine it could be short.
> > > > > > > >
> > > > > > > > In that case if there is suspend the device available, it
> > > > > > > > will be used by the guest
> > > > > > > driver itself, hypervisor wouldn’t know about it when those
> > > > > > > registers are not trapped.
> > > > > > > > So we need two ways to suspend.
> > > > > > > > One is guest visible, and guest controlled.
> > > > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > > > >
> > > > > > > Can you explain why suspend is special but not reset or why
> > > > > > > reset can work but not suspend? If reset can work, so does
> > > > > > > suspend. If reset can't, neither does suspend.
> > > > > > >
> > > > > > As long as reset and suspend both are under guest control, I am fine.
> > > > >
> > > > > Well, you seem to ignore my question below. Hypervisor needs to
> > > > > reset the device as well.
> > > > >
> > > > Why is it needed in passthrough mode?
> > > >
> > > > > >
> > > > > > > For example, can you explain how a system_reset in Qemu can
> > > > > > > work with your proposal?
> > > > > > >
> > > > > > > >
> > > > > > > > So if you can please take a look if the proposed admin
> > > > > > > > command to
> > > > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > > > >
> > > > > > > Again, if you design those for PCI, it's a layer violation.
> > > > > > > You have answered
> > > > > > They are used by the PCI layer, just like your suspend bit.
> > > > > > Andy other transport can also use it.
> > > > > >
> > > > > > > yourself that PM is the right way to go.
> > > > > > >
> > > > > > > > It helps to have the suspend bit in guest control as well
> > > > > > > > with/without
> > > > > > > emulation mode.
> > > > > > >
> > > > > > > I won't repeat it again. You will find you need a full
> > > > > > > transport to satisfy all the requirements.
> > > > > > I disagree for full transport.
> > > > >
> > > > > See above and the discussion in another thread.
> > > > >
> > > > > > If you want to get discuss transport for sure it is some other
> > > > > > thread and I want to see "driver notifications via such
> > > > > > transport VQ" to fully qualify it
> > > > > as transport, And that would be just sub-optimal for actual working.
> > > > >
> > > > > Sub-optimal since the function is duplicated with a transport
> > > > > but it doesn't claim or design as a transport.
> > > > >
> > > > It is not sub-optimal because of duplication. It is because you
> > > > want to
> > > transport notifications via virtqueue.
> > >
> > > Have you ever read the series of tvq? You won't get this conclusion
> > > if you do that.
> > >
> > I have read those 4 patches and I have seen that transportvq do not want to
> transport notifications.
> > Hence it does not qualify as transport vq.
> 
> It exposes the platform MMIO area for driver notification. This is sufficient.
> Any issue you see?
Yes, the issue is, it is not transporting the driver notifications.
Hence, it is not a transport virtqueue.

> 
> >
> > Frankly, transport vq seems a way to formalize mediation forever in virtio.
> 
> Nope, it can be accessed by a guest driver directly.
> 
> > It is very weird way to build new SIOV device.
> > For most things it should be the direct channel that virtio has already from
> driver to the device.
> 
> See above. SIOV might require a new transport or not.
> 
It depends on the performance tests that Lingshan will show at scale.

> >
> >
> > > >
> > > > > > And hence, I wouldn’t call it a transport anymore.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > This can also be used for debugging I think.
> > > > > > > >
> > > > > > > > As Michael listed, a dedicated debug interface is usually
> > > > > > > > more useful instead
> > > > > > > of in-band.
> > > > > > >
> > > > > > > Well, I've shown you the in-band facilities like debugging
> > > > > > > via ethtool and kernel has a lot of other ones. If you have
> > > > > > > ever tried to debug in a real production environment, you
> > > > > > > will find how useful such handy information is where out-of-
> > > > > > > band facilities are often dangerous
> > > > > and usually prohibited or even unsupported.
> > > > > > Guest driver can always read and write the device status
> > > > > > without adding a
> > > > > suspend bit.
> > > > >
> > > > > I don't get here. Suspend make sure the device state is frozen
> > > > > which helps for debugging for sure.
> > > > You wanted to debug some vq live, you suspend the device, the vq
> > > > state got
> > > changed.
> > > >
> > > > I just don’t see that suspend is a debug tool.
> > >
> > > It's not a tool, it's a function that can be used as a debug tool.
> > >
> > > > Every feature is a debug feature literally.
> > > > Classic heisenbug effect.
> > > >
> > > > Once can change driver notification frequency to see if interrupt
> > > > rate
> > > changed for debugging.
> > > > One can disabled few RQs and see RSS...
> > > > Blk can change blk_size to higher value to perf debug..
> > > > The list continues..
> > >
> > > Let's not shift concepts.
> > >
> > Your comment to attribute device migration as debug feature is actually
> shifting the concept.
> 
> It's not.
> 
> Ling Shan put it in the basic facilities as part of device status. You wonder why,
> we explained it can be used beyond migration. You asked where, we told you
> for example things like debugging. We never claim it can only be used in debug.
> Then you shift the concept to say debug could be achieved by a lot of other
> facilities. For sure this is correct, but it doesn't have any relationship with the
> discussion here.
> 
I don’t see the point of wasting time here.
If it's debug, it's debug.
If it's migration, it's migration.
If it's PM, it's PM.

> >
> > > Obviously, suspend is not the only way to debug. But that's not the
> > > context here.
> > >
> > I have no further comments on the claim that suspending a device is a debug
> > feature.
> > If it is, add a debug section and put it under that.
> > You also know that it is not, so let's not waste our time.
> >
> > I just don't see the suspend bit as a debug interface that undergoes the
> > classic heisenbug effect.
> 
> I never say it can be used for solving all problems. Anyhow that's another topic
> right?
> 
> Thanks
> 
> >
> > > Thanks
> > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > >
> >


^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  5:28                         ` Jason Wang
@ 2023-11-22  6:32                           ` Parav Pandit
  2023-11-24  3:25                             ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-22  6:32 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 10:59 AM
> 
> On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> >
> > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > Lingshan claimed that suspending device is for live migration in commit
> > > > log and in discussion he portray it as some basic facility unrelated to device
> > > > migration such as debug etc.
> > > > Instead of claiming it as some non_device_migration facility does not
> > > > make sense.
> > >
> > > It is used for migration for sure.
> >
> > Well having a generic facility to stop device sounds like a nice thing.
> > However the devil is in the detail. A lot of detail here seems very
> > much tailored to a very specific implementation in mind.
> > So thinking through how it will work e.g. for power management would
> > be a good exercise to figure out how it should work in detail.
> 
> It might work in the case where there's no PM support in the transport. E.g for
> MMIO devices.
> 
MMIO should implement PM like the other transports. That brings in the equivalency principle.

> Thanks
> 
> 
> > Parav did you indicate at some point a virtio specific SUSPEND bit can
> > be useful for PM? Could you share how it's better than transport level
> > PM and what the requirements are?

The practical reason for having a suspend bit is that some devices may not be
able to support Immediate_Readiness_on_Return_to_D0.
This is because the context may be too large to finish restoring within 10 msec.

Hence, complex devices rely on a device-specific bit to ensure that the device is ready.

This is why a suspend bit under direct guest control is useful.

This must not be confused with the hypervisor-controlled active/stop/freeze mode that is driven during device migration.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  1:51                         ` Zhu, Lingshan
@ 2023-11-22  6:47                           ` Parav Pandit
  2023-11-22 10:04                             ` Zhu, Lingshan
  2023-11-22  6:49                           ` Michael S. Tsirkin
  1 sibling, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-22  6:47 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, November 22, 2023 7:22 AM
> 
> On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> >>> Lingshan claimed that suspending device is for live migration in commit log
> >>> and in discussion he portray it as some basic facility unrelated to device
> >>> migration such as debug etc.
> >>> Instead of claiming it as some non_device_migration facility does not make
> >>> sense.
> >> It is used for migration for sure.
> > Well having a generic facility to stop device sounds like a nice thing.
> > However the devil is in the detail. A lot of detail here seems very
> > much tailored to a very specific implementation in mind.
> > So thinking through how it will work e.g. for power management would
> > be a good exercise to figure out how it should work in detail.
> > Parav did you indicate at some point a virtio specific SUSPEND bit can
> > be useful for PM? Could you share how it's better than transport level
> > PM and what the requirements are?
> > Thanks!
> Do you mean letting the device enter a new power state when SUSPEND, and
> such description in transport-pci.tex? Then resume normal state on
> DRIVER_OK.

My proposal is:
1. The suspend bit (not state) is controlled by the guest driver.
2. This bit must be of the busy-poll type, meaning:
a. the driver must get acknowledgement from the device that the suspend operation has completed in the device.
b. the driver must get acknowledgement from the device that the resume operation has completed in the device.

3. This is not to be confused with the administrative mode active/stop/freeze set by the owner device during device migration.

4. This feature is usable in the power management use case and possibly some others.
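
For illustration only (this is not part of the proposal text), the busy-poll
handshake in item 2 might look roughly like the C sketch below. The bit value
STATUS_SUSPEND, the simulated device struct sim_dev and all helper names are
assumptions made up for this sketch. Resume is modelled here as clearing the
bit, which is also an assumption and not something the thread has settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_DRIVER_OK 0x04u
#define STATUS_SUSPEND   0x10u  /* hypothetical value, for illustration only */

/* Simulated device: a status write becomes visible only after `latency` reads. */
struct sim_dev {
    uint8_t status;       /* value the driver currently reads back */
    uint8_t pending;      /* value the driver last wrote */
    unsigned busy_reads;  /* reads left before `pending` is reflected */
    unsigned latency;     /* reads the device needs per transition */
};

static void dev_write_status(struct sim_dev *d, uint8_t v)
{
    d->pending = v;
    d->busy_reads = d->latency;
}

static uint8_t dev_read_status(struct sim_dev *d)
{
    if (d->busy_reads > 0)
        d->busy_reads--;          /* device still working on the transition */
    else
        d->status = d->pending;   /* transition done: acknowledge it */
    return d->status;
}

/*
 * Set or clear the suspend bit, then poll until the device reflects the
 * requested value (the acknowledgement) or max_polls reads have been made.
 * Returns true only when the device has acknowledged the operation.
 */
static bool set_suspend(struct sim_dev *d, bool suspend, unsigned max_polls)
{
    uint8_t cur = dev_read_status(d);
    uint8_t want = suspend ? (uint8_t)(cur | STATUS_SUSPEND)
                           : (uint8_t)(cur & ~STATUS_SUSPEND);

    dev_write_status(d, want);
    for (unsigned i = 0; i < max_polls; i++) {
        if (dev_read_status(d) == want)
            return true;
    }
    return false;  /* no acknowledgement within the poll budget */
}
```

A driver would treat a false return as "the device has not completed the
operation yet" and must not assume the device is suspended (or resumed) until
the read-back matches.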

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  1:51                         ` Zhu, Lingshan
  2023-11-22  6:47                           ` Parav Pandit
@ 2023-11-22  6:49                           ` Michael S. Tsirkin
  2023-11-22 10:03                             ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22  6:49 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 09:51:45AM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
> > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
> > > > Instead of claiming it as some non_device_migration facility does not make sense.
> > > It is used for migration for sure.
> > Well having a generic facility to stop device sounds like a nice thing.
> > However the devil is in the detail. A lot of detail here seems very much
> > tailored to a very specific implementation in mind.
> > So thinking through how it will work e.g. for power management
> > would be a good exercise to figure out how it should work in detail.
> > Parav did you indicate at some point a virtio specific SUSPEND
> > bit can be useful for PM? Could you share how it's better than
> > transport level PM and what the requirements are?
> > Thanks!
> Do you mean letting the device enter a new power state when SUSPEND,
> and such description in transport-pci.tex? Then resume normal
> state on DRIVER_OK.

That would be one example.

-- 
MST


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-22  1:32                                         ` Zhu, Lingshan
@ 2023-11-22  6:53                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22  6:53 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 09:32:53AM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/17/2023 6:45 PM, Michael S. Tsirkin wrote:
> > On Fri, Nov 17, 2023 at 06:02:14PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/16/2023 6:21 PM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Thursday, November 16, 2023 3:45 PM
> > > > > 
> > > > > On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Monday, November 13, 2023 2:56 PM
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On 11/10/2023 8:31 PM, Parav Pandit wrote:
> > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > Sent: Friday, November 10, 2023 1:22 PM
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On 11/9/2023 6:25 PM, Parav Pandit wrote:
> > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > Sent: Thursday, November 9, 2023 3:39 PM
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > On 11/9/2023 2:28 PM, Parav Pandit wrote:
> > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > Sent: Tuesday, November 7, 2023 3:02 PM
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > Sent: Monday, November 6, 2023 2:57 PM
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > Sent: Monday, November 6, 2023 9:01 AM
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > > > > > <virtio-comment@lists.oasis- open.org> On Behalf Of Zhu,
> > > > > > > > > > > > > > > > > > > Lingshan
> > > > > > > > > > > > > > > > > > > Sent: Friday, November 3, 2023 8:27 PM
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > > > > > From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > > > > > Sent: Friday, November 3, 2023 4:05 PM
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > This patch adds two new le16 fields to common
> > > > > > > > > > > > > > > > > > > > > configuration structure to support VIRTIO_F_QUEUE_STATE
> > > > > > > > > > > > > > > > > > > > > in PCI transport
> > > > > > > > > layer.
> > > > > > > > > > > > > > > > > > > > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > > > >            transport-pci.tex | 18 ++++++++++++++++++
> > > > > > > > > > > > > > > > > > > > >            1 file changed, 18 insertions(+)
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > diff --git a/transport-pci.tex b/transport-pci.tex
> > > > > > > > > > > > > > > > > > > > > index
> > > > > > > > > > > > > > > > > > > > > a5c6719..3161519 100644
> > > > > > > > > > > > > > > > > > > > > --- a/transport-pci.tex
> > > > > > > > > > > > > > > > > > > > > +++ b/transport-pci.tex
> > > > > > > > > > > > > > > > > > > > > @@ -325,6 +325,10 @@ \subsubsection{Common
> > > > > > > configuration
> > > > > > > > > > > > > > > > > structure
> > > > > > > > > > > > > > > > > > > > > layout}\label{sec:Virtio Transport
> > > > > > > > > > > > > > > > > > > > >                    /* About the administration virtqueue. */
> > > > > > > > > > > > > > > > > > > > >                    le16 admin_queue_index;         /* read-only for
> > > > > driver
> > > > > > > > > */
> > > > > > > > > > > > > > > > > > > > >                    le16 admin_queue_num;         /* read-only for
> > > > > driver
> > > > > > > > > */
> > > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > > +	/* Virtqueue state */
> > > > > > > > > > > > > > > > > > > > > +        le16 queue_avail_state;         /* read-write */
> > > > > > > > > > > > > > > > > > > > > +        le16 queue_used_state;          /* read-write */
> > > > > > > > > > > > > > > > > > > > This tiny interface for 128 virtio net queues through
> > > > > > > > > > > > > > > > > > > > register read writes, does
> > > > > > > > > > > > > > > > > > > not work effectively.
> > > > > > > > > > > > > > > > > > > > There are inflight out of order descriptors for block also.
> > > > > > > > > > > > > > > > > > > > Hence toy registers like this do not work.
> > > > > > > > > > > > > > > > > > > Do you know there is a queue_select? Why this does not
> > > > > work?
> > > > > > > > > > > > > > > > > > > Do you know how other queue related fields work?
> > > > > > > > > > > > > > > > > > :)
> > > > > > > > > > > > > > > > > > Yes. If you notice queue_reset related critical spec bug
> > > > > > > > > > > > > > > > > > fix was done when it
> > > > > > > > > > > > > > > > > was introduced so that live migration can _actually_ work.
> > > > > > > > > > > > > > > > > > When queue_select is done for 128 queues serially, it take
> > > > > > > > > > > > > > > > > > a lot of time to
> > > > > > > > > > > > > > > > > read those slow register interface for this + inflight
> > > > > > > > > > > > > > > > > descriptors +
> > > > > > > > > more.
> > > > > > > > > > > > > > > > > interesting, virtio work in this pattern for many years, right?
> > > > > > > > > > > > > > > > All these years 400Gbps and 800Gbps virtio was not present,
> > > > > > > > > > > > > > > > number of
> > > > > > > > > > > > > > > queues were not in hw.
> > > > > > > > > > > > > > > The registers are control path in config space, how 400G or
> > > > > > > > > > > > > > > 800G
> > > > > > > > > affect??
> > > > > > > > > > > > > > Because those are the one in practice requires large number of VQs.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > You are asking per VQ register commands to modify things
> > > > > > > > > > > > > > dynamically via
> > > > > > > > > > > > > this one vq at a time, serializing all the operations.
> > > > > > > > > > > > > > It does not scale well with high q count.
> > > > > > > > > > > > > This is not dynamically, it only happens when SUSPEND and RESUME.
> > > > > > > > > > > > > This is the same mechanism how virtio initialize a virtqueue,
> > > > > > > > > > > > > working for many years.
> > > > > > > > > > > > No. when virtio driver initializes it for the first time, there
> > > > > > > > > > > > is no active traffic
> > > > > > > > > > > that gets lost.
> > > > > > > > > > > > This is because the interface is not yet up and not part of the
> > > > > > > > > > > > network
> > > > > > > yet.
> > > > > > > > > > > > The resume must be fast enough, because the remote node is
> > > > > > > > > > > > sending
> > > > > > > > > > > packets.
> > > > > > > > > > > > Hence it is different from driver init time queue enable.
> > > > > > > > > > > I am not sure any packets arrive before a link announce at the
> > > > > > > > > > > destination
> > > > > > > > > side.
> > > > > > > > > > I think it can.
> > > > > > > > > > Because there is no notification of member device link down
> > > > > > > > > > intimation to
> > > > > > > > > remote side.
> > > > > > > > > > The L4 and L5 protocols have no knowledge that node which they are
> > > > > > > > > interacting is behind some layers of switches.
> > > > > > > > > > So keeping this time low is desired.
> > > > > > > > > The NIC should broad cast itself first, so that other peers in the
> > > > > > > > > network know(for example its mac to route it) how to send a message to
> > > > > it.
> > > > > > > > > This is necessary, for example VIRTIO_NET_F_GUEST_ANNOUNCE, similar
> > > > > > > > > mechanism work for in-marketing productions for years.
> > > > > > > > > 
> > > > > > > > > This is out of the topic anyway.
> > > > > > > > > > > > > > > See the virtio common cfg, you will find the max number of
> > > > > > > > > > > > > > > vqs is there, num_queues.
> > > > > > > > > > > > > > :)
> > > > > > > > > > > > > > Sure. those values at high q count affects.
> > > > > > > > > > > > > the driver need to initialize them anyway.
> > > > > > > > > > > > That is before the traffic starts from remote end.
> > > > > > > > > > > see above, that needs a link announce and this is after
> > > > > > > > > > > re-initialization
> > > > > > > > > > > > > > > > Device didn’t support LM.
> > > > > > > > > > > > > > > > Many limitations existed all these years and TC is improving
> > > > > > > > > > > > > > > > and expanding
> > > > > > > > > > > > > > > them.
> > > > > > > > > > > > > > > > So all these years do not matter.
> > > > > > > > > > > > > > > Not sure what are you talking about, haven't we initialize
> > > > > > > > > > > > > > > the device and vqs in config space for years?????? What's
> > > > > > > > > > > > > > > wrong with this
> > > > > > > > > > > mechanism?
> > > > > > > > > > > > > > > Are you questioning virito-pci fundamentals???
> > > > > > > > > > > > > > Don’t point to in-efficient past to establish similar in-efficient future.
> > > > > > > > > > > > > interesting, you know this is a one-time thing, right?
> > > > > > > > > > > > > and you are aware of this has been there for years.
> > > > > > > > > > > > > > > > > > > Like how to set a queue size and enable it?
> > > > > > > > > > > > > > > > > > Those are meant to be used before DRIVER_OK stage as they
> > > > > > > > > > > > > > > > > > are init time
> > > > > > > > > > > > > > > > > registers.
> > > > > > > > > > > > > > > > > > Not to keep abusing them..
> > > > > > > > > > > > > > > > > don't you need to set queue_size at the destination side?
> > > > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > > > But the src/dst does not matter.
> > > > > > > > > > > > > > > > Queue_size to be set before DRIVER_OK like rest of the
> > > > > > > > > > > > > > > > registers, as all
> > > > > > > > > > > > > > > queues must be created before the driver_ok phase.
> > > > > > > > > > > > > > > > Queue_reset was last moment exception.
> > > > > > > > > > > > > > > create a queue? Nvidia specific?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Huh. No.
> > > > > > > > > > > > > > Do git log and realize what happened with queue_reset.
> > > > > > > > > > > > > You didn't answer the question, does the spec even has defined
> > > > > > > > > > > > > "create a
> > > > > > > > > > > vq"?
> > > > > > > > > > > > Enabled/created = tomato/tomato when discussing the spec in
> > > > > > > > > > > > non-normative
> > > > > > > > > > > email conversation.
> > > > > > > > > > > > It's irrelevant.
> > > > > > > > > > > Then lets not debate on this enable a vq or create a vq anymore
> > > > > > > > > > > > All I am saying is, when we know the limitations of the
> > > > > > > > > > > > transport and when industry is forwarding to not introduced more
> > > > > > > > > > > > and more on-die register
> > > > > > > > > > > for once in lifetime work of device migration, we just use the
> > > > > > > > > > > optimal command and queue interface that is native to virtio.
> > > > > > > > > > > PCI config space has its own limitations, and admin vq has its
> > > > > > > > > > > advantages, but that does not apply to all use cases.
> > > > > > > > > > > 
> > > > > > > > > > There was a recent work done emulating the SR-IOV cap and allowing
> > > > > > > > > > VM to
> > > > > > > > > enable SR-IOV in [1].
> > > > > > > > > > This is the option I mentioned few weeks ago.
> > > > > > > > > > 
> > > > > > > > > > So with admin commands and admin virtqueues, even nested model
> > > > > > > > > > will work
> > > > > > > > > using [1].
> > > > > > > > > > [1]
> > > > > > > > > > https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offlo
> > > > > > > > > > ad
> > > > > > > > > > -o
> > > > > > > > > > n-virtual-machines.html
> > > > > > > > > We should take this into consideration once it is standardized in
> > > > > > > > > the spec, maybe not now, there can always be many workarounds to
> > > > > > > > > solve one
> > > > > > > problem.
> > > > > > > > Sure, until that point the admin commands are able to suffice the need
> > > > > well.
> > > > > > > > And when the spec changes in transport occurs (if needed), current
> > > > > > > > admin
> > > > > > > command and admin vq also fits very well that will follow above [1].
> > > > > > > we have pointed lots of problems for admin vq based live migration
> > > > > > > proposal, I won't repeat them here
> > > > > > I don’t see any.
> > > > > > Nested is already solved using above.
> > > > > I don't see how, do you mind to work out the patches?
> > > > Once the base series is completed, nested cases can be addressed.
> > > > I wont be able to work on the patches for it until we finish for the first level virtualization.
> > > As you know, nested is supported well in current virtio, so please don't
> > > break it.
> > So for nesting, it seems cleaner to support sending commands through
> > device itself.
> I guess this requires per-VF admin vq or some agents & tricks.

I suggested a gateway in the VF for this. Really more or less like what
you did for write tracking, except using the admin command format.
We'll need a new group type which just includes the device itself.

> > You aren't going to fit VQ state in a 16 bit register in
> > the general case though, and will have to resort to DMA.
> Yes, at least we need in-flight descriptors tracking.
> Still working with Eugenio for this feature.
> > And if you are
> > doing that then please just use the admin command format (does not have
> > to be a VQ) and then we can all make peace finally.
> > 



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-22  4:15                                             ` Jason Wang
@ 2023-11-22  7:15                                               ` Michael S. Tsirkin
  2023-11-22  7:33                                                 ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22  7:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 12:15:44PM +0800, Jason Wang wrote:
> On Wed, Nov 22, 2023 at 12:26 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 10:01 AM
> > >
> > > On Fri, Nov 17, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> 
> [...]
> 
> > > > Sorry, no, the virtio specification does not support device migration today.
> > > > Nothing is broken by adding new features.
> > > >
> > > > Above [1] has the right proposal that Jason's paper pointed out. Please use
> > > it.
> > >
> > > I was involved in the design in [1]. And I don't see a connection to the
> > > discussion here
> > >
> > > 1) It is based on vDPA in L0
> > > 2) It doesn't address the nesting issue, it requires a proper design in the virtio
> > > spec to support migration in the nesting layer.
> >
> > Nothing prevents [1] from being done without vdpa.
> 
> Well, Qemu has SR-IOV emulation for IGB. What's the point of the above
> reply? If a hypervisor wishes, it can be done with any device on L0.
> 
> We discuss the possibility of migrating nested VMs, not the
> possibility of emulating SR-IOV, no?
> 
> Thanks

It depends really. If the way to support nesting is to emulate SR-IOV
then the question becomes how hard it is to emulate SR-IOV to the point
where nesting works.

The fact that current hypervisors pretend that a VF is a PF when
passing things on to the guest and things still seem to work is
nice but it's not a hugely important design point IMHO.

-- 
MST



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  1:41                             ` Zhu, Lingshan
@ 2023-11-22  7:30                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22  7:30 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Parav Pandit, jasowang, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 09:41:10AM +0800, Zhu, Lingshan wrote:
> > > I mean enable 100 queues cost more time then enable 1 vq no matter
> > > how we enable it. that is O(N)
> > Depends on what "that" is. Number of VM exits does not have to be O(N),
> > you can pass these 100 queues in memory.
> For batching, yes. But I don't see this as a problem because we have enabled
> vqs this way for many years, so far so good.

Well, boot time is less constrained than migration time
because it often involves a ton of IO to load guest software.
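
To make the O(N) point above concrete, here is a rough C sketch that counts
register accesses when virtqueue state is read one queue at a time through
queue_select, versus a single access that hands the device a buffer to fill.
The names queue_avail_state and queue_used_state come from the patch under
discussion; the simulated config space, the access counter and the batched
path are hypothetical and only stand in for "pass these queues in memory".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_QUEUES 128

struct vq_state {
    uint16_t avail;  /* stands in for queue_avail_state */
    uint16_t used;   /* stands in for queue_used_state */
};

/* Simulated common config; `accesses` counts register reads/writes,
 * each of which may cost a VM exit when the registers are trapped. */
struct sim_cfg {
    uint16_t queue_select;
    struct vq_state vq[NUM_QUEUES];
    unsigned long accesses;
};

static void cfg_write_select(struct sim_cfg *c, uint16_t q)
{
    c->queue_select = q;
    c->accesses++;
}

static uint16_t cfg_read_avail(struct sim_cfg *c)
{
    c->accesses++;
    return c->vq[c->queue_select].avail;
}

static uint16_t cfg_read_used(struct sim_cfg *c)
{
    c->accesses++;
    return c->vq[c->queue_select].used;
}

/* Per-queue path: select, read avail, read used. Three accesses per queue. */
static void save_via_registers(struct sim_cfg *c, struct vq_state *out, size_t n)
{
    for (size_t q = 0; q < n; q++) {
        cfg_write_select(c, (uint16_t)q);
        out[q].avail = cfg_read_avail(c);
        out[q].used = cfg_read_used(c);
    }
}

/* Hypothetical batched path: one access hands over the buffer, and the
 * copy loop stands in for the device writing the state via DMA. */
static void save_via_buffer(struct sim_cfg *c, struct vq_state *out, size_t n)
{
    c->accesses++;
    for (size_t q = 0; q < n; q++)
        out[q] = c->vq[q];
}
```

With 128 queues the register path makes 384 accesses and the batched path
makes one. The device still does O(N) work either way; only the number of
driver-visible accesses changes.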



> > 
> > 
> > > 
> > > 
> > >                          This should be a basic facility.
> > > 
> > >                      Other transport can also offer like PCI.
> > > 
> > >                  Do you want to work for these transport? Implementing the new features as
> > >                  PCI?
> > > 
> > >              Not presently as PCI as more features than rest of the two.
> > >              What I read about ccw is: " S/390 based virtual machines support neither PCI nor MMIO".
> > > 
> > >              And I also read, "The IBM System/390 is a discontinued mainframe product family implementing".
> > > 
> > >              So I don’t know who needs to extend ccw.
> > >              And if one needs, those maintainers will extend it to match to PCI standard.
> > > 
> > >          So these features are even not planned, so don't depend on them.
> > > 
> > >      But again can one suspend ccw device? If you are adding this feature and
> > >      claiming it's supported for all transports you better find out
> > >      what does it do.
> > > 
> > > I am not an expert on CCW, anything block we suspend a CCW device by this bit?
> > I don't think CCW supports suspend at all.
> I think it is not a transport feature but a device feature,
> the device can always suspend it self, like don't process data
> and stop responding until a specific signal.

If the guest keeps going but the device just stops, then the guest has
a decent chance to crash.

> > 
> > > This seems only controlled by the device itself.
> > > 
> > And? What is the point of suspending only the device if the rest of the system
> > is still going?
> That is an orchestration issue, totally up to the administrators.
> Normally when suspending the device, the guest are very likely
> to be suspended already.

If you need an orchestration system to suspend your laptop or your
phone, things are very bad indeed.

> > 
> > > 
> > >                              In that case if there is suspend the device available, it will be
> > >                              used by the
> > > 
> > >                          guest driver itself, hypervisor wouldn’t know about it when those
> > >                          registers are not trapped.
> > > 
> > >                              So we need two ways to suspend.
> > >                              One is guest visible, and guest controlled.
> > >                              Second is hypervisor control to fulfill the device migration needs.
> > > 
> > >                          The guest can even reset the device.
> > > 
> > >                              So if you can please take a look if the proposed admin command to
> > > 
> > >                          freeze/stop mode can be used in the emulated register case or not.
> > > 
> > >                              It helps to have the suspend bit in guest control as well
> > >                              with/without
> > > 
> > >                          emulation mode.
> > >                          Parav, please believe I have read your series, I didn't comment there
> > >                          because I want to avoid further conflicts/debating, we have done these
> > > 
> > >                  enough.
> > > 
> > >                      I believe the series posted in v3 can support vdpa use case as well.
> > >                      So I will progress to post v4.
> > > 
> > > 
> > >                          As explained before, freeze/stop the device by PCI is a layer violation.
> > > 
> > >                      I am afraid, we have different vision.
> > >                      I don’t see any layer violation.
> > >                      Suspend is enough in the PCI PM.
> > >                      Our vision is more aligned with rest of the hypervisor knobs that owns the
> > > 
> > >                  migration framework.
> > >                  I think I have explained, virtio builds on other transports and it should be
> > >                  self-contained, so far so good.
> > > 
> > >              Virtio without any transport binding is just blank paper discussion.
> > > 
> > >          virtio is built on some transports, but not bind to any.
> > > 
> > >      Binding is an OS specific thing, but e.g. under Linux transport drivers bind to
> > >      devices then virtio drivers bind to virtio bus. No binding -> nothing
> > >      works.
> > > 
> > > I think general facilities are better not only work on a specific transport
> > > 
> > But platform facilities are even better we don't need to work on them at
> > all.
> Yes, so I also agree to track dirty pages by the platform, on-CPU dirty page
> tracking facilities serving all transport, not only PCI.

If available. Both you and Parav should stop just repeating your
preference and start showing some actual info on which systems
support this platform tracking, how recent they have to be,
are all of server/desktop/mobile covered, etc.

> > 
> > 
> > >                          And device status can be pass-through(without emulation, just map it
> > >                          to
> > >                          guest) to the guest or trapped(trap and emulate by the hypervisor,
> > >                          for example set_status in vDPA).
> > > 
> > >                      When it is pass-through, it is controlled by the guest, so for example, if the
> > > 
> > >                  guest resets the device, hypervisor has lost the control of migration context etc.
> > > 
> > >                      Hence, hypervisor needs a channel which is not guest owned.
> > > 
> > >                      Same channel can work when trap+emulation is done.
> > > 
> > >                  It is the guest owns the device, it can reset the device, once reset, the device
> > >                  context are cleared.
> > > 
> > >              Hypervisor do not have the ability to read/write the device context. It lost the channel as hypervisor is not involved in trap+emulation.
> > >              So it is not helpful in one use case.
> > > 
> > >              Admin commands can work even with trap+emulation mode.
> > > 
> > >              What is missing, that should be added?
> > > 
> > >          as explained above, when live migration, the guest should be suspended
> > >          first, at this point,
> > >          the host owns the device, it has access to the device.
> > > 
> > >      Where do you say this in the spec patch?
> > > 
> > > VM live migration is not in this spec.
> > Then it should be.
> > 
> > > If we suspend the device first, then the guest may detect IO errors.
> > > 
> > That's bad. So you need to tell driver what not to do so as not to get
> > errors.
> I think the process should be suspending the guest first, then the host
> owns the device, so it can suspend the guest and collect the necessary data
> for live migration.

So you need to say what is and what is not allowed.
Because in your model, the device has no idea whether it's
the host or the guest accessing it, and so you cannot just say
"don't access the device at all". And this, by the way, is
a big plus in Parav's approach: since the host uses a
distinct channel for state retrieval, it is possible
to simply say "don't access the device" and we are done.

> > 
> > > 
> > >                                  This can also be used for debugging I think.
> > > 
> > >                              As Michael listed, a dedicated debug interface is usually more
> > >                              useful instead
> > > 
> > >                          of in-band.
> > >                          re-using another facility without extra efforts is not a bad thing anyway.
> > > 
> > >                      I just don’t see how a suspend bit some debug feature.
> > >                      Almost everything with that regard is a debug feature to me.
> > > 
> > >                  suspend then check the device states?
> > > 
> > >              You already suspended the device, so device state is already changed.
> > >              All debug information is changed, so not useful now.
> > > 
> > >          When suspended, the device should keep and stabilize its device states,
> > >          at least in my series it should behave like this.
> > > 
> > >      That's vague. What does it mean exactly and what happens if
> > >      some external event causes state change?
> > > 
> > > it is suspended, somehow like powered-down, so it should not
> > > respond to the events until resume.
> > "somehow" is too vague for the spec.
> Yeah, in spec, we have a section to describe what the device should do when
> SUSPEND.
> > 

Maybe I missed it. Generally I think the whole SUSPEND feature
should be in one patch; there's no real reason to split it up.
The queue state thing is a different matter.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-22  7:15                                               ` Michael S. Tsirkin
@ 2023-11-22  7:33                                                 ` Parav Pandit
  2023-11-22 14:43                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Parav Pandit @ 2023-11-22  7:33 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 22, 2023 12:45 PM
> 
> On Wed, Nov 22, 2023 at 12:15:44PM +0800, Jason Wang wrote:
> > On Wed, Nov 22, 2023 at 12:26 AM Parav Pandit <parav@nvidia.com>
> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 10:01 AM
> > > >
> > > > On Fri, Nov 17, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > >
> > > > >
> >
> > [...]
> >
> > > > > Sorry, no virtio specification does not support device migration today.
> > > > > Nothing is broken by adding new features.
> > > > >
> > > > > Above [1] has the right proposal that Jason's paper pointed out.
> > > > > Please use
> > > > it.
> > > >
> > > > I was involved in the design in [1]. And I don't see a connection
> > > > to the dicussion here
> > > >
> > > > 1) It is based on vDPA in L0
> > > > 2) It doesn't address the nesting issue, it requires a proper
> > > > design in the virtio spec to support migration in the nesting layer.
> > >
> > > Nothing prevents [1] to be done without vdpa.
> >
> > Well, Qemu has SR-IOV emulation for IGB. What's the point of the above
> > reply? If a hypervisor wishes, it can be done with any device on L0.
> >
> > We discuss the possibility of migrating nested VMs, not the
> > possibility of emulating SR-IOV, no?
> >
> > Thanks
> 
> It depends really. If the way to support nesting is to emulate sr-iov then the
> question becomes how hard it is to emulate sriov to the point where nesting
> works.
> 
> The fact that current hypervisors pretend that a VF is a PF when passing things
> on to guest and things still seem to work is nice but it's not a hugely important
> design point IMHO.

Nesting is not a virtio problem: if nested PCI devices are needed, the industry will work on that anyway, and virtio will be able to leverage it.
We should focus on the two practical and immediate cases, vfio and vdpa, for the first-level guest.
Admin commands plus an admin queue on the owner device seem to suffice for both use cases.
We should progress towards this and unblock users so they can start using it.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  6:49                           ` Michael S. Tsirkin
@ 2023-11-22 10:03                             ` Zhu, Lingshan
  2023-11-22 13:37                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-22 10:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Parav Pandit, eperezma, cohuck, stefanha, virtio-comment



On 11/22/2023 2:49 PM, Michael S. Tsirkin wrote:
> On Wed, Nov 22, 2023 at 09:51:45AM +0800, Zhu, Lingshan wrote:
>>
>> On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
>>>>> Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
>>>>> Instead of claiming it as some non_device_migration facility does not make sense.
>>>> It is used for migration for sure.
>>> Well having a generic facility to stop device sounds like a nice thing.
>>> However the devil is in the detail. A lot of detail here seems very much
>>> tailored to a very specific implementation in mind.
>>> So thinking through how it will work e.g. for power management
>>> would be a good excercise to figure out how it should work in detail.
>>> Parav did you indicate at some point a virtio specific SUSPEND
>>> bit can be useful for PM? Could you share how it's better than
>>> transport level PM and what the requirements are?
>>> Thanks!
>> Do you mean letting the device enter a new power state when SUSPEND,
>> and such description in transport-pci.tex? Then resume normal
>> state on DRIVER_OK.
> That would be one example.
OK, I will look into this.
Roughly, say in transport-pci.tex: when SUSPEND is set, the device MAY
enter a power-saving state.

Please allow me some time to work on the details.
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  6:47                           ` Parav Pandit
@ 2023-11-22 10:04                             ` Zhu, Lingshan
  2023-11-22 10:14                               ` Parav Pandit
  0 siblings, 1 reply; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-22 10:04 UTC (permalink / raw)
  To: Parav Pandit, Michael S. Tsirkin, Jason Wang
  Cc: eperezma, cohuck, stefanha, virtio-comment



On 11/22/2023 2:47 PM, Parav Pandit wrote:
>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
>> Sent: Wednesday, November 22, 2023 7:22 AM
>>
>> On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
>>> On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
>>>>> Lingshan claimed that suspending device is for live migration in commit log
>> and in discussion he portray it as some basic facility unrelated to device
>> migration such as debug etc.
>>>>> Instead of claiming it as some non_device_migration facility does not make
>> sense.
>>>> It is used for migration for sure.
>>> Well having a generic facility to stop device sounds like a nice thing.
>>> However the devil is in the detail. A lot of detail here seems very
>>> much tailored to a very specific implementation in mind.
>>> So thinking through how it will work e.g. for power management would
>>> be a good excercise to figure out how it should work in detail.
>>> Parav did you indicate at some point a virtio specific SUSPEND bit can
>>> be useful for PM? Could you share how it's better than transport level
>>> PM and what the requirements are?
>>> Thanks!
>> Do you mean letting the device enter a new power state when SUSPEND, and
>> such description in transport-pci.tex? Then resume normal state on
>> DRIVER_OK.
> My proposal is,
> 1. suspend bit (not state) to be controlled by the guest driver
The guest can set the SUSPEND bit for sure. However, there is no reason to
forbid the host from setting SUSPEND.

The reason is that the live migration process should be transparent to the
guest. That means the guest should be suspended first, and then the host
suspends the device, so there are no I/O errors in the guest.
> 2. this bit must be busy poll type. Meaning,
This is already in the patch: the driver should re-read the device status
to make sure the SUSPEND bit is set.
> a. the driver must get acknowledgement from the device that the suspend operation is completed in the device.
> b. the driver must get acknowledgement from the device that the resume operation is completed in the device.
Yes, by re-reading. For example, the device should only present the SUSPEND
bit in the device status once it has finished the suspend process; this is
already in the patch series.
>
> 3. Not to confuse this with administrative mode active/stop/freeze set by the owner device during device migration
>
> 4. This feature is usable in power management use case and may be some other.
Yes, as long as we use virtio status to control PM, rather than the reverse.
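
To make the re-read handshake in point 2 concrete, here is a minimal sketch
in C. It is not taken from the patch series: the SUSPEND bit value, the toy
device model (struct fake_dev) and the helper names are all assumptions made
for illustration. The only point it shows is that the driver writes SUSPEND
and then polls the device status until the device itself presents the bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* DRIVER_OK is an existing device status bit in the virtio spec. */
#define VIRTIO_STATUS_DRIVER_OK 0x04u
/* Assumption: the real bit position for SUSPEND may differ. */
#define VIRTIO_STATUS_SUSPEND   0x10u

/* Hypothetical toy device: needs `busy_reads` status reads to finish suspending. */
struct fake_dev {
    uint8_t status;
    bool suspend_requested;
    unsigned busy_reads;
};

/* Writing SUSPEND only latches the request; the bit is not visible yet. */
static void dev_write_status(struct fake_dev *d, uint8_t v)
{
    if (v & VIRTIO_STATUS_SUSPEND)
        d->suspend_requested = true;
    d->status = (uint8_t)((v & ~VIRTIO_STATUS_SUSPEND) |
                          (d->status & VIRTIO_STATUS_SUSPEND));
}

/* The device presents SUSPEND only after it has finished suspending. */
static uint8_t dev_read_status(struct fake_dev *d)
{
    if (d->suspend_requested && !(d->status & VIRTIO_STATUS_SUSPEND)) {
        if (d->busy_reads == 0)
            d->status |= VIRTIO_STATUS_SUSPEND;
        else
            d->busy_reads--;
    }
    return d->status;
}

/* Driver side: request suspend, then re-read until acknowledged or give up. */
static bool driver_suspend(struct fake_dev *d, unsigned max_polls)
{
    dev_write_status(d, (uint8_t)(dev_read_status(d) | VIRTIO_STATUS_SUSPEND));
    for (unsigned i = 0; i < max_polls; i++) {
        if (dev_read_status(d) & VIRTIO_STATUS_SUSPEND)
            return true;
    }
    return false;
}
```

The poll limit is only there so the sketch cannot spin forever; whether a
real driver should time out, and what it does then, is not settled in this
thread.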




^ permalink raw reply	[flat|nested] 186+ messages in thread

* RE: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22 10:04                             ` Zhu, Lingshan
@ 2023-11-22 10:14                               ` Parav Pandit
  0 siblings, 0 replies; 186+ messages in thread
From: Parav Pandit @ 2023-11-22 10:14 UTC (permalink / raw)
  To: Zhu, Lingshan, Michael S. Tsirkin, Jason Wang
  Cc: eperezma, cohuck, stefanha, virtio-comment


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, November 22, 2023 3:34 PM
> 
> On 11/22/2023 2:47 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Wednesday, November 22, 2023 7:22 AM
> >>
> >> On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
> >>> On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> >>>>> Lingshan claimed that suspending device is for live migration in
> >>>>> commit log
> >> and in discussion he portray it as some basic facility unrelated to
> >> device migration such as debug etc.
> >>>>> Instead of claiming it as some non_device_migration facility does
> >>>>> not make
> >> sense.
> >>>> It is used for migration for sure.
> >>> Well having a generic facility to stop device sounds like a nice thing.
> >>> However the devil is in the detail. A lot of detail here seems very
> >>> much tailored to a very specific implementation in mind.
> >>> So thinking through how it will work e.g. for power management would
> >>> be a good excercise to figure out how it should work in detail.
> >>> Parav did you indicate at some point a virtio specific SUSPEND bit
> >>> can be useful for PM? Could you share how it's better than transport
> >>> level PM and what the requirements are?
> >>> Thanks!
> >> Do you mean letting the device enter a new power state when SUSPEND,
> >> and such description in transport-pci.tex? Then resume normal state
> >> on DRIVER_OK.
> > My proposal is,
> > 1. suspend bit (not state) to be controlled by the guest driver
> The guest can set SUSPEND bit for sure. However there is no reason to forbid
> host from setting SUSPEND.
Sure. The hypervisor can also touch all the bits it has access to.

> 
> The reason is, the live migration process should be transparent to the guest,
> that means the guest should be suspended first, then the host suspend the
> device, so no I/O errors in the guest.
I explained this yesterday as best I could: in my proposal the hypervisor must also access the device while the guest is NOT suspended.
This helps lower the downtime, which is a base tenet.

(You can think of it as a pre-copy phase for the device context, just as there is a pre-copy phase for memory.)

So suspending the guest vCPUs before the device is a smaller case of what my proposal covers.
And it is fine.

I am OK with a guest-controlled suspend bit that may optionally be used by the hypervisor.
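
As a rough illustration of the pre-copy idea above, the following C sketch
models it with plain arithmetic. It is a toy under stated assumptions, not
the admin command interface: the function name, the parameters and the
"bytes dirtied per round" model are all hypothetical. It only shows why
reading device context while the guest is still running can shrink what is
left to copy once the device is finally stopped.

```c
#include <stddef.h>

/* Hypothetical result of a pre-copy run over the device context. */
struct precopy_result {
    unsigned rounds;     /* rounds done while the guest keeps running */
    size_t final_bytes;  /* bytes still to copy after the device is stopped */
};

/*
 * Toy model: each round copies everything outstanding, and the running
 * guest then dirties up to `dirtied_per_round` bytes of context again.
 * Pre-copy stops once the outstanding amount fits the downtime budget
 * or the round limit is reached.
 */
static struct precopy_result precopy_device_context(size_t total_bytes,
                                                    size_t dirtied_per_round,
                                                    size_t downtime_budget_bytes,
                                                    unsigned max_rounds)
{
    struct precopy_result r = { 0, total_bytes };

    while (r.final_bytes > downtime_budget_bytes && r.rounds < max_rounds) {
        /* The guest cannot dirty more context than exists. */
        r.final_bytes = dirtied_per_round < total_bytes ? dirtied_per_round
                                                        : total_bytes;
        r.rounds++;
    }
    return r;
}
```

In this model, suspending the guest vCPUs first corresponds to zero
pre-copy rounds: the whole context is copied during downtime.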

> > 2. this bit must be busy poll type. Meaning,
> This is already in the patch, the driver should re-read to make sure the
> SUSPEND bit is set.
> > a. the driver must get acknowledgement from the device that the suspend
> operation is completed in the device.
> > b. the driver must get acknowledgement from the device that the resume
> operation is completed in the device.
> Yes, by re-read, for example the device should only present SUSPEND bit in the
> device status when it finished the process to suspend, already in the patch
> series.
> >
> > 3. Not to confuse this with administrative mode active/stop/freeze set
> > by the owner device during device migration
> >
> > 4. This feature is usable in power management use case and may be some
> other.
> Yes, as long as we use virtio status to control PM, rather than the reverse.

PCI PM is already there. The extra optional suspend bit is an aid to those virtio devices that cannot meet the 10 msec deadline.

^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22 10:03                             ` Zhu, Lingshan
@ 2023-11-22 13:37                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22 13:37 UTC (permalink / raw)
  To: Zhu, Lingshan
  Cc: Jason Wang, Parav Pandit, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 06:03:57PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/22/2023 2:49 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 22, 2023 at 09:51:45AM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 11/22/2023 5:18 AM, Michael S. Tsirkin wrote:
> > > > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > > > Lingshan claimed that suspending device is for live migration in commit log and in discussion he portray it as some basic facility unrelated to device migration such as debug etc.
> > > > > > Instead of claiming it as some non_device_migration facility does not make sense.
> > > > > It is used for migration for sure.
> > > > Well having a generic facility to stop device sounds like a nice thing.
> > > > However the devil is in the detail. A lot of detail here seems very much
> > > > tailored to a very specific implementation in mind.
> > > > So thinking through how it will work e.g. for power management
> > > > would be a good excercise to figure out how it should work in detail.
> > > > Parav did you indicate at some point a virtio specific SUSPEND
> > > > bit can be useful for PM? Could you share how it's better than
> > > > transport level PM and what the requirements are?
> > > > Thanks!
> > > Do you mean letting the device enter a new power state when SUSPEND,
> > > and such description in transport-pci.tex? Then resume normal
> > > state on DRIVER_OK.
> > That would be one example.
> OK, I will look into this.
> Roughly say in transport-pci.tex: When SUSPEND, the device MAY optionally
> enter a
> power-saving state.
> 
> Please allow me some time working on the details

Just this is likely not sufficient. I am not 100% sure what is wanted
here but let me try to guess - unlike normal suspend, this special
state is actually controlled by hypervisor.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
  2023-11-22  7:33                                                 ` Parav Pandit
@ 2023-11-22 14:43                                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-22 14:43 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Jason Wang, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Wed, Nov 22, 2023 at 07:33:41AM +0000, Parav Pandit wrote:
> Nesting is not a virtio problem

Really, a virtio problem is whatever virtio TC members are willing to
tackle. I support work on nesting to exactly the same level as I support
work on platforms without memory change tracking. Both are
non-universal, platform-specific problems that platforms might address
in the future.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  6:32                           ` Parav Pandit
@ 2023-11-24  3:25                             ` Jason Wang
  2023-11-24  6:20                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-24  3:25 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Michael S. Tsirkin, Zhu, Lingshan, eperezma, cohuck, stefanha,
	virtio-comment

On Wed, Nov 22, 2023 at 2:32 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 10:59 AM
> >
> > On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > >
> > > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > > Lingshan claimed that suspending device is for live migration in commit
> > log and in discussion he portray it as some basic facility unrelated to device
> > migration such as debug etc.
> > > > > Instead of claiming it as some non_device_migration facility does not
> > make sense.
> > > >
> > > > It is used for migration for sure.
> > >
> > > Well having a generic facility to stop device sounds like a nice thing.
> > > However the devil is in the detail. A lot of detail here seems very
> > > much tailored to a very specific implementation in mind.
> > > So thinking through how it will work e.g. for power management would
> > > be a good excercise to figure out how it should work in detail.
> >
> > It might work in the case where there's no PM support in the transport. E.g for
> > MMIO devices.
> >
> MMIO should implement PM like other transport. That brings the equivalency principle.
>

MMIO devices are usually platform devices. I don't see the point.

Thanks




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-22  6:11                           ` Parav Pandit
@ 2023-11-24  3:35                             ` Jason Wang
  2023-11-24  9:04                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-24  3:35 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Zhu, Lingshan, Michael S. Tsirkin, eperezma, cohuck, stefanha,
	virtio-comment

On Wed, Nov 22, 2023 at 2:11 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 10:58 AM
> >
> > On Wed, Nov 22, 2023 at 12:32 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 1:03 PM
> > > >
> > > > On Thu, Nov 16, 2023 at 1:27 PM Parav Pandit <parav@nvidia.com>
> > wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Thursday, November 16, 2023 9:50 AM
> > > > > >
> > > > > > On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com>
> > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Monday, November 13, 2023 9:05 AM
> > > > > > > >
> > > > > > > > On Thu, Nov 9, 2023 at 6:16 PM Parav Pandit
> > > > > > > > <parav@nvidia.com>
> > > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > Sent: Thursday, November 9, 2023 3:28 PM
> > > > > > > > > >
> > > > > > > > > > On 11/9/2023 1:46 AM, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Tue, Nov 07, 2023 at 05:27:23PM +0800, Zhu,
> > > > > > > > > > > Lingshan
> > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On 11/6/2023 5:49 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > >>> On Fri, Nov 03, 2023 at 06:34:34PM +0800, Zhu
> > > > > > > > > > >>> Lingshan
> > > > wrote:
> > > > > > > > > > >>>> When SUSPEND is set, device states and virtqueue
> > > > > > > > > > >>>> states should be stablized, therefore the driver
> > > > > > > > > > >>>> should not reset vqs when SUSPEND is set in device status.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Signed-off-by: Zhu Lingshan
> > > > > > > > > > >>>> <lingshan.zhu@intel.com>
> > > > > > > > > > >>>> ---
> > > > > > > > > > >>>>    content.tex | 3 +++
> > > > > > > > > > >>>>    1 file changed, 3 insertions(+)
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> diff --git a/content.tex b/content.tex index
> > > > > > > > > > >>>> bcc9d4b..060b5c2
> > > > > > > > > > >>>> 100644
> > > > > > > > > > >>>> --- a/content.tex
> > > > > > > > > > >>>> +++ b/content.tex
> > > > > > > > > > >>>> @@ -444,6 +444,9 @@ \subsubsection{Virtqueue
> > > > > > > > > > >>>> Reset}\label{sec:Basic
> > > > > > > > > > Facilities of a Virtio Device /
> > > > > > > > > > >>>>    The device MUST reset any state of a virtqueue
> > > > > > > > > > >>>> to the default
> > > > > > state,
> > > > > > > > > > >>>>    including the available state and the used state.
> > > > > > > > > > >>>> +If VIRTIO_F_SUSPEND is negotiated and SUSPEND is
> > > > > > > > > > >>>> +set in \field{device status}, the driver SHOULD
> > > > > > > > > > >>>> +NOT reset any
> > > > > > virtqueues.
> > > > > > > > > > >>>> +
> > > > > > > > > > >>>>    \drivernormative{\paragraph}{Virtqueue
> > > > > > > > > > >>>> Reset}{Basic Facilities of a
> > > > > > > > > > Virtio Device / Virtqueues / Virtqueue Reset / Virtqueue
> > > > > > > > > > Reset}
> > > > > > > > > > >>>>    After the driver tells the device to reset a
> > > > > > > > > > >>>> queue, the driver MUST verify that
> > > > > > > > > > >>> Seems somewhat arbitrary and breaks the claim that
> > > > > > > > > > >>> the feature is orthogonal and can have uses besides
> > migration.
> > > > > > > > > > >> when suspended, the device is frozen.
> > > > > > > > > > >> The driver is aware of this process and so should not
> > > > > > > > > > >> reset the vqs I
> > > > > > think.
> > > > > > > > > > > Again that is only true because you want to use it for migration.
> > > > > > > > > > > But then you can't claim it's a generic facility.
> > > > > > > > > > I don't get it. The device status is a basic facility.
> > > > > > > > > >
> > > > > > > > > > We need to SUSPEND the device by setting SUSPEND bit, to
> > > > > > > > > > stabilize the device states for migration.
> > > > > > > > > Is the PCI's PM time not enough to suspend the device?
> > > > > > > >
> > > > > > > > Are you saying we don't need virtio reset assuming we had FLR?
> > > > > > > >
> > > > > > > No. often FLR timing is not enough. Hence every PCI level
> > > > > > > device has some
> > > > > > sort of its own reset mechanism.
> > > > > > >
> > > > > > > > Suspending at different layers like rest at different layers.
> > > > > > > >
> > > > > > > > We have both FLR and virtio reset. The Virtio level function
> > > > > > > > could be reset without FLR. So did suspend.
> > > > > > > >
> > > > > > > > That's it.
> > > > > > > Sure, but wrapping it under some "basic facility" is just does
> > > > > > > not make
> > > > sense.
> > > > > >
> > > > > > Why, device status (e.g reset) belongs to that part.
> > > > > >
> > > > > Lingshan claimed that suspending device is for live migration in
> > > > > commit log
> > > > and in discussion he portray it as some basic facility unrelated to
> > > > device migration such as debug etc.
> > > > > Instead of claiming it as some non_device_migration facility does
> > > > > not make
> > > > sense.
> > > >
> > > > It is used for migration for sure.
> > > This is why it is not working when device is directly mapped.
> >
> > We circle back. It works for the case of trap/emulation.
> >
> > For direct mapping:
> >
> > You claim guest reset can work but suspend can't?
> No. guest reset to be done by the guest.

What reset did you mean here?

> Suspend for PM also to be done by guest.

PM is transport specific.

We have both FLR and virtio reset. Likewise, we have both transport PM and suspend.

>
> The hypervisor will have 2 modes, stop and freeze as admin operation for device migration flow without telling the guest driver about it.

What prevents stop/freeze from working here?

>
> >
> > > The hypervisor messing this bit and guest is also doing power management
> > with it.
> >
> > So I don't see how it differs from: virtio reset and FLR is under the control of
> > guests.
> >
> And suspend for power management too under control of the guest.
>
> > >
> > > Both of them needs separate channel to do their own work.
> > >
> > > >
> > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > And if you want to rule P2P behaviours, PCI PM is really the
> > > > > > > > correct way to go instead of trying to do it at the virtio layer.
> > > > > > > >
> > > > > > > PCI PM is supposed to be controlled by the guest and so the suspend.
> > > > > >
> > > > > > I've listed issues about D3cold and others, I can't believe it
> > > > > > can't be controlled totally by guests.
> > > > > >
> > > > > D3cold is not controlled by the driver as defined by the PCI spec
> > > > > hence it is
> > > > not applicable.
> > > >
> > > > Have you seen the link I give you? Even if you are right, there
> > > > still could be such a request from the firmware, no?
> > > I may have missed the link.
> > > You have 10 replies, so it is easy to miss important things in rest of the
> > comments.
> >
> > I meant there still could be D3cold requests from the guest via virtual
> > firmware.
> >
> So it will deliver PME.

So what do you want to say here?

>
> > So it's not necessarily related to the guest driver.
> >
> > >
> > > >
> > > > > D3hot is controlled by the driver.
> > > >
> > > > So, it requires the device context to be preserved, which is not
> > > > documented in your patch.
> > > PCI PM interactions is covered in v4 in the device requirements section.
> > >
> > > >
> > > > > > >
> > > > > > > Hypervisor needs its channel to suspend the device, as
> > > > > > > fundamentally guest is
> > > > > > unaware of device migration flow.
> > > > > >
> > > > > > That's pretty fine, the hypervisor also needs its channel to
> > > > > > reset the device. If you think there's a conflict with suspend,
> > > > > > there should be one
> > > > for reset as well.
> > > > > >
> > > > > I don’t see a need for hypervisor to reset the device in passthrough
> > mode.
> > > > Can you explain why is it needed?
> > > >
> > > > Qemu has a command "system_reset".
> > > >
> > > I mean, what does this translate to reset the device in passthrough mode?
> >
> > It needs to reset the virtio device.
> >
> > > If this is FLR, it is there.
> >
> > Please explain how it works. (It's not only a FLR, it also need virtio level reset)
> >
> FLR obviously covers the virtio level reset as FLR covers the PCI + virtio reset.

Which part of the spec says this? And at least Qemu is not implemented
in this way.

And there's another conflict, you said FLR is under the control of guests ...

>
> > >
> > > > > Do you mean, it is needed in vdpa mode? If yes, the registers are
> > > > > emulated
> > > > anyway, so why the member device's native channel cannot be used in
> > > > vdpa mode?
> > > > >
> > > > > > >
> > > > > > > > > For large device I could imagine it could be short.
> > > > > > > > >
> > > > > > > > > In that case if there is suspend the device available, it
> > > > > > > > > will be used by the guest
> > > > > > > > driver itself, hypervisor wouldn’t know about it when those
> > > > > > > > registers are not trapped.
> > > > > > > > > So we need two ways to suspend.
> > > > > > > > > One is guest visible, and guest controlled.
> > > > > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > > > > >
> > > > > > > > Can you explain why suspend is special but not reset or why
> > > > > > > > reset can work but not suspend? If reset can work, so does
> > > > > > > > suspend. If reset can't, neither does suspend.
> > > > > > > >
> > > > > > > As long as reset and suspend both are under guest control, I am fine.
> > > > > >
> > > > > > Well, you seem to ignore my question below. Hypervisor needs to
> > > > > > reset the device as well.
> > > > > >
> > > > > Why is it needed in passthrough mode?
> > > > >
> > > > > > >
> > > > > > > > For example, can you explain how a system_reset in Qemu can
> > > > > > > > work with your proposal?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > So if you can please take a look if the proposed admin
> > > > > > > > > command to
> > > > > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > > > > >
> > > > > > > > Again, if you design those for PCI, it's a layer violation.
> > > > > > > > You have answered
> > > > > > > They are used by the PCI layer, just like your suspend bit.
> > > > > > > > Any other transport can also use it.
> > > > > > >
> > > > > > > > yourself that PM is the right way to go.
> > > > > > > >
> > > > > > > > > It helps to have the suspend bit in guest control as well
> > > > > > > > > with/without
> > > > > > > > emulation mode.
> > > > > > > >
> > > > > > > > I won't repeat it again. You will find you need a full
> > > > > > > > transport to satisfy all the requirements.
> > > > > > > I disagree for full transport.
> > > > > >
> > > > > > See above and the discussion in another thread.
> > > > > >
> > > > > > > If you want to discuss transport, for sure it is some other
> > > > > > > thread and I want to see "driver notifications via such
> > > > > > > transport VQ" to fully qualify it
> > > > > > as transport, And that would be just sub-optimal for actual working.
> > > > > >
> > > > > > Sub-optimal since the function is duplicated with a transport
> > > > > > but it doesn't claim or design as a transport.
> > > > > >
> > > > > It is not sub-optimal because of duplication. It is because you
> > > > > want to
> > > > transport notifications via virtqueue.
> > > >
> > > > Have you ever read the series of tvq? You won't get this conclusion
> > > > if you do that.
> > > >
> > > I have read those 4 patches and I have seen that transportvq do not want to
> > transport notifications.
> > > Hence it does not qualify as transport vq.
> >
> > It exposes the platform MMIO area for driver notification. This is sufficient.
> > Any issue you see?
> Yes, the issue is, it is not transporting the driver notifications.
> Hence, it is not a transport virtqueue.

Please read how driver/device notification is done in the spec for
existing transports.

>
> >
> > >
> > > Frankly, transport vq seems a way to formalize mediation forever in virtio.
> >
> > Nope, it can be accessed by a guest driver directly.
> >
> > > It is very weird way to build new SIOV device.
> > > For most things it should be the direct channel that virtio has already from
> > driver to the device.
> >
> > See above. SIOV might require a new transport or not.
> >
> It depends on the performance tests that Lingshan will show at scale.
>
> > >
> > >
> > > > >
> > > > > > > And hence, I wouldn’t call it a transport anymore.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > This can also be used for debugging I think.
> > > > > > > > >
> > > > > > > > > As Michael listed, a dedicated debug interface is usually
> > > > > > > > > more useful instead
> > > > > > > > of in-band.
> > > > > > > >
> > > > > > > > Well, I've shown you the in-band facilities like debugging
> > > > > > > > via ethtool and kernel has a lot of other ones. If you have
> > > > > > > > ever tried to debug in a real production environment, you
> > > > > > > > will find how useful such handy information is where out-of-
> > > > > > > > band facilities are often dangerous
> > > > > > and usually prohibited or even unsupported.
> > > > > > > Guest driver can always read and write the device status
> > > > > > > without adding a
> > > > > > suspend bit.
> > > > > >
> > > > > > I don't get here. Suspend make sure the device state is frozen
> > > > > > which helps for debugging for sure.
> > > > > You wanted to debug some vq live, you suspend the device, the vq
> > > > > state got
> > > > changed.
> > > > >
> > > > > I just don’t see that suspend is a debug tool.
> > > >
> > > > It's not a tool, it's a function that can be used as a debug tool.
> > > >
> > > > > Every feature is a debug feature literally.
> > > > > Classic heisenbug effect.
> > > > >
> > > > > > One can change driver notification frequency to see if interrupt
> > > > > > rate
> > > > > changed for debugging.
> > > > > > One can disable a few RQs and see RSS...
> > > > > Blk can change blk_size to higher value to perf debug..
> > > > > The list continues..
> > > >
> > > > Let's not shift concepts.
> > > >
> > > Your comment to attribute device migration as debug feature is actually
> > shifting the concept.
> >
> > It's not.
> >
> > Ling Shan put it in the basic facilities as part of device status. You wonder why,
> > we explained it can be used beyond migration. You asked where, we told you
> > for example things like debugging. We never claim it can only be used in debug.
> > Then you shift the concept to say debug could be achieved by a lot of other
> > facilities. For sure this is correct, but it doesn't have any relationship with the
> > discussion here.
> >
> I don’t see wasting time here.
> If its debug, its debug.
> If its migration, it is migration.
> If its pm, its pm.

Obviously not. Migration should leverage existing facilities as much
as possible instead of duplicating them.

Thanks


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/


^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  3:25                             ` Jason Wang
@ 2023-11-24  6:20                               ` Michael S. Tsirkin
  2023-11-24  6:28                                 ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24  6:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 11:25:41AM +0800, Jason Wang wrote:
> On Wed, Nov 22, 2023 at 2:32 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, November 22, 2023 10:59 AM
> > >
> > > On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
> > > wrote:
> > > >
> > > > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > > > Lingshan claimed that suspending device is for live migration in commit
> > > log and in discussion he portray it as some basic facility unrelated to device
> > > migration such as debug etc.
> > > > > > Instead of claiming it as some non_device_migration facility does not
> > > make sense.
> > > > >
> > > > > It is used for migration for sure.
> > > >
> > > > Well having a generic facility to stop device sounds like a nice thing.
> > > > However the devil is in the detail. A lot of detail here seems very
> > > > much tailored to a very specific implementation in mind.
> > > > So thinking through how it will work e.g. for power management would
> > > > > be a good exercise to figure out how it should work in detail.
> > >
> > > It might work in the case where there's no PM support in the transport. E.g for
> > > MMIO devices.
> > >
> > MMIO should implement PM like other transport. That brings the equivalency principle.
> >
> 
> MMIO are usually platform devices. I don't see the point.
> 
> Thanks

I don't understand what you are saying. Why does it make sense to
suspend individual platform devices when they are suspended
with the whole platform?

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  6:20                               ` Michael S. Tsirkin
@ 2023-11-24  6:28                                 ` Jason Wang
  2023-11-24  6:43                                   ` Zhu, Lingshan
  2023-11-24  8:50                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 186+ messages in thread
From: Jason Wang @ 2023-11-24  6:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 2:20 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 11:25:41AM +0800, Jason Wang wrote:
> > On Wed, Nov 22, 2023 at 2:32 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Wednesday, November 22, 2023 10:59 AM
> > > >
> > > > On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
> > > > wrote:
> > > > >
> > > > > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > > > > Lingshan claimed that suspending device is for live migration in commit
> > > > log and in discussion he portray it as some basic facility unrelated to device
> > > > migration such as debug etc.
> > > > > > > Instead of claiming it as some non_device_migration facility does not
> > > > make sense.
> > > > > >
> > > > > > It is used for migration for sure.
> > > > >
> > > > > Well having a generic facility to stop device sounds like a nice thing.
> > > > > However the devil is in the detail. A lot of detail here seems very
> > > > > much tailored to a very specific implementation in mind.
> > > > > So thinking through how it will work e.g. for power management would
> > > > > > be a good exercise to figure out how it should work in detail.
> > > >
> > > > It might work in the case where there's no PM support in the transport. E.g for
> > > > MMIO devices.
> > > >
> > > MMIO should implement PM like other transport. That brings the equivalency principle.
> > >
> >
> > MMIO are usually platform devices. I don't see the point.
> >
> > Thanks
>
> I don't understand what you are saying. Why does it make sense to
> suspend individual platform devices when they are suspended
> with the whole platform?

It is because we don't need to suspend the whole platform to migrate
the virtio-MMIO device.
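
As a rough illustration of per-device suspend (a toy model, not spec text): the sketch below sets a SUSPEND bit in the device status of a single device and then reads back a stable last_avail_idx, without touching anything else on the platform. The value chosen for SUSPEND and the names ToyMmioDevice, process_one_buffer and suspend_and_snapshot are hypothetical; only the four existing status bits are taken from the virtio spec.

```python
# Device status bits defined by the virtio spec.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
# Hypothetical value for the proposed SUSPEND bit (an assumption for this sketch).
SUSPEND = 16


class ToyMmioDevice:
    """Toy stand-in for one virtio-MMIO device; purely illustrative."""

    def __init__(self):
        self.status = ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK
        self.last_avail_idx = 0

    def write_status(self, value):
        self.status = value

    def process_one_buffer(self):
        # A suspended device stops consuming buffers, so its state is stable.
        if self.status & SUSPEND:
            return False
        self.last_avail_idx = (self.last_avail_idx + 1) & 0xFFFF
        return True


def suspend_and_snapshot(dev):
    # Suspend only this device, then capture the state a migration would need.
    dev.write_status(dev.status | SUSPEND)
    return {"status": dev.status, "last_avail_idx": dev.last_avail_idx}
```

In this model the rest of the platform keeps running; only the one device stops advancing its virtqueue index.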

Thanks

>
> --
> MST
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  6:28                                 ` Jason Wang
@ 2023-11-24  6:43                                   ` Zhu, Lingshan
  2023-11-24  8:50                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-24  6:43 UTC (permalink / raw)
  To: Jason Wang, Michael S. Tsirkin
  Cc: Parav Pandit, eperezma, cohuck, stefanha, virtio-comment



On 11/24/2023 2:28 PM, Jason Wang wrote:
> On Fri, Nov 24, 2023 at 2:20 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Nov 24, 2023 at 11:25:41AM +0800, Jason Wang wrote:
>>> On Wed, Nov 22, 2023 at 2:32 PM Parav Pandit <parav@nvidia.com> wrote:
>>>>
>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>> Sent: Wednesday, November 22, 2023 10:59 AM
>>>>>
>>>>> On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
>>>>> wrote:
>>>>>> On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
>>>>>>>> Lingshan claimed that suspending device is for live migration in commit
>>>>> log and in discussion he portray it as some basic facility unrelated to device
>>>>> migration such as debug etc.
>>>>>>>> Instead of claiming it as some non_device_migration facility does not
>>>>> make sense.
>>>>>>> It is used for migration for sure.
>>>>>> Well having a generic facility to stop device sounds like a nice thing.
>>>>>> However the devil is in the detail. A lot of detail here seems very
>>>>>> much tailored to a very specific implementation in mind.
>>>>>> So thinking through how it will work e.g. for power management would
>>>>>> be a good exercise to figure out how it should work in detail.
>>>>> It might work in the case where there's no PM support in the transport. E.g for
>>>>> MMIO devices.
>>>>>
>>>> MMIO should implement PM like other transport. That brings the equivalency principle.
>>>>
>>> MMIO are usually platform devices. I don't see the point.
>>>
>>> Thanks
>> I don't understand what you are saying. Why does it make sense to
>> suspend individual platform devices when they are suspended
>> with the whole platform?
> It is because we don't need to suspend the whole platform to migrate
> the virtio-MMIO device.
I agree, we should only need to suspend the device.
>
> Thanks
>
>> --
>> MST
>>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  6:28                                 ` Jason Wang
  2023-11-24  6:43                                   ` Zhu, Lingshan
@ 2023-11-24  8:50                                   ` Michael S. Tsirkin
  2023-11-24 11:51                                     ` Jason Wang
  1 sibling, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24  8:50 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 02:28:44PM +0800, Jason Wang wrote:
> On Fri, Nov 24, 2023 at 2:20 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Nov 24, 2023 at 11:25:41AM +0800, Jason Wang wrote:
> > > On Wed, Nov 22, 2023 at 2:32 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Wednesday, November 22, 2023 10:59 AM
> > > > >
> > > > > On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > > > > > Lingshan claimed that suspending device is for live migration in commit
> > > > > log and in discussion he portray it as some basic facility unrelated to device
> > > > > migration such as debug etc.
> > > > > > > > Instead of claiming it as some non_device_migration facility does not
> > > > > make sense.
> > > > > > >
> > > > > > > It is used for migration for sure.
> > > > > >
> > > > > > Well having a generic facility to stop device sounds like a nice thing.
> > > > > > However the devil is in the detail. A lot of detail here seems very
> > > > > > much tailored to a very specific implementation in mind.
> > > > > > So thinking through how it will work e.g. for power management would
> > > > > > > be a good exercise to figure out how it should work in detail.
> > > > >
> > > > > It might work in the case where there's no PM support in the transport. E.g for
> > > > > MMIO devices.
> > > > >
> > > > MMIO should implement PM like other transport. That brings the equivalency principle.
> > > >
> > >
> > > MMIO are usually platform devices. I don't see the point.
> > >
> > > Thanks
> >
> > I don't understand what you are saying. Why does it make sense to
> > suspend individual platform devices when they are suspended
> > with the whole platform?
> 
> It is because we don't need to suspend the whole platform to migrate
> the virtio-MMIO device.
> 
> Thanks

We were talking about uses beyond migration.


> >
> > --
> > MST
> >




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  3:35                             ` Jason Wang
@ 2023-11-24  9:04                               ` Michael S. Tsirkin
  2023-11-24 11:50                                 ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24  9:04 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 11:35:59AM +0800, Jason Wang wrote:
> > > > > > I don’t see a need for hypervisor to reset the device in passthrough
> > > mode.
> > > > > Can you explain why is it needed?
> > > > >
> > > > > Qemu has a command "system_reset".
> > > > >
> > > > I mean, what does this translate to reset the device in passthrough mode?
> > >
> > > It needs to reset the virtio device.
> > >
> > > > If this is FLR, it is there.
> > >
> > > Please explain how it works. (It's not only a FLR, it also need virtio level reset)
> > >
> > FLR obviously covers the virtio level reset as FLR covers the PCI + virtio reset.
> 
> Which part of the spec says this? And at least Qemu is not implemented
> in this way.

PCI spec says this. And yes I believe qemu will fully reset the function
on FLR.
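
A minimal sketch of that point, assuming a toy model rather than QEMU's actual code: since FLR resets the whole function, the virtio device status held by that function is cleared together with the PCI-level state. The class name VirtioPciFunction and its fields are hypothetical; the status bit values are the ones defined in the virtio spec.

```python
from dataclasses import dataclass

# Device status bits defined by the virtio spec.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8


@dataclass
class VirtioPciFunction:
    """Toy model of a virtio PCI function; not any real implementation."""
    pci_command: int = 0    # stands in for PCI-level function state
    device_status: int = 0  # virtio-level device status field

    def virtio_reset(self) -> None:
        # Writing 0 to the device status resets the virtio device.
        self.device_status = 0

    def flr(self) -> None:
        # Function Level Reset is function-wide: in this model it clears
        # the PCI state and the virtio state that lives in the function.
        self.pci_command = 0
        self.virtio_reset()
```

In this model a virtio reset leaves pci_command alone, while flr() clears both, which is the sense in which FLR covers the virtio-level reset.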


> And there's another conflict, you said FLR is under the control of guests ...
> 
> >
> > > >
> > > > > > Do you mean, it is needed in vdpa mode? If yes, the registers are
> > > > > > emulated
> > > > > anyway, so why the member device's native channel cannot be used in
> > > > > vdpa mode?
> > > > > >
> > > > > > > >
> > > > > > > > > > For large device I could imagine it could be short.
> > > > > > > > > >
> > > > > > > > > > In that case if there is suspend the device available, it
> > > > > > > > > > will be used by the guest
> > > > > > > > > driver itself, hypervisor wouldn’t know about it when those
> > > > > > > > > registers are not trapped.
> > > > > > > > > > So we need two ways to suspend.
> > > > > > > > > > One is guest visible, and guest controlled.
> > > > > > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > > > > > >
> > > > > > > > > Can you explain why suspend is special but not reset or why
> > > > > > > > > reset can work but not suspend? If reset can work, so does
> > > > > > > > > suspend. If reset can't, neither does suspend.
> > > > > > > > >
> > > > > > > > As long as reset and suspend both are under guest control, I am fine.
> > > > > > >
> > > > > > > Well, you seem to ignore my question below. Hypervisor needs to
> > > > > > > reset the device as well.
> > > > > > >
> > > > > > Why is it needed in passthrough mode?
> > > > > >
> > > > > > > >
> > > > > > > > > For example, can you explain how a system_reset in Qemu can
> > > > > > > > > work with your proposal?
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > So if you can please take a look if the proposed admin
> > > > > > > > > > command to
> > > > > > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > > > > > >
> > > > > > > > > Again, if you design those for PCI, it's a layer violation.
> > > > > > > > > You have answered
> > > > > > > > They are used by the PCI layer, just like your suspend bit.
> > > > > > > > > Any other transport can also use it.
> > > > > > > >
> > > > > > > > > yourself that PM is the right way to go.
> > > > > > > > >
> > > > > > > > > > It helps to have the suspend bit in guest control as well
> > > > > > > > > > with/without
> > > > > > > > > emulation mode.
> > > > > > > > >
> > > > > > > > > I won't repeat it again. You will find you need a full
> > > > > > > > > transport to satisfy all the requirements.
> > > > > > > > I disagree for full transport.
> > > > > > >
> > > > > > > See above and the discussion in another thread.
> > > > > > >
> > > > > > > > If you want to discuss transport, for sure it is some other
> > > > > > > > thread and I want to see "driver notifications via such
> > > > > > > > transport VQ" to fully qualify it
> > > > > > > as transport, And that would be just sub-optimal for actual working.
> > > > > > >
> > > > > > > Sub-optimal since the function is duplicated with a transport
> > > > > > > but it doesn't claim or design as a transport.
> > > > > > >
> > > > > > It is not sub-optimal because of duplication. It is because you
> > > > > > want to
> > > > > transport notifications via virtqueue.
> > > > >
> > > > > Have you ever read the series of tvq? You won't get this conclusion
> > > > > if you do that.
> > > > >
> > > > I have read those 4 patches and I have seen that transportvq do not want to
> > > transport notifications.
> > > > Hence it does not qualify as transport vq.
> > >
> > > It exposes the platform MMIO area for driver notification. This is sufficient.
> > > Any issue you see?
> > Yes, the issue is, it is not transporting the driver notifications.
> > Hence, it is not a transport virtqueue.
> 
> Please read how driver/device notification is done in the spec for
> existing transports.

Yea Parav I don't really see what you are driving at here.
But the bigger problem with tvq is that it was supposed to
add a new group type and be reworked on top of admin command
infrastructure and it never was.

> >
> > >
> > > >
> > > > Frankly, transport vq seems a way to formalize mediation forever in virtio.
> > >
> > > Nope, it can be accessed by a guest driver directly.
> > >
> > > > It is very weird way to build new SIOV device.
> > > > For most things it should be the direct channel that virtio has already from
> > > driver to the device.
> > >
> > > See above. SIOV might require a new transport or not.
> > >
> > It depends on the performance tests that Lingshan will show at scale.
> >
> > > >
> > > >
> > > > > >
> > > > > > > > And hence, I wouldn’t call it a transport anymore.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > This can also be used for debugging I think.
> > > > > > > > > >
> > > > > > > > > > As Michael listed, a dedicated debug interface is usually
> > > > > > > > > > more useful instead
> > > > > > > > > of in-band.
> > > > > > > > >
> > > > > > > > > Well, I've shown you the in-band facilities like debugging
> > > > > > > > > via ethtool and kernel has a lot of other ones. If you have
> > > > > > > > > ever tried to debug in a real production environment, you
> > > > > > > > > will find how useful such handy information is where out-of-
> > > > > > > > > band facilities are often dangerous
> > > > > > > and usually prohibited or even unsupported.
> > > > > > > > Guest driver can always read and write the device status
> > > > > > > > without adding a
> > > > > > > suspend bit.
> > > > > > >
> > > > > > > I don't get here. Suspend make sure the device state is frozen
> > > > > > > which helps for debugging for sure.
> > > > > > You wanted to debug some vq live, you suspend the device, the vq
> > > > > > state got
> > > > > changed.
> > > > > >
> > > > > > I just don’t see that suspend is a debug tool.
> > > > >
> > > > > It's not a tool, it's a function that can be used as a debug tool.
> > > > >
> > > > > > Every feature is a debug feature literally.
> > > > > > Classic heisenbug effect.
> > > > > >
> > > > > > > One can change driver notification frequency to see if interrupt
> > > > > > > rate
> > > > > > changed for debugging.
> > > > > > > One can disable a few RQs and see RSS...
> > > > > > Blk can change blk_size to higher value to perf debug..
> > > > > > The list continues..
> > > > >
> > > > > Let's not shift concepts.
> > > > >
> > > > Your comment to attribute device migration as debug feature is actually
> > > shifting the concept.
> > >
> > > It's not.
> > >
> > > Ling Shan put it in the basic facilities as part of device status. You wonder why,
> > > we explained it can be used beyond migration. You asked where, we told you
> > > for example things like debugging. We never claim it can only be used in debug.
> > > Then you shift the concept to say debug could be achieved by a lot of other
> > > facilities. For sure this is correct, but it doesn't have any relationship with the
> > > discussion here.

I frankly don't see how a bit which is completely non-orthogonal with
device and driver state can be useful for debug.
For debug you want something that just always works.
Not an interface that has so many requirements it will break if
you look at it sidewise.


> > >
> > I don’t see wasting time here.
> > If its debug, its debug.
> > If its migration, it is migration.
> > If its pm, its pm.
> 
> Obviously not. Migration should leverage existing facilities as much
> as possible instead of duplicating them.

I don't think there's anything obvious here.  A lot of device state
can't be easily accessed with existing facilities. The logical
continuation of your reasoning would be to add state introspection
commands e.g. to cvq in virtio net and then use tricks like shadow vq to
issue these.  Yea, possible, but can we not go there please?  Nothing is
wrong with just building commands that do exactly what we want them to
do instead of trying to build a ship in a bottle.

-- 
MST




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  9:04                               ` Michael S. Tsirkin
@ 2023-11-24 11:50                                 ` Jason Wang
  2023-11-24 12:17                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-24 11:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 5:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 11:35:59AM +0800, Jason Wang wrote:
> > > > > > > I don’t see a need for hypervisor to reset the device in passthrough
> > > > mode.
> > > > > > Can you explain why is it needed?
> > > > > >
> > > > > > Qemu has a command "system_reset".
> > > > > >
> > > > > I mean, what does this translate to reset the device in passthrough mode?
> > > >
> > > > It needs to reset the virtio device.
> > > >
> > > > > If this is FLR, it is there.
> > > >
> > > > Please explain how it works. (It's not only a FLR, it also need virtio level reset)
> > > >
> > > FLR obviously covers the virtio level reset as FLR covers the PCI + virtio reset.
> >
> > Which part of the spec says this? And at least Qemu is not implemented
> > in this way.
>
> PCI spec says this.
> And yes I believe qemu will fully reset the function
> on FLR.

OK, I had another glance at the spec and the code. It works like this.

>
>
> > And there's another conflict, you said FLR is under the control of guests ...
> >
> > >
> > > > >
> > > > > > > Do you mean, it is needed in vdpa mode? If yes, the registers are
> > > > > > > emulated
> > > > > > anyway, so why the member device's native channel cannot be used in
> > > > > > vdpa mode?
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > For large device I could imagine it could be short.
> > > > > > > > > > >
> > > > > > > > > > > In that case if there is suspend the device available, it
> > > > > > > > > > > will be used by the guest
> > > > > > > > > > driver itself, hypervisor wouldn’t know about it when those
> > > > > > > > > > registers are not trapped.
> > > > > > > > > > > So we need two ways to suspend.
> > > > > > > > > > > One is guest visible, and guest controlled.
> > > > > > > > > > > Second is hypervisor control to fulfill the device migration needs.
> > > > > > > > > >
> > > > > > > > > > Can you explain why suspend is special but not reset or why
> > > > > > > > > > reset can work but not suspend? If reset can work, so does
> > > > > > > > > > suspend. If reset can't, neither does suspend.
> > > > > > > > > >
> > > > > > > > > As long as reset and suspend both are under guest control, I am fine.
> > > > > > > >
> > > > > > > > Well, you seem to ignore my question below. Hypervisor needs to
> > > > > > > > reset the device as well.
> > > > > > > >
> > > > > > > Why is it needed in passthrough mode?
> > > > > > >
> > > > > > > > >
> > > > > > > > > > For example, can you explain how a system_reset in Qemu can
> > > > > > > > > > work with your proposal?
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > So if you can please take a look if the proposed admin
> > > > > > > > > > > command to
> > > > > > > > > > freeze/stop mode can be used in the emulated register case or not.
> > > > > > > > > >
> > > > > > > > > > Again, if you design those for PCI, it's a layer violation.
> > > > > > > > > > You have answered
> > > > > > > > > They are used by the PCI layer, just like your suspend bit.
> > > > > > > > > Andy other transport can also use it.
> > > > > > > > >
> > > > > > > > > > yourself that PM is the right way to go.
> > > > > > > > > >
> > > > > > > > > > > It helps to have the suspend bit in guest control as well
> > > > > > > > > > > with/without
> > > > > > > > > > emulation mode.
> > > > > > > > > >
> > > > > > > > > > I won't repeat it again. You will find you need a full
> > > > > > > > > > transport to satisfy all the requirements.
> > > > > > > > > I disagree for full transport.
> > > > > > > >
> > > > > > > > See above and the discussion in another thread.
> > > > > > > >
> > > > > > > > > If you want to get discuss transport for sure it is some other
> > > > > > > > > thread and I want to see "driver notifications via such
> > > > > > > > > transport VQ" to fully qualify it
> > > > > > > > as transport, And that would be just sub-optimal for actual working.
> > > > > > > >
> > > > > > > > Sub-optimal since the function is duplicated with a transport
> > > > > > > > but it doesn't claim or design as a transport.
> > > > > > > >
> > > > > > > It is not sub-optimal because of duplication. It is because you
> > > > > > > want to
> > > > > > transport notifications via virtqueue.
> > > > > >
> > > > > > Have you ever read the series of tvq? You won't get this conclusion
> > > > > > if you do that.
> > > > > >
> > > > > I have read those 4 patches and I have seen that transportvq do not want to
> > > > transport notifications.
> > > > > Hence it does not qualify as transport vq.
> > > >
> > > > It exposes the platform MMIO area for driver notification. This is sufficient.
> > > > Any issue you see?
> > > Yes, the issue is, it is not transporting the driver notifications.
> > > Hence, it is not a transport virtqueue.
> >
> > Please read how driver/device notification is done in the spec for
> > existing transports.
>
> Yea Parav I don't really see what you are driving at here.
> But the bigger problem with tvq is that it was supposed to
> add a new group type and be reworked on top of admin command
> infrastructure and it never was.
>
> > >
> > > >
> > > > >
> > > > > Frankly, transport vq seems a way to formalize mediation forever in virtio.
> > > >
> > > > Nope, it can be accessed by a guest driver directly.
> > > >
> > > > > It is very weird way to build new SIOV device.
> > > > > For most things it should be the direct channel that virtio has already from
> > > > driver to the device.
> > > >
> > > > See above. SIOV might require a new transport or not.
> > > >
> > > It depends on the performance tests that Lingshan will show at scale.
> > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > > And hence, I wouldn’t call it a transport anymore.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > This can also be used for debugging I think.
> > > > > > > > > > >
> > > > > > > > > > > As Michael listed, a dedicated debug interface is usually
> > > > > > > > > > > more useful instead
> > > > > > > > > > of in-band.
> > > > > > > > > >
> > > > > > > > > > Well, I've shown you the in-band facilities like debugging
> > > > > > > > > > via ethtool and kernel has a lot of other ones. If you have
> > > > > > > > > > ever tried to debug in a real production environment, you
> > > > > > > > > > will find how useful such handy information is where out-of-
> > > > > > > > > > band facilities are often dangerous
> > > > > > > > and usually prohibited or even unsupported.
> > > > > > > > > Guest driver can always read and write the device status
> > > > > > > > > without adding a
> > > > > > > > suspend bit.
> > > > > > > >
> > > > > > > > I don't get here. Suspend make sure the device state is frozen
> > > > > > > > which helps for debugging for sure.
> > > > > > > You wanted to debug some vq live, you suspend the device, the vq
> > > > > > > state got
> > > > > > changed.
> > > > > > >
> > > > > > > I just don’t see that suspend is a debug tool.
> > > > > >
> > > > > > It's not a tool, it's a function that can be used as a debug tool.
> > > > > >
> > > > > > > Every feature is a debug feature literally.
> > > > > > > Classic heisenbug effect.
> > > > > > >
> > > > > > > Once can change driver notification frequency to see if interrupt
> > > > > > > rate
> > > > > > changed for debugging.
> > > > > > > One can disabled few RQs and see RSS...
> > > > > > > Blk can change blk_size to higher value to perf debug..
> > > > > > > The list continues..
> > > > > >
> > > > > > Let's not shift concepts.
> > > > > >
> > > > > Your comment to attribute device migration as debug feature is actually
> > > > shifting the concept.
> > > >
> > > > It's not.
> > > >
> > > > Ling Shan put it in the basic facilities as part of device status. You wonder why,
> > > > we explained it can be used beyond migration. You asked where, we told you
> > > > for example things like debugging. We never claim it can only be used in debug.
> > > > Then you shift the concept to say debug could be achieved by a lot of other
> > > > facilities. For sure this is correct, but it doesn't have any relationship with the
> > > > discussion here.
>
> I frankly don't see how a bit which is completely non-orthogonal with
> device and driver state can be useful for debug.

The bit is there to make sure the state of a device doesn't change. It
may or may not help, just as when you pause a CPU or process while
debugging.
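
A minimal sketch of that freezing behaviour, for illustration only. The
first four status values are the existing device status bits; the SUSPEND
value and the ToyDevice class are assumptions and are not taken from the
patch:

```python
# Illustrative sketch only; not taken from the patch series.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
SUSPEND = 16  # ASSUMED value for the proposed bit; check the patch itself


class ToyDevice:
    """Hypothetical device model with a single virtqueue index."""

    def __init__(self):
        self.status = ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK
        self.last_avail_idx = 0

    def suspended(self):
        return bool(self.status & SUSPEND)

    def process_one(self):
        """Consume one available buffer unless the device is suspended."""
        if self.suspended():
            return False  # state must not change while suspended
        self.last_avail_idx = (self.last_avail_idx + 1) & 0xFFFF  # 16-bit wrap
        return True


dev = ToyDevice()
dev.process_one()       # running: the index advances
dev.status |= SUSPEND   # driver sets the (assumed) SUSPEND bit
dev.process_one()       # suspended: the index stays where it was
```

Once the bit is set, whoever reads last_avail_idx sees a stable value,
which is the property both migration and a paused-device debug session
would rely on.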

> For debug you want something that just always works.
> Not an interface that has so many requirements it will break if
> you look at it sidewise.
>
>
> > > >
> > > I don’t see wasting time here.
> > > If its debug, its debug.
> > > If its migration, it is migration.
> > > If its pm, its pm.
> >
> > Obviously not. Migration should leverage existing facilities as much
> > as possible instead of duplicating them.
>
> I don't think there's anything obvious here.  A lot of device state
> can't be easily accessed with existing facilities.

Then we can invent new things.

> The logical
> continuation of your reasoning would be to add state introspection
> commands e.g. to cvq in virtio net and then use tricks like shadow vq to
> issue these.

For the device state, yes. Because it is device logic and it is not
what platform or transport can know.

> Yea, possible, but can we not go there please?  Nothing is
> wrong with just building commands that do exactly what we want them to
> do instead of trying to build a ship in a bottle.

But it's not the case for others.

E.g. Parav's proposal tries to rule P2P behaviour via virtio admin
commands, which is both a duplication and a layer violation.

Thanks


>
> --
> MST
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24  8:50                                   ` Michael S. Tsirkin
@ 2023-11-24 11:51                                     ` Jason Wang
  0 siblings, 0 replies; 186+ messages in thread
From: Jason Wang @ 2023-11-24 11:51 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 4:51 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 02:28:44PM +0800, Jason Wang wrote:
> > On Fri, Nov 24, 2023 at 2:20 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Nov 24, 2023 at 11:25:41AM +0800, Jason Wang wrote:
> > > > On Wed, Nov 22, 2023 at 2:32 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Wednesday, November 22, 2023 10:59 AM
> > > > > >
> > > > > > On Wed, Nov 22, 2023 at 5:18 AM Michael S. Tsirkin <mst@redhat.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > On Tue, Nov 21, 2023 at 03:33:07PM +0800, Jason Wang wrote:
> > > > > > > > > Lingshan claimed that suspending device is for live migration in commit
> > > > > > log and in discussion he portray it as some basic facility unrelated to device
> > > > > > migration such as debug etc.
> > > > > > > > > Instead of claiming it as some non_device_migration facility does not
> > > > > > make sense.
> > > > > > > >
> > > > > > > > It is used for migration for sure.
> > > > > > >
> > > > > > > Well having a generic facility to stop device sounds like a nice thing.
> > > > > > > However the devil is in the detail. A lot of detail here seems very
> > > > > > > much tailored to a very specific implementation in mind.
> > > > > > > So thinking through how it will work e.g. for power management would
> > > > > > > be a good excercise to figure out how it should work in detail.
> > > > > >
> > > > > > It might work in the case where there's no PM support in the transport. E.g for
> > > > > > MMIO devices.
> > > > > >
> > > > > MMIO should implement PM like other transport. That brings the equivalency principle.
> > > > >
> > > >
> > > > MMIO are usually platform devices. I don't see the point.
> > > >
> > > > Thanks
> > >
> > > I don't understand what you are saying. Why does it make sense to
> > > suspend individual platform devices when they are suspended
> > > with the whole platform?
> >
> > It is because we don't need to suspend the whole platform to migrate
> > the virtio-MMIO device.
> >
> > Thanks
>
> We were talking about uses beyond migration.

For example, vm stop/cont.

Thanks

>
>
> > >
> > > --
> > > MST
> > >
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24 11:50                                 ` Jason Wang
@ 2023-11-24 12:17                                   ` Michael S. Tsirkin
  2023-11-24 13:01                                     ` Jason Wang
  0 siblings, 1 reply; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24 12:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 07:50:42PM +0800, Jason Wang wrote:
> > I frankly don't see how a bit which is completely non-orthogonal with
> > device and driver state can be useful for debug.
> 
> The bit is to make sure the state of a device doesn't change. It may
> help or not just like if you want to pause a cpu/process during the
> debug.

Heh. But it would be much better to have an orthogonal state
that the driver can just set without worrying much about
the device being broken somehow.


> > For debug you want something that just always works.
> > Not an interface that has so many requirements it will break if
> > you look at it sidewise.
> >
> >
> > > > >
> > > > I don’t see wasting time here.
> > > > If its debug, its debug.
> > > > If its migration, it is migration.
> > > > If its pm, its pm.
> > >
> > > Obviously not. Migration should leverage existing facilities as much
> > > as possible instead of duplicating them.
> >
> > I don't think there's anything obvious here.  A lot of device state
> > can't be easily accessed with existing facilities.
> 
> Then we can invent new things.

So the approach this patchset takes is a single interface for
all state introspection, with some duplication of functionality
for the sake of consistency. You don't like it, fine, but
it is not obvious that it's a bad thing. It's a tradeoff.

> > The logical
> > continuation of your reasoning would be to add state introspection
> > commands e.g. to cvq in virtio net and then use tricks like shadow vq to
> > issue these.
> 
> For the device state, yes. Because it is device logic and it is not
> what platform or transport can know.

Exactly as I thought. I don't think shadow VQ is something we
can reasonably make a single migration mechanism, though.
It feels fragile and heavyweight. It's more of a workaround
for hardware limitations.


> > Yea, possible, but can we not go there please?  Nothing is
> > wrong with just building commands that do exactly what we want them to
> > do instead of trying to build a ship in a bottle.
> 
> But it's not the case for others.
> 
> E.g in Parav's proposal, it tries to rule P2P behaviour via virtio
> admin commands when there is a duplication

Yes, there's some duplication. The advantage is consistency.  I actually
suggested ways to reduce duplication, by using transport offsets as
tags.  Finding the right balance means we all need to stop going to
extremes. I wish you and Lingshan would stop trying to force everyone
to use registers and Parav would stop trying to force DMA.

> and a layer violation.

layers are only good if they make sense.

> Thanks
> 
> 
> >
> > --
> > MST
> >




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24 12:17                                   ` Michael S. Tsirkin
@ 2023-11-24 13:01                                     ` Jason Wang
  2023-11-24 14:45                                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-24 13:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 07:50:42PM +0800, Jason Wang wrote:
> > > I frankly don't see how a bit which is completely non-orthogonal with
> > > device and driver state can be useful for debug.
> >
> > The bit is to make sure the state of a device doesn't change. It may
> > help or not just like if you want to pause a cpu/process during the
> > debug.
>
> Heh. But it would be much better to have an orthogonal state
> that driver can just set without worrying much about
> device being broken somehow.

In that case we still need a precise definition of the state. So I
don't see the difference here.

Anyhow, we can leave debugging aside.

>
>
> > > For debug you want something that just always works.
> > > Not an interface that has so many requirements it will break if
> > > you look at it sidewise.
> > >
> > >
> > > > > >
> > > > > I don’t see wasting time here.
> > > > > If its debug, its debug.
> > > > > If its migration, it is migration.
> > > > > If its pm, its pm.
> > > >
> > > > Obviously not. Migration should leverage existing facilities as much
> > > > as possible instead of duplicating them.
> > >
> > > I don't think there's anything obvious here.  A lot of device state
> > > can't be easily accessed with existing facilities.
> >
> > Then we can invent new things.
>
> So the approach this patchset takes is a single interface for
> all state introspection. Some duplication of functionality
> for the sake of consistency.
> You don't like it fine but

The (partial) duplication should be fine as long as it doesn't have
any issue, but I do see a lot of issues. So I can't say I like it.

> there is nothing obvious that it's a bad thing. It's a tradeoff.
>
> > > The logical
> > > continuation of your reasoning would be to add state introspection
> > > commands e.g. to cvq in virtio net and then use tricks like shadow vq to
> > > issue these.
> >
> > For the device state, yes. Because it is device logic and it is not
> > what platform or transport can know.
>
> Exactly as I thought. Don't think shadow VQ is something we
> can reasonably make a single migration mechanism though.

That's my understanding as well. It's up to the hypervisor; the spec
needs to focus on mechanism, not policy.

> It feels fragile and heavyweight. It's more of a work
> around hardware limitations.
>
>
> > > Yea, possible, but can we not go there please?  Nothing is
> > > wrong with just building commands that do exactly what we want them to
> > > do instead of trying to build a ship in a bottle.
> >
> > But it's not the case for others.
> >
> > E.g in Parav's proposal, it tries to rule P2P behaviour via virtio
> > admin commands when there is a duplication
>
> Yes there's some duplication. Advantage is consistency.

Just to clarify, we do need things like suspend but I don't see
dealing with PCI P2P in virtio as consistent (e.g we don't imply any
PCI stuff in virtio reset).

>  I actually
> suggested ways to reduce duplication, by using transport offsets as
> tags.

I don't see a connection with P2P here. Or I may miss something.

> Finding a right balance means we all need to stop going to
> extremes, I wish you and lingshan would stop trying to force everyone
> to use registers and parav would stop trying to force dma.

There is a misunderstanding. Actually, I don't want to force
registers. Which kind of interface is the best way to go is really
implementation specific. In some implementations, registers are cheap
but virtqueues are not; in others, virtqueues are cheap but registers
are not. They are all fine. But we can't claim that one proposal
optimized for a specific implementation is the best way. What I
want is to not limit the facilities that are used for live
migration to any specific transport. I think Ling Shan agreed with me
on this part. It needs to work on all interfaces, whether
registers, DMA, CMA or others. That's why I suggest we focus on what
needs to be migrated first. That is, define the following things:

1) way to suspend/resume a device
2) virtqueue states, indices or inflight ones
3) device states

It is just like how we define virtqueues/features/status and other
basic facilities where we do not tie it to any specific interfaces
like DMA, CMA, registers or admin commands. Virtio benefited from the
flexibility like this in the past so why not stick to that?

After this, each transport can choose to implement it in its own
way. For example, if there's a proposal to use those via the admin
virtqueue, that's fine. Registers are also fine, and so are other
transports. That's why only the last patch of this series deals with
the PCI-specific part.
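
The split between facility and transport could be sketched roughly as
below. This is only an illustration: the class and function names are
hypothetical, the two "transports" are toys, and the restore ordering is
an assumption. Only last_avail_idx and last_used_idx come from the cover
letter.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VqState:
    """Per-virtqueue state named in the cover letter."""
    last_avail_idx: int = 0
    last_used_idx: int = 0


class MigrationFacility(ABC):
    """Transport-independent operations (hypothetical interface)."""

    @abstractmethod
    def suspend(self): ...

    @abstractmethod
    def resume(self): ...

    @abstractmethod
    def get_vq_state(self, vq): ...

    @abstractmethod
    def set_vq_state(self, vq, state): ...


class RegisterBacked(MigrationFacility):
    """Toy transport: every operation is a plain register access."""

    def __init__(self, num_vqs):
        self.regs = {"suspend": 0}
        for vq in range(num_vqs):
            self.regs[("avail", vq)] = 0
            self.regs[("used", vq)] = 0

    def suspend(self):
        self.regs["suspend"] = 1

    def resume(self):
        self.regs["suspend"] = 0

    def get_vq_state(self, vq):
        return VqState(self.regs[("avail", vq)], self.regs[("used", vq)])

    def set_vq_state(self, vq, state):
        self.regs[("avail", vq)] = state.last_avail_idx
        self.regs[("used", vq)] = state.last_used_idx


class CommandBacked(MigrationFacility):
    """Toy transport: every operation is a command sent to one handler."""

    def __init__(self, num_vqs):
        self.suspended = False
        self.vqs = [VqState() for _ in range(num_vqs)]
        self.sent = []  # record of issued opcodes

    def _command(self, opcode, vq=None, state=None):
        self.sent.append(opcode)
        if opcode == "SUSPEND":
            self.suspended = True
        elif opcode == "RESUME":
            self.suspended = False
        elif opcode == "SET_VQ_STATE":
            self.vqs[vq] = state
        elif opcode == "GET_VQ_STATE":
            return self.vqs[vq]
        return None

    def suspend(self):
        self._command("SUSPEND")

    def resume(self):
        self._command("RESUME")

    def get_vq_state(self, vq):
        return self._command("GET_VQ_STATE", vq=vq)

    def set_vq_state(self, vq, state):
        self._command("SET_VQ_STATE", vq=vq, state=state)


def save_vq_states(dev, num_vqs):
    """Source side: suspend first so the indices are stable, then read."""
    dev.suspend()
    return [dev.get_vq_state(vq) for vq in range(num_vqs)]


def restore_vq_states(dev, states):
    """Destination side: load the indices, then let the device run."""
    for vq, state in enumerate(states):
        dev.set_vq_state(vq, state)
    dev.resume()
```

save_vq_states() and restore_vq_states() only touch the abstract
interface, so the same hypervisor-side logic runs against either toy
transport.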

>
> > and a layer violation.
>
> layers are only good if they make sense.

Case by case, for example, PCI PM has defined the state and the
interaction with P2P. Reusing that seems much cleaner than inventing a
mechanism in the virtio layer. Or if it needs a side channel, it needs
to be invented in PCI not virtio.

Thanks

>
> > Thanks
> >
> >
> > >
> > > --
> > > MST
> > >
>




^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24 13:01                                     ` Jason Wang
@ 2023-11-24 14:45                                       ` Michael S. Tsirkin
  2023-11-27  6:38                                         ` Jason Wang
  2023-11-27  9:54                                         ` Zhu, Lingshan
  0 siblings, 2 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-24 14:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 09:01:53PM +0800, Jason Wang wrote:
> On Fri, Nov 24, 2023 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Nov 24, 2023 at 07:50:42PM +0800, Jason Wang wrote:
> > > > I frankly don't see how a bit which is completely non-orthogonal with
> > > > device and driver state can be useful for debug.
> > >
> > > The bit is to make sure the state of a device doesn't change. It may
> > > help or not just like if you want to pause a cpu/process during the
> > > debug.
> >
> > Heh. But it would be much better to have an orthogonal state
> > that driver can just set without worrying much about
> > device being broken somehow.
> 
> In that case we still need a precise definition of the state. So I
> don't see the difference here.

You don't? If you can inspect state without perturbing it, that is
better for debugging than an intrusive interface that perturbs state.


> Anyhow, we can leave debugging aside.
> 
> >
> >
> > > > For debug you want something that just always works.
> > > > Not an interface that has so many requirements it will break if
> > > > you look at it sidewise.
> > > >
> > > >
> > > > > > >
> > > > > > I don’t see wasting time here.
> > > > > > If its debug, its debug.
> > > > > > If its migration, it is migration.
> > > > > > If its pm, its pm.
> > > > >
> > > > > Obviously not. Migration should leverage existing facilities as much
> > > > > as possible instead of duplicating them.
> > > >
> > > > I don't think there's anything obvious here.  A lot of device state
> > > > can't be easily accessed with existing facilities.
> > >
> > > Then we can invent new things.
> >
> > So the approach this patchset takes is a single interface for
> > all state introspection. Some duplication of functionality
> > for the sake of consistency.
> > You don't like it fine but
> 
> The (partial) duplication should be fine as long as it doesn't have
> any issue, but I do see a lot of issues. So I can't say I like it.

Let's really focus on the issues instead of endless high-level
architecture discussions.

> > there is nothing obvious that it's a bad thing. It's a tradeoff.
> >
> > > > The logical
> > > > continuation of your reasoning would be to add state introspection
> > > > commands e.g. to cvq in virtio net and then use tricks like shadow vq to
> > > > issue these.
> > >
> > > For the device state, yes. Because it is device logic and it is not
> > > what platform or transport can know.
> >
> > Exactly as I thought. Don't think shadow VQ is something we
> > can reasonably make a single migration mechanism though.
> 
> That's my understanding as well. It's up to the hypervisor, spec needs
> to focus on the mechanism but not policy.
> 
> > It feels fragile and heavyweight. It's more of a work
> > around hardware limitations.
> >
> >
> > > > Yea, possible, but can we not go there please?  Nothing is
> > > > wrong with just building commands that do exactly what we want them to
> > > > do instead of trying to build a ship in a bottle.
> > >
> > > But it's not the case for others.
> > >
> > > E.g in Parav's proposal, it tries to rule P2P behaviour via virtio
> > > admin commands when there is a duplication
> >
> > Yes there's some duplication. Advantage is consistency.
> 
> Just to clarify, we do need things like suspend but I don't see
> dealing with PCI P2P in virtio as consistent (e.g we don't imply any
> PCI stuff in virtio reset).

I think I can clarify.  The consistency is that there's a single chapter
that deals with migration, and that all hypervisors will do exactly the
same thing instead of each hypervisor guessing how to do migration and
going its own way.




> >  I actually
> > suggested ways to reduce duplication, by using transport offsets as
> > tags.
> 
> I don't see a connection with P2P here. Or I may miss something.

I don't even know what P2P is in this context. Or why we are
discussing it. Is this going to be another distraction that
no one knows how it will work, just like mentioning TD randomly?

> > Finding a right balance means we all need to stop going to
> > extremes, I wish you and lingshan would stop trying to force everyone
> > to use registers and parav would stop trying to force dma.
> 
> There is a misunderstanding. Actually. I don't want to force
> registers.

You keep insisting on overlaying suspend functionality
over the existing transport. For pci that is going to be
in a register.

> Which kind of interface is the best way to go is really
> implementation specific. In some implementations, registers are cheap
> but not virtqueue, in others, virtqueue is cheap but not register.
> They are all fine. But we can't claim one proposal that is
> optimized for a specific implementation to be the best way. What I
> want to do is not limit the facilities that are used for live
> migration to any specific transport. I think Ling Shan agreed with me
> in this part. It needs to work on all interfaces regardless of
> registers, DMA, CMA or others. That's why I suggest we focus on what
> needs to be migrated first. That is define the following things
> 
> 1) way to suspend/resume a device
> 2) virtqueue states, indices or inflight ones
> 3) device states
> 
> It is just like how we define virtqueues/features/status and other
> basic facilities where we do not tie it to any specific interfaces
> like DMA, CMA, registers or admin commands. Virtio benefited from the
> flexibility like this in the past so why not stick to that?

Because migration is a complex enough topic that we simply know
from experience that
- things like error handling are needed
- passing big arrays around is needed

These just do not work reasonably well over registers, and
this is why admin commands were invented.
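
To make the two points concrete, here is a rough sketch of a
command/reply exchange that carries both a status code and a
variable-length payload. The header layouts, opcode and status values
below are invented for illustration and are NOT the virtio admin command
format:

```python
import struct

# Hypothetical wire layouts (little-endian), invented for this sketch:
CMD_HDR = "<HI"    # opcode (u16), payload length (u32)
REPLY_HDR = "<HI"  # status (u16), payload length (u32)

OP_GET_DEVICE_STATE = 1  # hypothetical opcode
STATUS_OK = 0            # hypothetical status codes
STATUS_EINVAL = 1


def encode_cmd(opcode, payload=b""):
    return struct.pack(CMD_HDR, opcode, len(payload)) + payload


def decode_cmd(buf):
    opcode, length = struct.unpack_from(CMD_HDR, buf, 0)
    start = struct.calcsize(CMD_HDR)
    return opcode, buf[start:start + length]


def encode_reply(status, payload=b""):
    return struct.pack(REPLY_HDR, status, len(payload)) + payload


def decode_reply(buf):
    status, length = struct.unpack_from(REPLY_HDR, buf, 0)
    start = struct.calcsize(REPLY_HDR)
    return status, buf[start:start + length]


def handle(buf, device_state):
    """Toy device side: return the whole state blob or an explicit error."""
    opcode, _payload = decode_cmd(buf)
    if opcode == OP_GET_DEVICE_STATE:
        return encode_reply(STATUS_OK, device_state)
    return encode_reply(STATUS_EINVAL)


blob = bytes(range(256)) * 16  # 4096 bytes of pretend device state
status, data = decode_reply(handle(encode_cmd(OP_GET_DEVICE_STATE), blob))
bad_status, bad_data = decode_reply(handle(encode_cmd(0xFFFF), blob))
```

A single reply returns the whole 4096-byte blob together with an
explicit status, and an unknown opcode yields an error code rather than
an ambiguous value. A fixed-width register has no natural slot for
either.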



> After this, each transport can choose to implement it in their own
> way. For example, if there's a proposal to use those via admin
> virtqueue that's fine.
> And registers also fine and other transports.
> That's why only the last patch of this series is dealing with the PCI
> specific part.

Maybe they are fine in theory. So far I haven't seen anything cohesive
that even comes functionally close to Parav's proposal of migration
over admin commands. There's no specific *reason* not to do that,
practically - I don't see why 


> 
> >
> > > and a layer violation.
> >
> > layers are only good if they make sense.
> 
> Case by case, for example, PCI PM has defined the state and the
> interaction with P2P. Reusing that seems much cleaner than inventing a
> mechanism in the virtio layer. Or if it needs a side channel, it needs
> to be invented in PCI not virtio.
> 
> Thanks

There's no *if* in my opinion - migration is way easier for hypervisors
to implement as a side channel so any device state can be migrated. And
this is uniform across transports. Is it harder or easier for hardware
to implement? We have a hardware vendor pushing a side channel approach
so it seems likely they know what is good for hardware? If it is somehow
nvidia specific - can we please have other vendors with actual
plans implementing this hardware (and note this has nothing to do with
VDPA - all this SUSPEND bit work is only useful for full offload) come
forward and say "we don't support this, we support that"?  Or is this
for some unnamed vendors not on the TC? Maybe they should join the TC to
influence the direction then.  Because right now it looks like software
guys telling hardware guys what is good for hardware and I don't see how
this makes any sense.



> >
> > > Thanks
> > >
> > >
> > > >
> > > > --
> > > > MST
> > > >
> >



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24 14:45                                       ` Michael S. Tsirkin
@ 2023-11-27  6:38                                         ` Jason Wang
  2023-11-27  8:27                                           ` Michael S. Tsirkin
  2023-11-27  9:54                                         ` Zhu, Lingshan
  1 sibling, 1 reply; 186+ messages in thread
From: Jason Wang @ 2023-11-27  6:38 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Fri, Nov 24, 2023 at 10:45 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Fri, Nov 24, 2023 at 09:01:53PM +0800, Jason Wang wrote:
> > On Fri, Nov 24, 2023 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Fri, Nov 24, 2023 at 07:50:42PM +0800, Jason Wang wrote:
> > > > > I frankly don't see how a bit which is completely non-orthogonal with
> > > > > device and driver state can be useful for debug.
> > > >
> > > > The bit is to make sure the state of a device doesn't change. It may
> > > > help or not just like if you want to pause a cpu/process during the
> > > > debug.
> > >
> > > Heh. But it would be much better to have an orthogonal state
> > > that driver can just set without worrying much about
> > > device being broken somehow.
> >
> > In that case we still need a precise definition of the state. So I
> > don't see the difference here.
>
> You don't?

I meant I don't see the difference if

1) it is defined in a new bit in current status
2) or it's an orthogonal state

> If you can inspect state without perturbing it, that is
> better for debugging than an intrusive interface that perturbs state.

I've stated that it is just like you can choose to use GDB or not. If
you don't like GDB, that's fine.

We proposed it as a basic facility. You asked why, and we explained it
might be used for debugging or other purposes; then you said you don't
want it for debugging. That's fine, but what's the standard here? For
example, even if it can only be used for migration, what's the issue?
FEATURES_OK can only be used during driver probing, but it is still
defined in the device status field, and nobody is asking for other use
cases.
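
To make the status-field point concrete, here is a minimal sketch in C of
the existing device status bits next to a SUSPEND bit. The six existing
values are the ones in the virtio specification; the macro names, the
SUSPEND value (16) and the precondition checked by can_suspend() are
assumptions made for illustration only and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device status bits as defined by the virtio specification.
 * The macro names are illustrative, not copied from any header. */
#define ST_ACKNOWLEDGE        1u
#define ST_DRIVER             2u
#define ST_DRIVER_OK          4u
#define ST_FEATURES_OK        8u
#define ST_DEVICE_NEEDS_RESET 64u
#define ST_FAILED             128u

/* ASSUMPTION: placeholder value for the proposed SUSPEND bit. */
#define ST_SUSPEND_ASSUMED    16u

/* Hypothetical rule: SUSPEND only makes sense on a live, healthy device.
 * This is the kind of coupling with the other bits discussed above. */
static bool can_suspend(uint8_t status)
{
    if (status & (ST_FAILED | ST_DEVICE_NEEDS_RESET))
        return false;
    return (status & ST_DRIVER_OK) != 0;
}

/* Return the status value with the (assumed) SUSPEND bit set. */
static uint8_t set_suspend(uint8_t status)
{
    assert(can_suspend(status));
    return (uint8_t)(status | ST_SUSPEND_ASSUMED);
}
```

The sketch shows both sides of the argument: the bit fits in the existing
field like FEATURES_OK does, but whether it may be set depends on the
other bits, whereas an orthogonal state would not need such a check.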

>
>
> > Anyhow, we can leave debugging aside.
> >
> > >
> > >
> > > > > For debug you want something that just always works.
> > > > > Not an interface that has so many requirements it will break if
> > > > > you look at it sidewise.
> > > > >
> > > > >
> > > > > > > >
> > > > > > > I don’t see wasting time here.
> > > > > > > If its debug, its debug.
> > > > > > > If its migration, it is migration.
> > > > > > > If its pm, its pm.
> > > > > >
> > > > > > Obviously not. Migration should leverage existing facilities as much
> > > > > > as possible instead of duplicating them.
> > > > >
> > > > > I don't think there's anything obvious here.  A lot of device state
> > > > > can't be easily accessed with existing facilities.
> > > >
> > > > Then we can invent new things.
> > >
> > > So the approach this patchset takes is a single interface for
> > > all state introspection. Some duplication of functionality
> > > for the sake of consistency.
> > > You don't like it fine but
> >
> > The (partial) duplication should be fine as long as it doesn't have
> > any issue, but I do see a lot of issues. So I can't say I like it.
>
> Let's focus on issue really instead of endless high level
> architecture discussions.

I don't understand. I've pointed out the issues in detail in the
individual patches. At least a small fraction of them have been
acked by Parav. If you think they are not real issues, please explain
in the individual threads.

>
> > > there is nothing obvious that it's a bad thing. It's a tradeoff.
> > >
> > > > > The logical
> > > > > continuation of your reasoning would be to add state introspection
> > > > > commands e.g. to cvq in virtio net and then use tricks like shadow vq to
> > > > > issue these.
> > > >
> > > > For the device state, yes. Because it is device logic and it is not
> > > > what platform or transport can know.
> > >
> > > Exactly as I thought. Don't think shadow VQ is something we
> > > can reasonably make a single migration mechanism though.
> >
> > That's my understanding as well. It's up to the hypervisor, spec needs
> > to focus on the mechanism but not policy.
> >
> > > It feels fragile and heavyweight. It's more of a work
> > > around hardware limitations.
> > >
> > >
> > > > > Yea, possible, but can we not go there please?  Nothing is
> > > > > wrong with just building commands that do exactly what we want them to
> > > > > do instead of trying to build a ship in a bottle.
> > > >
> > > > But it's not the case for others.
> > > >
> > > > E.g in Parav's proposal, it tries to rule P2P behaviour via virtio
> > > > admin commands when there is a duplication
> > >
> > > Yes there's some duplication. Advantage is consistency.
> >
> > Just to clarify, we do need things like suspend but I don't see
> > dealing with PCI P2P in virtio as consistent (e.g we don't imply any
> > PCI stuff in virtio reset).
>
> I think I can clarify.  The consistency is that there's a single chapter
> that deals with migration, also hypervisors will all do exactly the same
> instead of each hypervisor guessing how to do migration and going its
> own way.

Hypervisors already do migration in different ways. Virtio should not
get involved in the endless debate about which one is better. I think
we agreed on this; otherwise we will spend months going in circles.

>
>
>
>
> > >  I actually
> > > suggested ways to reduce duplication, by using transport offsets as
> > > tags.
> >
> > I don't see a connection with P2P here. Or I may miss something.
>
> I don't even know what P2P is in this context. Or why we are
> discussing it.

The context is the suspend function. Parav's proposal tries to regulate
P2P in freeze/stop, and it looks to me like that is already covered by
PCI PM.

> Is this going to be another distraction that
> no one knows how it will work, just like mentioning TD randomly?
>
> > > Finding a right balance means we all need to stop going to
> > > extremes, I wish you and lingshan would stop trying to force everyone
> > > to use registers and parav would stop trying to force dma.
> >
> > There is a misunderstanding. Actually. I don't want to force
> > registers.
>
> You keep insisting on overlaying suspend functionality
> over the existing transport.

That makes sure it can be reused for transports other than PCI. And it
doesn't prevent the coexistence of a side channel.

> For pci that is going to be
> in a register.

I don't understand. The register is already there. If you disagree,
please explain, especially why reset via the virtio device status can
work but suspend cannot.

>
> > Which kind of interface is the best way to go is really
> > implementation specific. In some implementations, registers are cheap
> > but not virtqueue, in others, virtqueue is cheap but not register.
> > They are all fine. But we can't claim one proposal that is
> > optimized for a specific implementation to be the best way. What I
> > want to do is not limit the facilities that are used for live
> > migration to any specific transport. I think Ling Shan agreed with me
> > in this part. It needs to work on all interfaces regardless of
> > registers, DMA, CMA or others. That's why I suggest we focus on what
> > needs to be migrated first. That is define the following things
> >
> > 1) way to suspend/resume a device
> > 2) virtqueue states, indices or inflight ones
> > 3) device states
> >
> > It is just like how we define virtqueues/features/status and other
> > basic facilities where we do not tie it to any specific interfaces
> > like DMA, CMA, registers or admin commands. Virtio benefited from the
> > flexibility like this in the past so why not stick to that?
>
> Because migration is a complex enough topic that we simply know
> from experience that
> - things like error handling are needed

Where is it in the Parav series?

> - passing big arrays around is needed

Not necessarily for

1) simple devices
2) devices that are using states in the memory
3) even between device and driver, there are vendors that offload the
migration to DPU

But again, I'm not saying passing big arrays doesn't make sense,
please do not misunderstand my point here.

> these just do not work reasonably well over registers and
> this is why admin commands were invented.

I'm lost here. The reason why we stick admin commands is to make sure
it can be used for register and I wouldn't argue this further. I'm
not saying admin commands don't make sense, only that it should not be
limited to them. We can't claim admin commands work in every case or
on every transport.
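
A minimal sketch in C of what "not limited to one interface" could look
like: the facility is a small set of operations, and each transport
(registers, admin commands, or something else) supplies its own backend.
last_avail_idx and last_used_idx come from the cover letter; every other
name here (struct migration_ops, the toy backend, migrate_vq_state) is
hypothetical and only illustrates the layering.

```c
#include <assert.h>
#include <stdint.h>

/* Per-virtqueue state named in the cover letter. */
struct vq_state {
    uint16_t last_avail_idx;
    uint16_t last_used_idx;
};

/* Hypothetical transport-neutral operations. A register-based or an
 * admin-command-based transport would each provide its own table. */
struct migration_ops {
    int (*suspend)(void *dev);
    int (*get_vq_state)(void *dev, uint16_t vq, struct vq_state *out);
    int (*set_vq_state)(void *dev, uint16_t vq, const struct vq_state *in);
};

/* Toy in-memory backend standing in for a real transport. */
#define TOY_NUM_VQS 2
struct toy_dev {
    int suspended;
    struct vq_state vqs[TOY_NUM_VQS];
};

static int toy_suspend(void *dev)
{
    ((struct toy_dev *)dev)->suspended = 1;
    return 0;
}

static int toy_get_vq_state(void *dev, uint16_t vq, struct vq_state *out)
{
    struct toy_dev *d = dev;
    /* Assumption: state is only stable, hence readable, once suspended. */
    if (vq >= TOY_NUM_VQS || !d->suspended)
        return -1;
    *out = d->vqs[vq];
    return 0;
}

static int toy_set_vq_state(void *dev, uint16_t vq, const struct vq_state *in)
{
    struct toy_dev *d = dev;
    if (vq >= TOY_NUM_VQS)
        return -1;
    d->vqs[vq] = *in;
    return 0;
}

static const struct migration_ops toy_ops = {
    toy_suspend, toy_get_vq_state, toy_set_vq_state,
};

/* Generic code: suspends the source and copies one virtqueue's state,
 * talking only to the ops tables and never to a specific transport. */
static int migrate_vq_state(const struct migration_ops *src_ops, void *src,
                            const struct migration_ops *dst_ops, void *dst,
                            uint16_t vq)
{
    struct vq_state st;

    assert(src_ops && dst_ops);
    if (src_ops->suspend(src) != 0)
        return -1;
    if (src_ops->get_vq_state(src, vq, &st) != 0)
        return -1;
    return dst_ops->set_vq_state(dst, vq, &st);
}
```

Source and destination take separate ops tables, so in this sketch a
register-backed device could be migrated to an admin-command-backed one
without the generic code changing.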

>
>
>
> > After this, each transport can choose to implement it in their own
> > way. For example, if there's a proposal to use those via admin
> > virtqueue that's fine.
> > And registers also fine and other transports.
> > That's why only the last patch of this series is dealing with the PCI
> > specific part.
>
> Maybe they are fine in theory.

Theoretically fine can be achieved just by relocating the text to a
better place. It's almost free. Do you want a draft?

> So far I didn't see anything cohesive
> that is close functionally being even to Parav's proposal of migration
> over admin commands. There's no specific *reason* not to do that,
> practically - I don't see why

I never said we cannot do it over admin commands. I said we should not
*only* do it over admin commands.

>
>
> >
> > >
> > > > and a layer violation.
> > >
> > > layers are only good if they make sense.
> >
> > Case by case, for example, PCI PM has defined the state and the
> > interaction with P2P. Reusing that seems much cleaner than inventing a
> > mechanism in the virtio layer. Or if it needs a side channel, it needs
> > to be invented in PCI not virtio.
> >
> > Thanks
>
> There's no *if* in my opinion - migration is way easier for hypervisors
> to implement as a side channel so any device state can be migrated.

The context is freezing PCI P2P, which is a transport-specific
mechanism, not device state. When there's an existing mechanism, it's
pretty natural to ask why it can't be used or extended.

I must repeat myself: it's perfectly fine to use the side
channel to save and restore device state, but it's not the only way.

> And
> this is uniform across transports.

I don't understand, for example in 1) MMIO 2) PCI without SR-IOV,
where could we put a side channel?

> Is it harder or easier for hardware
> to implement?

Well, easier for one vendor doesn't mean easier for the other vendors.

And how to synchronize between the transport and the side channel is
still unclear, to me at least. So far the only answer I've got is
something like "it is implementation specific", so I raised issues like
PM/D3/FLR/FRS during migration etc., for which I haven't got a good
answer so far. Or if you think the spec is OK without answering those
questions, that's fine, but please explain why.

> We have a hardware vendor pushing a side channel approach
> so it seems likely they know what is good for hardware?

If the standard is "a hardware vendor is pushing an approach, so it
seems likely they know what is good for their hardware", then it
should apply to every vendor instead of just a specific one.

So this calls for a generic design. For example, I know virtio has
been implemented in software, on DPUs and on FPGAs (and surely there
are others). I fully understand that a proposal from a specific vendor
can work well for them. But that doesn't mean it works well for others.
Do you agree? AFAIK, some early versions of IFCVF did virtio via FPGA,
and I guess NV did virtio in collaboration with software running on
the DPU. Different hardware architectures may end up with different
design considerations, and there's nothing wrong with that. But what
works better for an FPGA doesn't necessarily work better for a DPU. One
key to the future success of virtio is not to be designed just for a
specific type of hardware or vendor. That's why I disagree with
statements like "who bothers first who wins".

> If it is somehow
> nvidia specific - can we please have other vendors with actual
> plans implementing this hardware
> (and note this has nothing to do with
> VDPA - all this SUSPEND bit work is only useful for full offload) come
> forward and say "we don't support this, we support that"?

On one hand, you said "the first will win", and then Ling Shan told you
IFCVF has been used in production for years.
On the other hand, you want a 3rd vendor to comment, so the
standard seems to have shifted to "the majority will win", and you
don't even ask for a prototype there?

There's no description of what kind of prototyping was done in the
Parav series, for example what kind of implementation the prototype
uses (FPGA, ASIC, DPU, emulation or simulation). Reviewers can do
nothing but guess. You seem to think that's OK. Fine. But Ling Shan has
told us IFCVF has been used in production for years. Sticking to the
flexibility that the virtio spec already has is much better than waiting
for a vendor to come and say "the spec doesn't fit us, we won't go
for virtio". Again, such decoupling is not guaranteed to succeed, but
it's better than coupling.

What's more, I don't see why you think the proposals must be mutually
exclusive. I think both LingShan and Parav are OK with seeking ways to
unify them, or else a way for them to co-exist.

>  Or is this
> for some unnamed vendors not on the TC?

I don't get the point of the question.

The TC is good, but there are many reasons why vendors may not want to
join it. And even if they do join, there are many reasons why they may
decide to stay silent. I don't think silence means acquiescence.

> Maybe they should join the TC to
> influence the direction then.

Chicken-and-egg problem. Vendors may want to evaluate both the
architecture and the community before they invest in virtio.

>  Because right now it looks like software
> guys telling hardware guys what is good for hardware
> and I don't see how
> this makes any sense.

Can you explain why you think my comments don't make any sense, and
what's the reason for such a classification? Most of my comments were
given from the point of view of a hypervisor developer. They are
definitely not about what is good for hardware; they are about what is
good for software. Virtio neither works at the circuit level nor
defines any implementation, so it needs to hear from both software and
hardware engineers.

Thanks

>
>
>
> > >
> > > > Thanks
> > > >
> > > >
> > > > >
> > > > > --
> > > > > MST
> > > > >
> > >
>



^ permalink raw reply	[flat|nested] 186+ messages in thread

* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-27  6:38                                         ` Jason Wang
@ 2023-11-27  8:27                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 186+ messages in thread
From: Michael S. Tsirkin @ 2023-11-27  8:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Zhu, Lingshan, eperezma, cohuck, stefanha, virtio-comment

On Mon, Nov 27, 2023 at 02:38:23PM +0800, Jason Wang wrote:
> On Fri, Nov 24, 2023 at 10:45 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Fri, Nov 24, 2023 at 09:01:53PM +0800, Jason Wang wrote:
> > > On Fri, Nov 24, 2023 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Fri, Nov 24, 2023 at 07:50:42PM +0800, Jason Wang wrote:
> > > > > > I frankly don't see how a bit which is completely non-orthogonal with
> > > > > > device and driver state can be useful for debug.
> > > > >
> > > > > The bit is to make sure the state of a device doesn't change. It may
> > > > > help or not just like if you want to pause a cpu/process during the
> > > > > debug.
> > > >
> > > > Heh. But it would be much better to have an orthogonal state
> > > > that driver can just set without worrying much about
> > > > device being broken somehow.
> > >
> > > In that case we still need a precise definition of the state. So I
> > > don't see the difference here.
> >
> > You don't?
> 
> I meant I don't see the difference if
> 
> 1) it is defined in a new bit in current status
> 2) or it's an orthogonal state

Well the difference is that we don't need to keep track of current state
correctly. For example for debugging, if you can just push out commands
without worrying about what other status bits are, that is best.
Otherwise you need to hope that other status bits are right.


> > If you can inspect state without perturbing it, that is
> > better for debugging than an intrusive interface that perturbs state.
> 
> I've stated that it is just like you can choose to use GDB or not. If
> you don't like GDB, that's fine.

people can use better ways to debug, I think.

> We propose it in the basic facility, you are asking why, we explain it
> might be used for debugging or others, then you said you don't want it
> for debugging.

No, I said it's *not very useful* for debugging the way this patch builds it.


> It's fine but what's the standard here? For example,
> even if it can only be used for migration, what's the issue here?

I see 2 good ways to design things
1- powerful facility to solve all of migration in 1 place
2- modular components that can be useful for other things

both seem reasonable but if you claim you are doing 2 then
better show some other use to demonstrate design is good.

> FEATURES_OK can only be used during driver probes but it is still
> defined in the status part, and there's nobody asking for other use
> cases.

Our current status bits are designed for the device to keep track of
driver state. This is not that.


> >
> >
> > > Anyhow, we can leave debugging aside.
> > >
> > > >
> > > >
> > > > > > For debug you want something that just always works.
> > > > > > Not an interface that has so many requirements it will break if
> > > > > > you look at it sidewise.
> > > > > >
> > > > > >
> > > > > > > > >
> > > > > > > > I don’t see wasting time here.
> > > > > > > > If its debug, its debug.
> > > > > > > > If its migration, it is migration.
> > > > > > > > If its pm, its pm.
> > > > > > >
> > > > > > > Obviously not. Migration should leverage existing facilities as much
> > > > > > > as possible instead of duplicating them.
> > > > > >
> > > > > > I don't think there's anything obvious here.  A lot of device state
> > > > > > can't be easily accessed with existing facilities.
> > > > >
> > > > > Then we can invent new things.
> > > >
> > > > So the approach this patchset takes is a single interface for
> > > > all state introspection. Some duplication of functionality
> > > > for the sake of consistency.
> > > > You don't like it fine but
> > >
> > > The (partial) duplication should be fine as long as it doesn't have
> > > any issue, but I do see a lot of issues. So I can't say I like it.
> >
> > Let's focus on issue really instead of endless high level
> > architecture discussions.
> 
> I don't understand here. I've pointed out sufficient issues in detail
> in individual patches. At least a small fraction of issues have been
> acked by Parav. If you think it's not a real issue, please explain in
> the individual thread.
> 
> >
> > > > there is nothing obvious that it's a bad thing. It's a tradeoff.
> > > >
> > > > > > The logical
> > > > > > continuation of your reasoning would be to add state introspection
> > > > > > commands e.g. to cvq in virtio net and then use tricks like shadow vq to
> > > > > > issue these.
> > > > >
> > > > > For the device state, yes. Because it is device logic and it is not
> > > > > what platform or transport can know.
> > > >
> > > > Exactly as I thought. Don't think shadow VQ is something we
> > > > can reasonably make a single migration mechanism though.
> > >
> > > That's my understanding as well. It's up to the hypervisor, spec needs
> > > to focus on the mechanism but not policy.
> > >
> > > > It feels fragile and heavyweight. It's more of a work
> > > > around hardware limitations.
> > > >
> > > >
> > > > > > Yea, possible, but can we not go there please?  Nothing is
> > > > > > wrong with just building commands that do exactly what we want them to
> > > > > > do instead of trying to build a ship in a bottle.
> > > > >
> > > > > But it's not the case for others.
> > > > >
> > > > > E.g in Parav's proposal, it tries to rule P2P behaviour via virtio
> > > > > admin commands when there is a duplication
> > > >
> > > > Yes there's some duplication. Advantage is consistency.
> > >
> > > Just to clarify, we do need things like suspend but I don't see
> > > dealing with PCI P2P in virtio as consistent (e.g we don't imply any
> > > PCI stuff in virtio reset).
> >
> > I think I can clarify.  The consistency is that there's a single chapter
> > that deals with migration, also hypervisors will all do exactly the same
> > instead of each hypervisor guessing how to do migration and going its
> > own way.
> 
> Hypervisors have already done migration in different ways. Virtio
> should not be involved in the endless debating on which one is better,
> I think we have agreed on this, otherwise we spent months circling
> back.
> 
> >
> >
> >
> >
> > > >  I actually
> > > > suggested ways to reduce duplication, by using transport offsets as
> > > > tags.
> > >
> > > I don't see a connection with P2P here. Or I may miss something.
> >
> > I don't even know what P2P is in this context. Or why we are
> > discussing it.
> 
> The context is the suspend function. Parav tries to rule P2P in
> freeze/stop, and it looks to me it has been covered by PCI PM.
> 
> > Is this going to be another distraction that
> > no one knows how it will work, just like mentioning TD randomly?
> >
> > > > Finding a right balance means we all need to stop going to
> > > > extremes, I wish you and lingshan would stop trying to force everyone
> > > > to use registers and parav would stop trying to force dma.
> > >
> > > There is a misunderstanding. Actually. I don't want to force
> > > registers.
> >
> > You keep insisting on overlaying suspend functionality
> > over the existing transport.
> 
> It makes sure it could be reused for transport other than PCI. And it
> doesn't prevent the coexistence of a side channel.
> 
> > For pci that is going to be
> > in a register.
> 
> I don't understand. The register is already there. If you disagree,
> please explain, especially why reset at virtio in status can work but
> not suspend.
> 
> >
> > > Which kind of interface is the best way to go is really
> > > implementation specific. In some implementations, registers are cheap
> > > but not virtqueue, in others, virtqueue is cheap but not register.
> > > They are all fine. But we can't claim one proposal that is
> > > optimized for a specific implementation to be the best way. What I
> > > want to do is not limit the facilities that are used for live
> > > migration to any specific transport. I think Ling Shan agreed with me
> > > in this part. It needs to work on all interfaces regardless of
> > > registers, DMA, CMA or others. That's why I suggest we focus on what
> > > needs to be migrated first. That is define the following things
> > >
> > > 1) way to suspend/resume a device
> > > 2) virtqueue states, indices or inflight ones
> > > 3) device states
> > >
> > > It is just like how we define virtqueues/features/status and other
> > > basic facilities where we do not tie it to any specific interfaces
> > > like DMA, CMA, registers or admin commands. Virtio benefited from the
> > > flexibility like this in the past so why not stick to that?
> >
> > Because migration is a complex enough topic that we simply know
> > from experience that
> > - things like error handling are needed
> 
> Where is it in the Parav series?
> 
> > - passing big arrays around is needed
> 
> Not necessarily for
> 
> 1) simple devices
> 2) devices that are using states in the memory
> 3) even between device and driver, there are vendors that offload the
> migration to DPU
> 
> But again, I'm not saying passing big arrays doesn't make sense,
> please do not misunderstand my point here.
> 
> > these just do not work reasonably well over registers and
> > this is why admin commands were invented.
> 
> I'm lost here. The reason why we stick admin commands is to make sure
> it can be used for register and I wouldn't argue this further. I
> don't say admin commands don't make sense. I just say it should not be
> limited to that. We can't say admin commands can work in any case or
> transport.
> 
> >
> >
> >
> > > After this, each transport can choose to implement it in their own
> > > way. For example, if there's a proposal to use those via admin
> > > virtqueue that's fine.
> > > And registers also fine and other transports.
> > > That's why only the last patch of this series is dealing with the PCI
> > > specific part.
> >
> > Maybe they are fine in theory.
> 
> Theoretically fine can be done by just relocating texts to a better
> place. It's almost free. Do you want a draft?
> 
> > So far I didn't see anything cohesive
> > that is close functionally being even to Parav's proposal of migration
> > over admin commands. There's no specific *reason* not to do that,
> > practically - I don't see why
> 
> I never say we can not do it over admin commands. I say we should not
> *only* do it over admin commands.
> 
> >
> >
> > >
> > > >
> > > > > and a layer violation.
> > > >
> > > > layers are only good if they make sense.
> > >
> > > Case by case, for example, PCI PM has defined the state and the
> > > interaction with P2P. Reusing that seems much cleaner than inventing a
> > > mechanism in the virtio layer. Or if it needs a side channel, it needs
> > > to be invented in PCI not virtio.
> > >
> > > Thanks
> >
> > There's no *if* in my opinion - migration is way easier for hypervisors
> > to implement as a side channel so any device state can be migrated.
> 
> The context is to freeze PCI P2P which is a transport specific
> mechanism not device states. When there's an existing mechanism, it's
> pretty natural to ask why it can't be used or extended.
> 
> I must repeat myself again, It's perfectly fine to use the side
> channel to save and restore device state but it's not the only way.
> 
> > And
> > this is uniform across transports.
> 
> I don't understand, for example in 1) MMIO 2) PCI without SR-IOV,
> where could we put a side channel?
> 
> > Is it harder or easier for hardware
> > to implement?
> 
> Well, easier for one vendor doesn't mean easier for rest vendors.
> 
> And how to synchronize between the transport and the side channel is
> still unclear to me at least. So far the only thing I get is something
> like "it is implementation specific", so I raise the issue like
> PM/D3/FLR/FRS during migration etc where I haven't got a good answer
> so far. Or if you think spec is ok without answering those questions,
> that's fine but please explain why.
> 
> > We have a hardware vendor pushing a side channel approach
> > so it seems likely they know what is good for hardware?
> 
> If the above is a standard like "a hardware vendor pushing an
> approach, it seems likely they know what is good for their hardware",
> then it should apply to every vendor instead of just a specific one.
> 
> So this calls for a generic design. For example I know virtio that has
> been implemented in software, DPU or FPGA (for sure there should be
> others). I fully understand that a proposal from a specific vendor can
> work well for them. But it doesn't mean it works well for others. Do
> you agree? AFAIK, some early versions of IFCVF did virtio via FPGA,
> and I guess NV did virtio via collaboration with software running on
> DPU. Different hardware architecture may end up with different design
> considerations. And there's nothing wrong with them. But what works
> better for FPGA doesn't mean it works better in DPU. One key for the
> future success of virtio is to not be designed just for a specific
> type of hardware or vendor. That's why I disagree with your statement
> like "who bothers first who wins".
> 
> > If it is somehow
> > nvidia specific - can we please have other vendors with actual
> > plans implementing this hardware
> > (and note this has nothing to do with
> > VDPA - all this SUSPEND bit work is only useful for full offload) come
> > forward and say "we don't support this, we support that"?
> 
> On one hand, you said "the first will win", then Ling Shan told you
> IFCVF has been used in production for years.
> On the other hand, you want to have a 3rd vendor to comment, so the
> standard seems to shift to "the majority will win" and you don't even
> ask for a prototype there?
> 
> There's no description on what kind of prototyping is done in the
> Parav series, for example, what kind of implementation for the
> prototype (FPGA, ASIC, DPU, emulation or simulation). Reviewers can do
> nothing but guess. You seem to think it's ok. Fine. But Ling Shan has
> told us IFCVF has been used in production for years. Sticking to the
> flexibility that virtio spec already had is much better than waiting
> for a vendor to come to say "spec doesn't fit for us, we wouldn't go
> for virtio". Again, such decoupling is not guaranteed to succeed but
> it's better than coupling.
> 
> What's more, I don't see why you think the proposals must be mutually
> exclusive. I think both LingShan and Parav are ok for seeking ways for
> unification or then a way for co-existence.
> 
> >  Or is this
> > for some unnamed vendors not on the TC?
> 
> I don't get the point of the question.
> 
> TC is good but there are just too many reasons that they don't want to
> join TC. And even if they do, there are just too many reasons that
> vendors decide to be silent or not. But I don't think silence means
> acquiescence.
> 
> > Maybe they should join the TC to
> > influence the direction then.
> 
> Chicken-egg problem. Vendors may want to evaluate both the architecture
> and the community before they can invest in virtio.
> 
> >  Because right now it looks like software
> > guys telling hardware guys what is good for hardware
> > and I don't see how
> > this makes any sense.
> 
> Can you explain why you think my comment doesn't make any sense and
> what's the reason for such classification? Most of my comments were
> given from the view of a hypervisor developer. It's definitely not
> telling what is good for hardware, it's for what is good for software.
> Virtio neither works at the level of circuit nor defines any
> implementation, so it needs to hear from both software and hardware
> engineers.
> 
> Thanks
> 
> >
> >
> >
> > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > >
> > > > > > --
> > > > > > MST
> > > > > >
> > > >
> >





* Re: [virtio-comment] RE: [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND
  2023-11-24 14:45                                       ` Michael S. Tsirkin
  2023-11-27  6:38                                         ` Jason Wang
@ 2023-11-27  9:54                                         ` Zhu, Lingshan
  1 sibling, 0 replies; 186+ messages in thread
From: Zhu, Lingshan @ 2023-11-27  9:54 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Parav Pandit, eperezma, cohuck, stefanha, virtio-comment



On 11/24/2023 10:45 PM, Michael S. Tsirkin wrote:
> On Fri, Nov 24, 2023 at 09:01:53PM +0800, Jason Wang wrote:
>> On Fri, Nov 24, 2023 at 8:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Nov 24, 2023 at 07:50:42PM +0800, Jason Wang wrote:
>>>>> I frankly don't see how a bit which is completely non-orthogonal with
>>>>> device and driver state can be useful for debug.
>>>> The bit is to make sure the state of a device doesn't change. It may
>>>> help or not just like if you want to pause a cpu/process during the
>>>> debug.
>>> Heh. But it would be much better to have an orthogonal state
>>> that driver can just set without worrying much about
>>> device being broken somehow.
>> In that case we still need a precise definition of the state. So I
>> don't see the difference here.
> You don't? If you can inspect state without perturbing it, that is
> better for debugging than an intrusive interface that perturbs state.
>
>
>> Anyhow, we can leave debugging aside.
>>
>>>
>>>>> For debug you want something that just always works.
>>>>> Not an interface that has so many requirements it will break if
>>>>> you look at it sidewise.
>>>>>
>>>>>
>>>>>>> I don’t see wasting time here.
>>>>>>> If it's debug, it's debug.
>>>>>>> If it's migration, it is migration.
>>>>>>> If it's pm, it's pm.
>>>>>> Obviously not. Migration should leverage existing facilities as much
>>>>>> as possible instead of duplicating them.
>>>>> I don't think there's anything obvious here.  A lot of device state
>>>>> can't be easily accessed with existing facilities.
>>>> Then we can invent new things.
>>> So the approach this patchset takes is a single interface for
>>> all state introspection. Some duplication of functionality
>>> for the sake of consistency.
>>> You don't like it fine but
>> The (partial) duplication should be fine as long as it doesn't have
>> any issue, but I do see a lot of issues. So I can't say I like it.
> Let's really focus on the issues instead of endless high-level
> architecture discussions.
>
>>> there is nothing obvious that it's a bad thing. It's a tradeoff.
>>>
>>>>> The logical
>>>>> continuation of your reasoning would be to add state introspection
>>>>> commands e.g. to cvq in virtio net and then use tricks like shadow vq to
>>>>> issue these.
>>>> For the device state, yes. Because it is device logic and it is not
>>>> what platform or transport can know.
>>> Exactly as I thought. Don't think shadow VQ is something we
>>> can reasonably make a single migration mechanism though.
>> That's my understanding as well. It's up to the hypervisor, spec needs
>> to focus on the mechanism but not policy.
>>
>>> It feels fragile and heavyweight. It's more of a work
>>> around hardware limitations.
>>>
>>>
>>>>> Yea, possible, but can we not go there please?  Nothing is
>>>>> wrong with just building commands that do exactly what we want them to
>>>>> do instead of trying to build a ship in a bottle.
>>>> But it's not the case for others.
>>>>
>>>> E.g in Parav's proposal, it tries to rule P2P behaviour via virtio
>>>> admin commands when there is a duplication
>>> Yes there's some duplication. Advantage is consistency.
>> Just to clarify, we do need things like suspend but I don't see
>> dealing with PCI P2P in virtio as consistent (e.g we don't imply any
>> PCI stuff in virtio reset).
> I think I can clarify.  The consistency is that there's a single chapter
> that deals with migration, also hypervisors will all do exactly the same
> instead of each hypervisor guessing how to do migration and going its
> own way.
>
>
>
>
>>>   I actually
>>> suggested ways to reduce duplication, by using transport offsets as
>>> tags.
>> I don't see a connection with P2P here. Or I may miss something.
> I don't even know what P2P is in this context. Or why we are
> discussing it. Is this going to be another distraction that
> no one knows how it will work, just like mentioning TD randomly?
>
>>> Finding a right balance means we all need to stop going to
>>> extremes, I wish you and lingshan would stop trying to force everyone
>>> to use registers and parav would stop trying to force dma.
>> There is a misunderstanding. Actually. I don't want to force
>> registers.
> You keep insisting on overlaying suspend functionality
> over the existing transport. For pci that is going to be
> in a register.
>
>> Which kind of interface is the best way to go is really
>> implementation specific. In some implementations, registers are cheap
>> but not virtqueue, in others, virtqueue is cheap but not register.
>> They are all fine. But we can't claim one proposal that is
>> optimized for a specific implementation to be the best way. What I
>> want to do is not limit the facilities that are used for live
>> migration to any specific transport. I think Ling Shan agreed with me
>> in this part. It needs to work on all interfaces regardless of
>> registers, DMA, CMA or others. That's why I suggest we focus on what
>> needs to be migrated first. That is define the following things
>>
>> 1) way to suspend/resume a device
>> 2) virtqueue states, indices or inflight ones
>> 3) device states
>>
>> It is just like how we define virtqueues/features/status and other
>> basic facilities where we do not tie it to any specific interfaces
>> like DMA, CMA, registers or admin commands. Virtio benefited from the
>> flexibility like this in the past so why not stick to that?
> Because migration is a complex enough topic that we simply know
> from experience that
> - things like error handling are needed
> - passing big arrays around is needed
> these just do not work reasonably well over registers and
> this is why admin commands were invented.
>
>
>
>> After this, each transport can choose to implement it in their own
>> way. For example, if there's a proposal to use those via admin
>> virtqueue that's fine.
>> And registers also fine and other transports.
>> That's why only the last patch of this series is dealing with the PCI
>> specific part.
> Maybe they are fine in theory. So far I didn't see anything cohesive
> that is even functionally close to Parav's proposal of migration
> over admin commands. There's no specific *reason* not to do that,
> practically - I don't see why
>
>
>>>> and a layer violation.
>>> layers are only good if they make sense.
>> Case by case, for example, PCI PM has defined the state and the
>> interaction with P2P. Reusing that seems much cleaner than inventing a
>> mechanism in the virtio layer. Or if it needs a side channel, it needs
>> to be invented in PCI not virtio.
>>
>> Thanks
> There's no *if* in my opinion - migration is way easier for hypervisors
> to implement as a side channel so any device state can be migrated. And
> this is uniform across transports. Is it harder or easier for hardware
> to implement? We have a hardware vendor pushing a side channel approach
> so it seems likely they know what is good for hardware? If it is somehow
> nvidia specific - can we please have other vendors with actual
> plans implementing this hardware (and note this has nothing to do with
> VDPA - all this SUSPEND bit work is only useful for full offload) come
> forward and say "we don't support this, we support that"?  Or is this
> for some unnamed vendors not on the TC? Maybe they should join the TC to
> influence the direction then.  Because right now it looks like software
> guys telling hardware guys what is good for hardware and I don't see how
> this makes any sense.
I guess most of the TC members are SW experts, and virtio is not only for HW:
SW-emulated virtio has worked for many years.

This spec defines behaviors of virtio devices, not RTL, not layout.

So I assume SW experts are qualified to comment here.
>
>
>
>>>> Thanks
>>>>
>>>>
>>>>> --
>>>>> MST
>>>>>





end of thread, other threads:[~2023-11-27  9:55 UTC | newest]

Thread overview: 186+ messages (download: mbox.gz / follow: Atom feed)
2023-11-03 10:34 [virtio-comment] [PATCH V2 0/6] introduce basic facilities for virito live migration Zhu Lingshan
2023-11-03 10:34 ` [virtio-comment] [PATCH V2 1/6] virtio: introduce virtqueue state Zhu Lingshan
2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
2023-11-03 14:39     ` [virtio-comment] " Zhu, Lingshan
2023-11-03 11:52   ` Michael S. Tsirkin
2023-11-03 14:49     ` Zhu, Lingshan
2023-11-06  9:35       ` Michael S. Tsirkin
2023-11-06  9:42         ` Zhu, Lingshan
2023-11-06  9:45           ` Michael S. Tsirkin
2023-11-07  8:11             ` Zhu, Lingshan
2023-11-07  8:22               ` Michael S. Tsirkin
2023-11-08  4:08                 ` Zhu, Lingshan
2023-11-03 10:34 ` [virtio-comment] [PATCH V2 2/6] virtio: introduce SUSPEND bit in device status Zhu Lingshan
2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
2023-11-03 14:55     ` [virtio-comment] " Zhu, Lingshan
2023-11-03 15:54       ` [virtio-comment] " Parav Pandit
2023-11-06  3:29         ` [virtio-comment] " Zhu, Lingshan
2023-11-06  4:07           ` [virtio-comment] " Parav Pandit
2023-11-06  9:21             ` Zhu, Lingshan
2023-11-06 10:52               ` Parav Pandit
2023-11-07  8:21                 ` Zhu, Lingshan
2023-11-07  8:33                   ` Michael S. Tsirkin
2023-11-07  9:24                     ` Zhu, Lingshan
2023-11-08  7:42                       ` Michael S. Tsirkin
2023-11-06  9:43   ` [virtio-comment] " Michael S. Tsirkin
2023-11-07  9:09     ` Zhu, Lingshan
2023-11-08 17:55       ` Michael S. Tsirkin
2023-11-09  9:55         ` Zhu, Lingshan
2023-11-03 10:34 ` [virtio-comment] [PATCH V2 3/6] virtio: dont reset vqs when SUSPEND Zhu Lingshan
2023-11-06  9:49   ` [virtio-comment] " Michael S. Tsirkin
2023-11-07  9:27     ` Zhu, Lingshan
2023-11-08 17:46       ` Michael S. Tsirkin
2023-11-09  9:58         ` Zhu, Lingshan
2023-11-09 10:15           ` [virtio-comment] " Parav Pandit
2023-11-10  6:22             ` [virtio-comment] " Zhu, Lingshan
2023-11-10  6:31               ` [virtio-comment] " Parav Pandit
2023-11-13  9:23                 ` Zhu, Lingshan
2023-11-15 17:35                   ` Parav Pandit
2023-11-16 10:09                     ` Zhu, Lingshan
2023-11-16 10:19                       ` Parav Pandit
2023-11-16 12:09                       ` Michael S. Tsirkin
2023-11-17 10:13                         ` Zhu, Lingshan
2023-11-17 11:04                           ` Michael S. Tsirkin
2023-11-22  1:41                             ` Zhu, Lingshan
2023-11-22  7:30                               ` Michael S. Tsirkin
2023-11-13  3:34             ` [virtio-comment] " Jason Wang
2023-11-15 17:39               ` [virtio-comment] " Parav Pandit
2023-11-16  4:19                 ` Jason Wang
2023-11-16  5:27                   ` Parav Pandit
2023-11-16 10:12                     ` Zhu, Lingshan
2023-11-21  7:33                     ` Jason Wang
2023-11-21 16:32                       ` Parav Pandit
2023-11-22  5:28                         ` Jason Wang
2023-11-22  6:11                           ` Parav Pandit
2023-11-24  3:35                             ` Jason Wang
2023-11-24  9:04                               ` Michael S. Tsirkin
2023-11-24 11:50                                 ` Jason Wang
2023-11-24 12:17                                   ` Michael S. Tsirkin
2023-11-24 13:01                                     ` Jason Wang
2023-11-24 14:45                                       ` Michael S. Tsirkin
2023-11-27  6:38                                         ` Jason Wang
2023-11-27  8:27                                           ` Michael S. Tsirkin
2023-11-27  9:54                                         ` Zhu, Lingshan
2023-11-21 21:18                       ` Michael S. Tsirkin
2023-11-22  1:51                         ` Zhu, Lingshan
2023-11-22  6:47                           ` Parav Pandit
2023-11-22 10:04                             ` Zhu, Lingshan
2023-11-22 10:14                               ` Parav Pandit
2023-11-22  6:49                           ` Michael S. Tsirkin
2023-11-22 10:03                             ` Zhu, Lingshan
2023-11-22 13:37                               ` Michael S. Tsirkin
2023-11-22  5:28                         ` Jason Wang
2023-11-22  6:32                           ` Parav Pandit
2023-11-24  3:25                             ` Jason Wang
2023-11-24  6:20                               ` Michael S. Tsirkin
2023-11-24  6:28                                 ` Jason Wang
2023-11-24  6:43                                   ` Zhu, Lingshan
2023-11-24  8:50                                   ` Michael S. Tsirkin
2023-11-24 11:51                                     ` Jason Wang
2023-11-03 10:34 ` [virtio-comment] [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE Zhu Lingshan
2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
2023-11-03 14:57     ` [virtio-comment] " Zhu, Lingshan
2023-11-03 15:50       ` Parav Pandit
2023-11-06  3:31         ` Zhu, Lingshan
2023-11-06  4:12           ` Parav Pandit
2023-11-06  9:27             ` Zhu, Lingshan
2023-11-06 10:52               ` Parav Pandit
2023-11-07  9:31                 ` Zhu, Lingshan
2023-11-08 17:44                   ` Michael S. Tsirkin
2023-11-09 10:00                     ` Zhu, Lingshan
2023-11-09 10:02                       ` Michael S. Tsirkin
2023-11-10  6:52                         ` Zhu, Lingshan
2023-11-10 12:31                           ` Parav Pandit
2023-11-13  3:46                             ` Jason Wang
2023-11-13  9:23                               ` Zhu, Lingshan
2023-11-15 17:36                               ` Parav Pandit
2023-11-09  6:28                   ` Parav Pandit
2023-11-09  8:41                     ` Michael S. Tsirkin
2023-11-09  9:10                       ` Parav Pandit
2023-11-09  9:53                         ` Michael S. Tsirkin
2023-11-09 10:11                           ` Parav Pandit
2023-11-09 10:09                     ` Zhu, Lingshan
2023-11-09 10:25                       ` Parav Pandit
2023-11-10  7:52                         ` Zhu, Lingshan
2023-11-10 12:31                           ` Parav Pandit
2023-11-13  9:25                             ` Zhu, Lingshan
2023-11-15 17:35                               ` Parav Pandit
2023-11-16 10:14                                 ` Zhu, Lingshan
2023-11-16 10:21                                   ` Parav Pandit
2023-11-17 10:02                                     ` Zhu, Lingshan
2023-11-17 10:06                                       ` Parav Pandit
2023-11-21  4:30                                         ` Jason Wang
2023-11-21 16:26                                           ` Parav Pandit
2023-11-22  4:15                                             ` Jason Wang
2023-11-22  7:15                                               ` Michael S. Tsirkin
2023-11-22  7:33                                                 ` Parav Pandit
2023-11-22 14:43                                                   ` Michael S. Tsirkin
2023-11-17 10:45                                       ` Michael S. Tsirkin
2023-11-22  1:32                                         ` Zhu, Lingshan
2023-11-22  6:53                                           ` Michael S. Tsirkin
2023-11-08 17:56   ` Michael S. Tsirkin
2023-11-13  9:29     ` Zhu, Lingshan
2023-11-13 10:10       ` Michael S. Tsirkin
2023-11-03 10:34 ` [virtio-comment] [PATCH V2 5/6] virtio: introduce dirty page tracking facility Zhu Lingshan
2023-11-03 11:35   ` [virtio-comment] " Parav Pandit
2023-11-03 14:11     ` [virtio-comment] " Zhu, Lingshan
2023-11-03 10:34 ` [virtio-comment] [PATCH V2 6/6] virtio-pci: implement dirty page tracking Zhu Lingshan
2023-11-03 10:46   ` [virtio-comment] " Michael S. Tsirkin
2023-11-03 14:21     ` Zhu, Lingshan
2023-11-06  9:16       ` Zhu, Lingshan
2023-11-06 10:15         ` Michael S. Tsirkin
2023-11-07  9:43           ` Zhu, Lingshan
2023-11-07 10:43             ` Michael S. Tsirkin
2023-11-03 10:50   ` Michael S. Tsirkin
2023-11-03 11:35     ` [virtio-comment] " Parav Pandit
2023-11-03 15:02       ` [virtio-comment] " Zhu, Lingshan
2023-11-03 15:47         ` [virtio-comment] " Parav Pandit
2023-11-05 16:12           ` [virtio-comment] " Michael S. Tsirkin
2023-11-06  3:58             ` Zhu, Lingshan
2023-11-06 10:33               ` Michael S. Tsirkin
2023-11-07  9:48                 ` Zhu, Lingshan
2023-11-06  4:03             ` [virtio-comment] " Parav Pandit
2023-11-07 11:13               ` [virtio-comment] " Michael S. Tsirkin
2023-11-08  9:29                 ` Zhu, Lingshan
2023-11-08 17:18                   ` Michael S. Tsirkin
2023-11-09 10:29                     ` Zhu, Lingshan
2023-11-09 10:41                       ` Michael S. Tsirkin
2023-11-10  7:24                         ` Zhu, Lingshan
2023-11-06  3:52           ` Zhu, Lingshan
2023-11-06  4:34             ` [virtio-comment] " Parav Pandit
2023-11-06  9:34               ` [virtio-comment] " Zhu, Lingshan
2023-11-06 10:52                 ` [virtio-comment] " Parav Pandit
2023-11-06 11:05                   ` [virtio-comment] " Michael S. Tsirkin
2023-11-06 11:07                     ` [virtio-comment] " Parav Pandit
2023-11-06 11:21                       ` [virtio-comment] " Michael S. Tsirkin
2023-11-07  9:52                   ` Zhu, Lingshan
2023-11-07 11:33                     ` Michael S. Tsirkin
2023-11-08  9:30                       ` Zhu, Lingshan
2023-11-08 17:19                         ` Michael S. Tsirkin
2023-11-09 10:34                           ` Zhu, Lingshan
2023-11-06 11:13                 ` [virtio-comment] " Parav Pandit
2023-11-07 10:01                   ` [virtio-comment] " Zhu, Lingshan
2023-11-07 10:25                     ` Michael S. Tsirkin
2023-11-07 11:12                       ` [virtio-comment] " Parav Pandit
2023-11-07 11:24                         ` Parav Pandit
2023-11-08  7:11                           ` [virtio-comment] " Jason Wang
2023-11-08  7:16                             ` [virtio-comment] " Parav Pandit
2023-11-07 11:31                         ` [virtio-comment] " Michael S. Tsirkin
2023-11-08  9:36                       ` Zhu, Lingshan
2023-11-07 12:00                     ` Michael S. Tsirkin
2023-11-06 10:29               ` Michael S. Tsirkin
2023-11-06 11:21                 ` [virtio-comment] " Parav Pandit
2023-11-06 11:27                   ` [virtio-comment] " Michael S. Tsirkin
2023-11-06 11:31                     ` [virtio-comment] " Parav Pandit
2023-11-07 10:02                   ` [virtio-comment] " Zhu, Lingshan
2023-11-07 11:36                     ` Michael S. Tsirkin
2023-11-05 16:20       ` Michael S. Tsirkin
2023-11-06  3:51         ` [virtio-comment] " Parav Pandit
2023-11-03 14:32     ` [virtio-comment] " Zhu, Lingshan
2023-11-05 16:16       ` Michael S. Tsirkin
2023-11-06  4:06         ` Zhu, Lingshan
2023-11-06 10:22           ` Michael S. Tsirkin
2023-11-07 10:44             ` Zhu, Lingshan
2023-11-07 11:29               ` Michael S. Tsirkin
2023-11-07  8:01 ` [virtio-comment] Re: [PATCH V2 0/6] introduce basic facilities for virito live migration Michael S. Tsirkin
2023-11-08 10:19   ` Zhu, Lingshan
